Wan Kun created SPARK-36967:
-------------------------------

             Summary: Update accurate block size threshold per reduce task
                 Key: SPARK-36967
                 URL: https://issues.apache.org/jira/browse/SPARK-36967
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Wan Kun
         Attachments: map_status.png

Currently, a map task reports an accurate shuffle block size only if the block 
is larger than "spark.shuffle.accurateBlockThreshold" (100M by default). But if 
there are many map tasks and the shuffle blocks of each task are all smaller 
than "spark.shuffle.accurateBlockThreshold", data skew may exist without being 
recognized.


For example, suppose there are 10000 map tasks and 10000 reduce tasks, and each 
map task writes 50M for reduce 0 and 10K for each of the remaining reduce 
tasks. Reduce 0 is skewed, but the statistics of this plan do not reflect it.
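
The arithmetic behind this example can be sketched as follows. This is a 
simplified model, not Spark code: it assumes that every non-empty block below 
the threshold is reported as the per-map average of such blocks, and it reads 
"M" and "K" as MiB and KiB. The function and variable names are made up for 
illustration.

```python
MIB = 1024 * 1024
KIB = 1024

ACCURATE_BLOCK_THRESHOLD = 100 * MIB  # default quoted in this issue
NUM_MAPS = 10000
NUM_REDUCES = 10000


def reported_sizes(block_sizes, threshold=ACCURATE_BLOCK_THRESHOLD):
    """Hypothetical model of compressed size reporting for one map task.

    Blocks at or above the threshold keep their real size; every other
    non-empty block is replaced by the average of the non-empty blocks
    that fall below the threshold.
    """
    small = [s for s in block_sizes if 0 < s < threshold]
    avg = sum(small) // len(small) if small else 0
    return [s if s >= threshold else (avg if s > 0 else 0) for s in block_sizes]


# One map task: 50M for reduce 0, 10K for each remaining reduce task.
one_map = [50 * MIB] + [10 * KIB] * (NUM_REDUCES - 1)
reported = reported_sizes(one_map)

actual_reduce0 = NUM_MAPS * one_map[0]       # real input of reduce 0
estimated_reduce0 = NUM_MAPS * reported[0]   # what the statistics would show

print("actual bytes for reduce 0:   ", actual_reduce0)
print("estimated bytes for reduce 0:", estimated_reduce0)
```

Under these assumptions, no block reaches the 100M threshold, so the 50M block 
is averaged together with the 10K blocks and reduce 0 looks about the same 
size as every other reduce task.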



I think we need to judge at runtime whether a shuffle block is huge and needs 
to be reported accurately.
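
One possible shape for such a runtime check is sketched below. It is only an 
illustration under stated assumptions, not a proposed patch: a block is 
treated as huge when it reaches either the absolute threshold or a multiple 
of the median block size of the same map output. The skew factor and all 
names here are hypothetical.

```python
from statistics import median

ACCURATE_BLOCK_THRESHOLD = 100 * 1024 * 1024  # default quoted in this issue
SKEW_FACTOR = 5  # hypothetical multiplier, chosen only for this sketch


def blocks_to_report_accurately(block_sizes,
                                threshold=ACCURATE_BLOCK_THRESHOLD,
                                skew_factor=SKEW_FACTOR):
    """Return {reduce_id: size} for blocks that should keep their real size.

    The cutoff is the smaller of the absolute threshold and
    median(non-empty block sizes) * skew_factor, so a block that is large
    relative to its siblings is kept even if it is below the threshold.
    """
    non_empty = [s for s in block_sizes if s > 0]
    if not non_empty:
        return {}
    cutoff = min(threshold, median(non_empty) * skew_factor)
    return {i: s for i, s in enumerate(block_sizes) if s > 0 and s >= cutoff}


# One map task from the example: 50M for reduce 0, 10K for the rest.
skewed_map = [50 * 1024 * 1024] + [10 * 1024] * 9999
uniform_map = [10 * 1024] * 10000

print(blocks_to_report_accurately(skewed_map))
print(blocks_to_report_accurately(uniform_map))
```

With the example data, only the block for reduce 0 exceeds the relative 
cutoff, while a map output with uniform block sizes reports nothing extra.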



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
