[
https://issues.apache.org/jira/browse/SPARK-36967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-36967:
------------------------------------
Assignee: Apache Spark
> Report accurate block size threshold per reduce task
> ----------------------------------------------------
>
> Key: SPARK-36967
> URL: https://issues.apache.org/jira/browse/SPARK-36967
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Wan Kun
> Assignee: Apache Spark
> Priority: Major
> Attachments: map_status.png, map_status2.png
>
>
> Currently, a map task reports an accurate shuffle block size only if the block
> is larger than "spark.shuffle.accurateBlockThreshold" (100M by default). But
> if there are a large number of map tasks and the shuffle blocks of these
> tasks are smaller than "spark.shuffle.accurateBlockThreshold", data skew may
> go unrecognized.
> For example, suppose there are 10000 map tasks and 10000 reduce tasks, and
> each map task creates a 50M shuffle block for reduce 0 and 10K shuffle blocks
> for the remaining reduce tasks. Reduce 0 is skewed, but the statistics of this
> plan do not contain this information.
> !map_status2.png!
> I think we need to decide at runtime whether a shuffle block is huge and
> needs to be reported accurately.
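
The example in the description can be checked with a small calculation. The sketch below is a simplified model, not Spark's actual map status code: it assumes that every block not larger than the threshold is reported as the per-map-task average of such blocks, and it ignores any size compression. All function and variable names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Default of spark.shuffle.accurateBlockThreshold, as stated in the issue.
ACCURATE_BLOCK_THRESHOLD = 100 * MIB

NUM_MAPS = 10_000
NUM_REDUCES = 10_000
SKEWED_BLOCK = 50 * MIB   # block each map task writes for reduce 0
NORMAL_BLOCK = 10 * KIB   # block each map task writes for every other reduce


def reported_sizes(block_sizes, threshold):
    """Simplified model (assumption): blocks larger than the threshold keep
    their exact size; all other blocks are replaced by their integer average."""
    small = [s for s in block_sizes if s <= threshold]
    avg = sum(small) // len(small) if small else 0
    return [s if s > threshold else avg for s in block_sizes]


# Every map task produces the same layout, so modelling one is enough.
one_map = [SKEWED_BLOCK] + [NORMAL_BLOCK] * (NUM_REDUCES - 1)
reported = reported_sizes(one_map, ACCURATE_BLOCK_THRESHOLD)

actual_reduce0 = NUM_MAPS * one_map[0]
estimated_reduce0 = NUM_MAPS * reported[0]
actual_other = NUM_MAPS * one_map[1]
estimated_other = NUM_MAPS * reported[1]

print(f"reduce 0: actual {actual_reduce0 / GIB:.1f} GiB, "
      f"estimated {estimated_reduce0 / MIB:.1f} MiB")
print(f"reduce 1: actual {actual_other / MIB:.1f} MiB, "
      f"estimated {estimated_other / MIB:.1f} MiB")
```

Under these assumptions reduce 0 actually reads roughly 488 GiB, but its estimated input is roughly 148 MiB, the same as every other reduce task, because no single 50M block crosses the 100M threshold.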