[ https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mridul Muralidharan reassigned SPARK-46512:
-------------------------------------------
Assignee: Chenyu Zheng
> Optimize shuffle reading when both sort and combine are used.
> -------------------------------------------------------------
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 4.0.0
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Minor
> Labels: pull-request-available
>
> After the shuffle reader fetches the blocks, it first performs the combine
> operation and then performs the sort operation. Both combine and sort may
> spill temporary files to disk, so performance can be poor when both are
> used. The combine can instead be performed during the sort, which avoids
> the combine spill files.
>
> I did not find a direct API for constructing a shuffle that uses both sort
> and combine, but the following code achieves it. It is a word count whose
> output words are sorted.
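> {code:java}
> import org.apache.spark.rdd.ShuffledRDD
>
> // reduceByKey supplies the combine step; setKeyOrdering adds a sort on
> // the reduce side of the same shuffle.
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
>   reduceByKey(_ + _, 1).
>   asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
>   collect().foreach(println)
> {code}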
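
To illustrate the idea in the description, here is a minimal in-memory sketch in plain Scala (no Spark dependency). It is not Spark's implementation and does not model spilling; the object and method names (`CombineDuringSort`, `combineThenSort`, `combineWhileSorting`) are hypothetical. The first method builds a separate combine structure and then sorts it, while the second merges values as they are inserted into a sorted structure, so no intermediate combine structure exists.

```scala
import scala.collection.mutable

// Illustration only: hypothetical names, in-memory data, no spill handling.
object CombineDuringSort {

  // Two-phase approach: combine into a hash map first, then sort the result.
  def combineThenSort(records: Iterator[(String, Int)]): Seq[(String, Int)] = {
    val combined = mutable.HashMap.empty[String, Int]
    records.foreach { case (k, v) => combined(k) = combined.getOrElse(k, 0) + v }
    combined.toSeq.sortBy(_._1)
  }

  // Single-phase approach: merge values while inserting into a sorted map,
  // so there is no separate combine structure to build (or spill).
  def combineWhileSorting(records: Iterator[(String, Int)]): Seq[(String, Int)] = {
    val sorted = mutable.TreeMap.empty[String, Int]
    records.foreach { case (k, v) => sorted(k) = sorted.getOrElse(k, 0) + v }
    sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("b", "a", "b", "c", "a", "b").map(w => (w, 1))
    println(combineThenSort(words.iterator))
    println(combineWhileSorting(words.iterator))
  }
}
```

Both methods return the same key-ordered counts. The difference is only in how many intermediate structures are materialized, which is the cost the issue aims to reduce on the reduce side.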
--
This message was sent by Atlassian Jira
(v8.20.10#820010)