ulysses-you commented on a change in pull request #33310:
URL: https://github.com/apache/spark/pull/33310#discussion_r673612589
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala
##########
@@ -181,6 +187,9 @@ class ShuffledRowRDD(
case PartialMapperPartitionSpec(mapIndex, _, _) =>
tracker.getMapLocation(dependency, mapIndex, mapIndex + 1)
+
+ case CoalescedMapperPartitionSpec(startMapIndex, endMapIndex, numReducers) =>
+ tracker.getMapLocation(dependency, startMapIndex, endMapIndex)
Review comment:
I see this can reduce the partition number, but I'm not sure this approach brings any performance benefit. The original idea of `OptimizeLocalShuffleReader` is to make the reducer task run on the same executor as its target mapper, so that it can reduce some network IO.
I think ordering by partition size only solves the issue partially. If we want to coalesce mappers, shall we check that the coalesced mappers are on the same executor or node?
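Just a rough sketch of what I mean, outside this PR: `CoalescedMapperLocality`, `commonPreferredLocations`, and the `executor_host_execId` location format below are only assumptions for illustration (loosely mirroring what `ExecutorCacheTaskLocation#toString` produces), not the actual API used here.

```scala
// Sketch: only keep a preferred location for a coalesced map range when all
// mappers in that range sit on the same executor (or at least the same host),
// so the local-read benefit of OptimizeLocalShuffleReader is preserved.
object CoalescedMapperLocality {

  // Assumes location strings look like "executor_host_execId"; this is only
  // an assumption for the sketch.
  private def hostOf(location: String): String = {
    val parts = location.split("_")
    if (parts.length >= 2) parts(1) else location
  }

  /** Common locations if every mapper reports the same executor, falling back
    * to a common host; otherwise no locality preference at all. */
  def commonPreferredLocations(mapperLocations: Seq[String]): Seq[String] = {
    val hosts = mapperLocations.map(hostOf).distinct
    if (mapperLocations.isEmpty) {
      Nil
    } else if (mapperLocations.distinct.size == 1) {
      mapperLocations.distinct           // all mappers on one executor
    } else if (hosts.size == 1) {
      Seq(hosts.head)                    // same node, different executors
    } else {
      Nil                                // spread out: reads still cross the network
    }
  }

  def main(args: Array[String]): Unit = {
    println(commonPreferredLocations(Seq("executor_host1_1", "executor_host1_1")))
    println(commonPreferredLocations(Seq("executor_host1_1", "executor_host1_2")))
    println(commonPreferredLocations(Seq("executor_host1_1", "executor_host2_3")))
  }
}
```

With something along these lines, a coalesced map range that spans executors/nodes would simply report no preferred location instead of pretending the read is local.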
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]