mridulm commented on a change in pull request #34156:
URL: https://github.com/apache/spark/pull/34156#discussion_r722826200
##########
File path: core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala
##########
@@ -734,4 +738,112 @@ class MapOutputTrackerSuite extends SparkFunSuite with
LocalSparkContext {
tracker.stop()
}
}
+
+ test("SPARK-36892: Batch fetch should be enabled in some scenarios with push
based shuffle") {
+ conf.set(PUSH_BASED_SHUFFLE_ENABLED, true)
+ conf.set(IS_TESTING, true)
Review comment:
Can you rebase onto the latest master (to include Minchu's patch)?
Specifically, we also need `conf.set(SERIALIZER,
"org.apache.spark.serializer.KryoSerializer")` here to enable push-based
shuffle.
(The same applies to the other tests below.)
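For reference, a minimal sketch of what the setup at the top of the test might look like after the change. It assumes the suite's existing `conf` and that the `SERIALIZER` config entry is already imported alongside `PUSH_BASED_SHUFFLE_ENABLED` and `IS_TESTING`. The first two lines are taken from the diff above and the third is the requested addition:

```scala
// Fragment for the body of the new test; `conf` is the suite's SparkConf.
conf.set(PUSH_BASED_SHUFFLE_ENABLED, true)
conf.set(IS_TESTING, true)
// Requested addition: set the Kryo serializer so push-based shuffle is enabled.
conf.set(SERIALIZER, "org.apache.spark.serializer.KryoSerializer")
```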
##########
File path: docs/configuration.md
##########
@@ -3166,7 +3166,7 @@ See the `RDD.withResources` and `ResourceProfileBuilder`
API's for using this fe
# Push-based shuffle overview
-Push-based shuffle helps improve the reliability and performance of spark
shuffle. It takes a best-effort approach to push the shuffle blocks generated
by the map tasks to remote external shuffle services to be merged per shuffle
partition. Reduce tasks fetch a combination of merged shuffle partitions and
original shuffle blocks as their input data, resulting in converting small
random disk reads by external shuffle services into large sequential reads.
Possibility of better data locality for reduce tasks additionally helps
minimize network IO.
+Push-based shuffle helps improve the reliability and performance of spark
shuffle. It takes a best-effort approach to push the shuffle blocks generated
by the map tasks to remote external shuffle services to be merged per shuffle
partition. Reduce tasks fetch a combination of merged shuffle partitions and
original shuffle blocks as their input data, resulting in converting small
random disk reads by external shuffle services into large sequential reads.
Possibility of better data locality for reduce tasks additionally helps
minimize network IO. Push-based shuffle takes priority over batch fetch for
some scenarios, like partition coalesce.
Review comment:
`like partition coalesce when merged output is available`