[GitHub] [spark] mridulm commented on a change in pull request #34156: [WIP] [SPARK-36892] [Core] Disable batch fetch for a shuffle when push based shuffle is enabled

GitBox Tue, 05 Oct 2021 18:55:40 -0700


mridulm commented on a change in pull request #34156:
URL: https://github.com/apache/spark/pull/34156#discussion_r722826200




##########
File path: core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala
##########
@@ -734,4 +738,112 @@ class MapOutputTrackerSuite extends SparkFunSuite with LocalSparkContext {
       tracker.stop()
     }
   }
+
+  test("SPARK-36892: Batch fetch should be enabled in some scenarios with push based shuffle") {
+    conf.set(PUSH_BASED_SHUFFLE_ENABLED, true)
+    conf.set(IS_TESTING, true)

Review comment:
       Can you rebase onto the latest master (to include Minchu's patch)?
   Specifically, we also need `conf.set(SERIALIZER, "org.apache.spark.serializer.KryoSerializer")` here to enable push-based shuffle.
   (The same applies to the other tests below.)
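
   For illustration only, the configuration this comment asks for could look roughly like the sketch below. It is not taken from the PR: it uses the public string keys, which are assumed here to correspond to the internal constants `PUSH_BASED_SHUFFLE_ENABLED`, `IS_TESTING` and `SERIALIZER` that the test itself sets, and the object name is hypothetical.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper, not part of MapOutputTrackerSuite.
object PushBasedShuffleTestConf {
  // Builds a conf with the three settings named in the review comment.
  // String keys are assumed equivalents of the internal config constants.
  def build(): SparkConf =
    new SparkConf(false) // false: do not load defaults from system properties
      .set("spark.shuffle.push.enabled", "true") // PUSH_BASED_SHUFFLE_ENABLED
      .set("spark.testing", "true")              // IS_TESTING
      .set("spark.serializer",
        "org.apache.spark.serializer.KryoSerializer") // SERIALIZER
}
```

   Inside the suite, the same effect would come from adding the `conf.set(SERIALIZER, ...)` line next to the two existing `conf.set` calls in each affected test.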

##########
File path: docs/configuration.md
##########
@@ -3166,7 +3166,7 @@ See the `RDD.withResources` and `ResourceProfileBuilder` API's for using this fe
 
 # Push-based shuffle overview
 
-Push-based shuffle helps improve the reliability and performance of spark shuffle. It takes a best-effort approach to push the shuffle blocks generated by the map tasks to remote external shuffle services to be merged per shuffle partition. Reduce tasks fetch a combination of merged shuffle partitions and original shuffle blocks as their input data, resulting in converting small random disk reads by external shuffle services into large sequential reads. Possibility of better data locality for reduce tasks additionally helps minimize network IO.
+Push-based shuffle helps improve the reliability and performance of spark shuffle. It takes a best-effort approach to push the shuffle blocks generated by the map tasks to remote external shuffle services to be merged per shuffle partition. Reduce tasks fetch a combination of merged shuffle partitions and original shuffle blocks as their input data, resulting in converting small random disk reads by external shuffle services into large sequential reads. Possibility of better data locality for reduce tasks additionally helps minimize network IO. Push-based shuffle takes priority over batch fetch for some scenarios, like partition coalesce.

Review comment:
       `like partition coalesce when merged output is available` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



