Victsm commented on a change in pull request #30062:
URL: https://github.com/apache/spark/pull/30062#discussion_r511051284



##########
File path: 
common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
##########
@@ -363,4 +363,26 @@ public boolean useOldFetchProtocol() {
     return conf.getBoolean("spark.shuffle.useOldFetchProtocol", false);
   }
 
+  /**
+   * The minimum size of a chunk when dividing a merged shuffle file into 
multiple chunks during
+   * push-based shuffle.
+   * A merged shuffle file consists of multiple small shuffle blocks. Fetching 
the
+   * complete merged shuffle file in a single response increases the memory 
requirements for the
+   * clients. Instead of serving the entire merged file, the shuffle service 
serves the
+   * merged file in `chunks`. A `chunk` constitutes few shuffle blocks in 
entirety and this
+   * configuration controls how big a chunk can get. A corresponding index 
file for each merged
+   * shuffle file will be generated indicating chunk boundaries.
+   */
+  public int minChunkSizeInMergedShuffleFile() {
+    return Ints.checkedCast(JavaUtils.byteStringAsBytes(
+      conf.get("spark.shuffle.server.minChunkSizeInMergedShuffleFile", "2m")));

Review comment:
       One thing we discussed internally is whether this config should be a 
server side config or a client side config.
   As @otterc mentioned, there are multiple reasons we break a merged shuffle 
partition file into multiple smaller chunks.
   One of the biggest reasons is to parallelize fetching shuffle data and task 
execution.
   If we have a multi-GB merged shuffle partition and the client is fetching it 
as a single block, then the client would wait until it fetches its entirety 
before handing off to the task processing logic to process the block, which is 
not ideal.
   The question is that whether size of the chunk should be a global 
configuration on the server side, irrespective of individual applications, or a 
Spark app configuration so users can fine tune it.
   We currently make it a server side config so we don't introduce another 
parameter for users to tune.
   Want to get inputs from the community on this as well.
   cc @Ngone51 @attilapiros @jiangxb1987 @tgravescs 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to