Victsm commented on a change in pull request #30062:
URL: https://github.com/apache/spark/pull/30062#discussion_r511051284
##########
File path:
common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
##########
@@ -363,4 +363,26 @@ public boolean useOldFetchProtocol() {
return conf.getBoolean("spark.shuffle.useOldFetchProtocol", false);
}
+ /**
+ * The minimum size of a chunk when dividing a merged shuffle file into
multiple chunks during
+ * push-based shuffle.
+ * A merged shuffle file consists of multiple small shuffle blocks. Fetching
the
+ * complete merged shuffle file in a single response increases the memory
requirements for the
+ * clients. Instead of serving the entire merged file, the shuffle service
serves the
+ * merged file in `chunks`. A `chunk` constitutes few shuffle blocks in
entirety and this
+ * configuration controls how big a chunk can get. A corresponding index
file for each merged
+ * shuffle file will be generated indicating chunk boundaries.
+ */
+ public int minChunkSizeInMergedShuffleFile() {
+ return Ints.checkedCast(JavaUtils.byteStringAsBytes(
+ conf.get("spark.shuffle.server.minChunkSizeInMergedShuffleFile", "2m")));
Review comment:
One thing we discussed internally is whether this config should be a
server side config or a client side config.
As @otterc mentioned, there are multiple reasons we break a merged shuffle
partition file into multiple smaller chunks.
One of the biggest reasons is to parallelize fetching shuffle data and task
execution.
If we have a multi-GB merged shuffle partition and the client is fetching it
as a single block, then the client would wait until it fetches its entirety
before handing off to the task processing logic to process the block, which is
not ideal.
The question is that whether size of the chunk should be a global
configuration on the server side, irrespective of individual applications, or a
Spark app configuration so users can fine tune it.
We currently make it a server side config so we don't introduce another
parameter for users to tune.
Want to get inputs from the community on this as well.
cc @Ngone51 @attilapiros @jiangxb1987 @tgravescs
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]