venkata91 commented on a change in pull request #33615:
URL: https://github.com/apache/spark/pull/33615#discussion_r687896314
##########
File path: docs/configuration.md
##########
@@ -3152,3 +3152,109 @@ The stage level scheduling feature allows users to specify task and executor res
This is only available for the RDD API in Scala, Java, and Python. It is
available on YARN and Kubernetes when dynamic allocation is enabled. See the
[YARN](running-on-yarn.html#stage-level-scheduling-overview) page or
[Kubernetes](running-on-kubernetes.html#stage-level-scheduling-overview) page
for more implementation details.
See the `RDD.withResources` and `ResourceProfileBuilder` APIs for using this feature. The current implementation acquires new executors for each `ResourceProfile` created and currently has to be an exact match. Spark does not try to fit tasks that require a different ResourceProfile into an executor created with another one. Executors that are not in use will idle timeout with the dynamic allocation logic. The default configuration for this feature is to only allow one ResourceProfile per stage. If the user associates more than 1 ResourceProfile to an RDD, Spark will throw an exception by default. See config `spark.scheduler.resource.profileMergeConflicts` to control that behavior. The current merge strategy Spark implements when `spark.scheduler.resource.profileMergeConflicts` is enabled is a simple max of each resource within the conflicting ResourceProfiles: Spark will create a new ResourceProfile with the max of each of the resources.
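For illustration, here is a minimal Scala sketch of the `ResourceProfileBuilder` / `RDD.withResources` usage described above. The specific resource amounts and the input `rdd` are placeholders, not taken from this PR:

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Placeholder resource amounts: each executor acquired for this profile
// gets 4 cores and 8g of heap; each task in stages using it needs 2 cpus.
val executorReqs = new ExecutorResourceRequests().cores(4).memory("8g")
val taskReqs = new TaskResourceRequests().cpus(2)

val profile = new ResourceProfileBuilder()
  .require(executorReqs)
  .require(taskReqs)
  .build()

// `rdd` is assumed to be an existing RDD; stages computing the result
// will run only on executors acquired for this exact profile.
val withProfile = rdd.map(identity).withResources(profile)
```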
+
+# Push-based shuffle overview
+
+Push-based shuffle helps improve the reliability and performance of Spark shuffle. It takes a best-effort approach to push the shuffle blocks generated by the map tasks to remote shuffle services, where they are merged per shuffle partition. Reduce tasks fetch a combination of merged shuffle partitions and original shuffle blocks as their input data, which converts the small random disk reads performed by shuffle services into large sequential reads. The possibility of better data locality for reduce tasks additionally helps minimize network IO.
+
+<p> Push-based shuffle improves performance for long-running jobs/queries which involve large disk I/O during shuffle. Currently it is not well suited for jobs/queries which run quickly and deal with smaller amounts of shuffle data. This will be further improved in future releases.</p>
+
+<p> <b> Currently push-based shuffle is only supported for Spark on YARN with
external shuffle service. </b></p>
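+
+For orientation before the configuration tables below, enabling the feature takes one setting on each side. A hedged sketch in Scala follows; the property names come from the tables below, but how your YARN external shuffle service picks up its `spark.*` properties depends on the deployment:

```scala
import org.apache.spark.SparkConf

// Client side: opt the application into push-based shuffle.
val conf = new SparkConf()
  .setAppName("push-shuffle-demo") // placeholder app name
  .set("spark.shuffle.push.enabled", "true")

// Server side (read by the external shuffle service, not via SparkConf):
//   spark.shuffle.push.server.mergedShuffleFileManagerImpl =
//     org.apache.spark.network.shuffle.RemoteBlockPushResolver
```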
+
+### Shuffle server side configuration options
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
+<tr>
+ <td><code>spark.shuffle.push.server.mergedShuffleFileManagerImpl</code></td>
+ <td>
+    <code>org.apache.spark.network.shuffle.<br />NoOpMergedShuffleFileManager</code>
+ </td>
+  <td>
+    Class name of the implementation of <code>MergedShuffleFileManager</code> that manages push-based shuffle. This acts as a server side config to disable or enable push-based shuffle. By default, push-based shuffle is disabled at the server side. <p> To enable push-based shuffle on the server side, set this config to <code>org.apache.spark.network.shuffle.RemoteBlockPushResolver</code>.</p>
+  </td>
+ <td>3.2.0</td>
+</tr>
+<tr>
+  <td><code>spark.shuffle.push.server.minChunkSizeInMergedShuffleFile</code></td>
+ <td><code>2m</code></td>
+ <td>
+    <p> The minimum size of a chunk when dividing a merged shuffle file into multiple chunks during push-based shuffle. A merged shuffle file consists of multiple small shuffle blocks. Fetching the complete merged shuffle file in a single disk I/O increases the memory requirements for both the clients and the external shuffle service. Instead, the shuffle service serves the merged file in MB-sized chunks.<br /> This configuration controls how big a chunk can get. A corresponding index file for each merged shuffle file will be generated indicating chunk boundaries. </p>
+    <p> Setting this too high would increase the memory requirements on both the clients and the external shuffle service. </p>
+    <p> Setting this too low would unnecessarily increase the overall number of RPC requests to the external shuffle service.</p>
+ </td>
+ <td>3.2.0</td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.push.server.mergedIndexCacheSize</code></td>
+ <td><code>100m</code></td>
+ <td>
+    The size of the in-memory cache used in push-based shuffle for storing merged index files. This cache is in addition to the one configured via <code>spark.shuffle.service.index.cache.size</code>.
+ </td>
+ <td>3.2.0</td>
+</tr>
+</table>
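+
+To make the chunk-size trade-off in the table above concrete, a rough, illustrative calculation (the numbers are not from this PR): with the default `spark.shuffle.push.server.minChunkSizeInMergedShuffleFile` of `2m`, a 1 GiB merged shuffle partition file is divided into roughly 1024 MiB / 2 MiB = 512 chunks, so fetching the whole file costs on the order of 512 RPCs with small per-chunk buffers. Raising the value to `16m` would cut that to about 64 RPCs, at the cost of roughly 8x larger per-chunk memory on both the shuffle service and the clients.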
+
+### Client side configuration options
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
+<tr>
+ <td><code>spark.shuffle.push.enabled</code></td>
+ <td><code>false</code></td>
+ <td>
+    Set to <code>true</code> to enable push-based shuffle on the client side. This works in conjunction with the server side flag <code>spark.shuffle.push.server.mergedShuffleFileManagerImpl</code>.
+ </td>
+ <td>3.2.0</td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.push.merge.finalize.timeout</code></td>
Review comment:
@Ngone51 2 reasons:
1. It can be slightly confusing for users to configure both `finalize.timeout` and `results.timeout`.
2. Also, `results.timeout` is an upper bound: if the merge results arrive sooner than the upper bound (10s default), the driver just proceeds with scheduling the other stages. That is not the case with `finalize.timeout`, which is a fixed wait time before `finalize` is called on all the external shuffle services. It would also eventually become an upper bound once we address SPARK-33701. Hope that clarifies it a bit.
Let me know your thoughts. Thanks!