xkrogen opened a new pull request #31936:
URL: https://github.com/apache/spark/pull/31936


   ### What changes were proposed in this pull request?
   Add a new config, `spark.shuffle.service.name`, which lets Spark 
applications look for a YARN shuffle service registered under a name 
other than the default `spark_shuffle`.
   
   Add a new config, `spark.yarn.shuffle.service.metrics.namespace`, which 
sets the namespace used when the shuffle service emits metrics into the 
NodeManager's `metrics2` system.
   
   Add a new mechanism by which to override shuffle service configurations 
independently of the configurations in the NodeManager. When a resource 
`spark-shuffle-site.xml` is present on the classpath of the shuffle service, 
the configs present within it will be used to override the configs coming from 
`yarn-site.xml` (via the NodeManager).
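
   The override order described above can be illustrated with a small 
sketch. This is not the PR's implementation; it is a minimal Python model, 
assuming both files use the usual Hadoop `<configuration>`/`<property>` layout. 
The helper names (`parse_hadoop_xml`, `effective_shuffle_conf`) and the sample 
values are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def parse_hadoop_xml(text):
    """Parse a Hadoop-style <configuration> document into a dict."""
    root = ET.fromstring(text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def effective_shuffle_conf(yarn_site_xml, shuffle_site_xml=None):
    """Start from the NodeManager's configs, then let the shuffle-specific
    file (if present) override any keys it defines."""
    conf = parse_hadoop_xml(yarn_site_xml)
    if shuffle_site_xml is not None:
        conf.update(parse_hadoop_xml(shuffle_site_xml))
    return conf


# Illustrative values only; they are not taken from the PR.
YARN_SITE = """<configuration>
  <property><name>spark.shuffle.service.port</name><value>7337</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>spark_shuffle</value></property>
</configuration>"""

SHUFFLE_SITE = """<configuration>
  <property><name>spark.shuffle.service.port</name><value>7338</value></property>
</configuration>"""

merged = effective_shuffle_conf(YARN_SITE, SHUFFLE_SITE)
print(merged["spark.shuffle.service.port"])
```

   Keys defined only in `yarn-site.xml` pass through unchanged, and when no 
`spark-shuffle-site.xml` is supplied the result is simply the NodeManager's 
view of the configuration.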
   
   ### Why are the changes needed?
   There are two use cases which can benefit from these changes.
   
   One use case is to run multiple instances of the shuffle service 
side-by-side in the same NodeManager. This can be helpful, for example, when 
running a YARN cluster with a mixed workload of applications running multiple 
Spark versions, since a given version of the shuffle service is not always 
compatible with other versions of Spark (e.g. see SPARK-27780). With this PR, 
it is possible to run two shuffle services like `spark_shuffle` and 
`spark_shuffle_3.2.0`, one of which is "legacy" and one of which is for new 
applications. This is possible because YARN versions since 2.9.0 support the 
ability to run shuffle services within an isolated classloader (see YARN-4577), 
meaning multiple Spark versions can coexist.
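
   On the application side, selecting one of the two services comes down to 
setting the new config at submit time. The sketch below is a hypothetical 
helper (`shuffle_service_conf_args` is not part of Spark) that renders the 
relevant `--conf` arguments, reusing the example service name from the 
paragraph above and assuming the existing `spark.shuffle.service.enabled` 
setting is also turned on.

```python
def shuffle_service_conf_args(service_name):
    """Build spark-submit --conf arguments that point an application at a
    named YARN shuffle service (hypothetical helper for illustration)."""
    settings = {
        "spark.shuffle.service.enabled": "true",
        "spark.shuffle.service.name": service_name,
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# The service name is the example used in the description above.
print(" ".join(shuffle_service_conf_args("spark_shuffle_3.2.0")))
```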
   
   The other use case is separating shuffle service configs into 
`spark-shuffle-site.xml`, which can be useful for administrators who want to 
change and/or deploy Spark shuffle service configurations independently of the 
configurations for the NodeManager (e.g., when the two are owned by 
different teams).
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. There are two new configurations related to the external shuffle 
service, and a new mechanism which can optionally be used to configure the 
shuffle service. `docs/running-on-yarn.md` has been updated to provide user 
instructions; please see this guide for more details.
   
   ### How was this patch tested?
   In addition to adding new unit tests, I deployed this to a live YARN 
cluster and successfully ran two Spark shuffle services simultaneously, 
one running a modified version of Spark 2.3.0 (which supports some of the newer 
shuffle protocols) and one running Spark 3.1.1. Spark applications of both 
versions are able to communicate with their respective shuffle services without 
issue.

