Victsm commented on a change in pull request #30062:
URL: https://github.com/apache/spark/pull/30062#discussion_r512886244
##########
File path:
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
##########
@@ -94,6 +95,9 @@
static final String STOP_ON_FAILURE_KEY = "spark.yarn.shuffle.stopOnFailure";
private static final boolean DEFAULT_STOP_ON_FAILURE = false;
+ // Used by shuffle merge manager to create merged shuffle files.
+ protected static final String APP_BASE_RELATIVE_PATH =
"usercache/%s/appcache/%s/";
Review comment:
This goes back to the conversation we had in #29855
(https://github.com/apache/spark/pull/29855/files#r497077157)
The assumption right now is that we can use appId and user Id to properly
distinguish the local dirs given different cluster schedulers.
In the case of YARN, this information is provided through the following API
during application registration:
https://github.com/apache/spark/blob/f284218dae23bf91e72e221943188cdb85e13dac/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/MergedShuffleFileManager.java#L58-L66
During executor registration, the `blockManagerDir` local dir paths
information is further provided to `RemoteBlockPushResolver` through the
following API:
https://github.com/apache/spark/blob/f284218dae23bf91e72e221943188cdb85e13dac/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/MergedShuffleFileManager.java#L68-L77
Since block merge dirs are per-application instead of per-executor, we only
pick the local dir information for the first executor of a given application
registered with the shuffle service.
Having the common dir path pattern provided by `YarnShuffleService` to
`RemoteBlockPushResolver` would then allow `RemoteBlockPushResolver` to
construct the local dirs used for storing merged shuffle files from the
`blockManagerDir` local dirs provided during executor registration.
As @otterc mentioned, we don't really need the common dir path pattern if we
just store the full local dir paths in the hashmap.
This would also simplify the application registration API in
`MergeShuffleFileManager` to remove the user ID field, which could make it more
applicable to other cluster schedulers.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]