otterc commented on a change in pull request #32007:
URL: https://github.com/apache/spark/pull/32007#discussion_r622429509
##########
File path: core/src/main/scala/org/apache/spark/storage/BlockId.scala
##########
@@ -87,6 +87,29 @@ case class ShufflePushBlockId(shuffleId: Int, mapIndex: Int,
reduceId: Int) exte
override def name: String = "shufflePush_" + shuffleId + "_" + mapIndex +
"_" + reduceId
}
+@DeveloperApi
+case class ShuffleMergedBlockId(appId: String, shuffleId: Int, reduceId: Int)
extends BlockId {
+ override def name: String = "mergedShuffle_" + appId + "_" + shuffleId + "_"
+ reduceId + ".data"
+}
+
+@DeveloperApi
+case class ShuffleMergedIndexBlockId(
+ appId: String,
+ shuffleId: Int,
+ reduceId: Int) extends BlockId {
+ override def name: String =
+ "mergedShuffle_" + appId + "_" + shuffleId + "_" + reduceId + ".index"
Review comment:
Below are my comments for each solution
1. Create a new `RegisterExecutor` message. If we do this, we might also
want to add the recent "merge_manager_attemptx` name. A benefit of that would
be to remove the logic on the server which tries to find the merge_manager
directory path by finding the parent of blockMgrs directory.
2. Encoding `attemptId` in the `appId` of `RegisterExecutor` message. The
old `ExternalShuffleBlockResolver` needs to know the `appId` (without
attemptId) when the executors are registered and the `executors` map is
populated, since all the `get...BlockData()` apis reference this map with only
`appId` and `executorId`. We can't change these apis. `RemoteBlockPushResolver`
may also need to know the `appId` and `attemptId` separately to avoid changes
to multiple data-structures. It would be then better to handle this parsing in
`ExternalBlockHandler`.
3. Have the `RemoteBlockPushResolver` figure out the attemptId just by
itself by listing the directory and finding the latest attempt. With this one
every time an executor registers, the shuffle server has to do the following:
- find the parent directory from blockManager dirs.
- list the merge_manager_* dirs
- figure out the latest attempt.
All these 3 steps above are avoided in solution 2 for every executor
registration because we know the attempt Id from the message itself and it will
ignore the ones belonging to the current one. The additional cost that is
incurred in solution 2 is just splitting the `appId` and `attemptId` which I
don't think is that much.
I think Solution 1 is the cleanest but I can see that it adds another
message and attemptId may be just specific to Yarn. Between 2 and 3, I prefer
solution 2.
WDYT @tgravescs @mridulm @attilapiros @Ngone51 @Victsm @zhouyejoe
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]