mridulm commented on a change in pull request #32007:
URL: https://github.com/apache/spark/pull/32007#discussion_r624054924
##########
File path: core/src/main/scala/org/apache/spark/storage/BlockId.scala
##########
@@ -87,6 +87,29 @@ case class ShufflePushBlockId(shuffleId: Int, mapIndex: Int, reduceId: Int) exte
override def name: String = "shufflePush_" + shuffleId + "_" + mapIndex + "_" + reduceId
}
+@DeveloperApi
+case class ShuffleMergedBlockId(appId: String, shuffleId: Int, reduceId: Int) extends BlockId {
+ override def name: String = "mergedShuffle_" + appId + "_" + shuffleId + "_" + reduceId + ".data"
+}
+
+@DeveloperApi
+case class ShuffleMergedIndexBlockId(
+ appId: String,
+ shuffleId: Int,
+ reduceId: Int) extends BlockId {
+ override def name: String =
+ "mergedShuffle_" + appId + "_" + shuffleId + "_" + reduceId + ".index"
Review comment:
@Ngone51 Moving the management to the executor is a nice idea. If we are
moving the delete to the executor, then we could formulate it so that the
change is minimal, right?
In executor:
a) List application/merge_manager_*
b) Delete all merge_manager_* directories [1] which are older than the
current attempt id [2]
b.1) If there are newer merge_manager_* directories, it exits, since this
is not the latest attempt for this app (the chance should be vanishingly
small, but adding this for completeness' sake).
c) Create merge_manager_$ATTEMPT_ID, and register with the existing RPCs.
In shuffle service:
d) The shuffle service simply looks up the latest merge_manager_* (which
should ideally be a single directory), and registers the application/executor
with that attempt.
d.1) If earlier attempts exist, clean up their metadata; the directory would
have been cleaned up already anyway.
Thoughts?
[1] Given the possibility of concurrent executors deleting, we should be
robust to errors there.
[2] In theory, there could be a race with a newer version, but it should not
happen.
- that is, if merge_manager_attemptId
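
To make the proposal concrete, here is a rough sketch of steps a) to d) using plain `java.io.File`. It is only an illustration, not the PR's code: the `MergeDirs` object and its method names are hypothetical, it assumes the attempt id is the integer suffix of `merge_manager_<attemptId>`, and it leaves out the RPC registration in c) and the metadata cleanup in d.1).

```scala
import java.io.File
import scala.util.Try

// Hypothetical helper; none of these names exist in Spark.
object MergeDirs {
  private val Prefix = "merge_manager_"

  // a) List <appDir>/merge_manager_* as (attemptId, dir), sorted by attempt id.
  // Assumes the suffix is an integer attempt id; other names are ignored.
  def listAttempts(appDir: File): Seq[(Int, File)] = {
    val children = Option(appDir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    children
      .filter(f => f.isDirectory && f.getName.startsWith(Prefix))
      .flatMap(f => Try(f.getName.stripPrefix(Prefix).toInt).toOption.map(id => (id, f)))
      .sortBy(_._1)
  }

  // [1] Best-effort recursive delete. File.delete() returns false instead of
  // throwing when a concurrent executor has already removed the entry.
  private def deleteQuietly(f: File): Unit = {
    Option(f.listFiles()).foreach(_.foreach(deleteQuietly))
    f.delete()
    ()
  }

  // Executor side: returns the directory for this attempt, or None when a
  // newer attempt already exists (b.1) or the directory cannot be created.
  def prepareAttemptDir(appDir: File, attemptId: Int): Option[File] = {
    val attempts = listAttempts(appDir)
    // b) Delete directories that belong to older attempts.
    attempts.filter(_._1 < attemptId).foreach { case (_, dir) => deleteQuietly(dir) }
    if (attempts.exists(_._1 > attemptId)) {
      None // b.1) Not the latest attempt for this app.
    } else {
      // c) Create merge_manager_$ATTEMPT_ID (RPC registration is omitted here).
      val dir = new File(appDir, Prefix + attemptId)
      dir.mkdirs()
      if (dir.isDirectory) Some(dir) else None
    }
  }

  // Shuffle service side: d) pick the latest attempt directory, if any.
  def latestAttemptDir(appDir: File): Option[(Int, File)] =
    listAttempts(appDir).lastOption
}
```

With this shape, the shuffle service only ever calls `latestAttemptDir`, and all deletion lives in `prepareAttemptDir` on the executor, which is what keeps the change small.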