CodingCat commented on code in PR #3569:
URL: https://github.com/apache/celeborn/pull/3569#discussion_r2646595242
##########
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:
##########
@@ -1041,7 +1043,7 @@ class LifecycleManager(val appUniqueId: String, val conf: CelebornConf) extends
   // So if a barrier stage is getting reexecuted, previous stage/attempt needs to
   // be cleaned up as it is entirely unusuable
   if (determinate && !isBarrierStage && !isCelebornSkewShuffleOrChildShuffle(
-      appShuffleId)) {
+      appShuffleId) && !conf.clientShuffleEarlyDeletion) {
Review Comment:
We cannot reuse the shuffle id when this feature is turned on. Consider the
following scenario:
Stage B.0 depends on shuffle 1, which was written by stage A.0.
Due to "too early deletion", shuffle 1 is lost, so we need to run A.1. By
that point, shuffle 1 has already been deleted from the "registered shuffle"
set. If we reuse 1 as the id and send it to the tasks of A.1, we will hit
errors like "shuffle not registered".
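
The scenario above can be sketched with a toy model. This is only an illustration under stated assumptions, not Celeborn code: `ShuffleIdReuseSketch`, `register`, `earlyDelete`, and `write` are hypothetical names, and the sketch assumes that a reused id is handed to tasks without being registered again, while a fresh id goes through registration as usual.

```scala
import scala.collection.mutable

// Toy model only: every name here is hypothetical, not a Celeborn API.
object ShuffleIdReuseSketch {
  // Shuffle ids that are currently in the "registered shuffle" set.
  private val registered = mutable.Set.empty[Int]

  def register(id: Int): Unit = registered += id

  // Models early deletion: the id is dropped from the registered set.
  def earlyDelete(id: Int): Unit = registered -= id

  // A task write succeeds only if its shuffle id is still registered.
  def write(id: Int): Either[String, Unit] =
    if (registered.contains(id)) Right(())
    else Left(s"shuffle $id not registered")

  def main(args: Array[String]): Unit = {
    register(1)                // stage A.0 writes shuffle 1
    assert(write(1).isRight)

    earlyDelete(1)             // shuffle 1 is deleted too early

    // Stage A.1 reuses id 1 (assumed: no re-registration on reuse).
    assert(write(1) == Left("shuffle 1 not registered"))

    // Stage A.1 gets a fresh id instead, which is registered first.
    val freshId = 2
    register(freshId)
    assert(write(freshId).isRight)
  }
}
```

In this model the failure comes purely from handing out an id that is no longer in the registered set, which is why skipping reuse when early deletion is enabled avoids it.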