squito commented on issue #22806: [SPARK-25250][CORE] : Late zombie task completions handled correctly even before new taskset launched URL: https://github.com/apache/spark/pull/22806#issuecomment-459767222 uggh ... I just realized a problem. Everything in TaskSetManager is supposed to be protected by a lock on `TaskSchedulerImpl`, but now `markPartitionCompleted()` is getting called without that lock. The old version guaranteed that, because it was called from within a method in the TSM where we already had that lock. But now, we've moved that out to the DAGSchedulerEventLoop, where you do not necessarily have that lock. I haven't thought through the consequences of this -- without proper updates to `tasksSuccessful` and `successful`, what could go wrong? maybe we'd never mark the taskset as finished? will this lead to failures, or just inefficiencies? I'm not sure how to fix this. You could make `TaskSchedulerImpl.markPartitionCompletedInAllTaskSets()` synchronized, but then you're getting a lock on the taskSchedulerImpl in the DAGScheduler event loop. That's not good for scheduling throughput, and also want to make sure there is no change of deadlock. I feel like there should be a better solution ...
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
