Ngone51 commented on pull request #30650: URL: https://github.com/apache/spark/pull/30650#issuecomment-767433788
Hi @mridulm @tgravescs, sorry for the delay. After thinking it over, I think we should keep the original behavior for barrier tasksets under legacy delay scheduling. That is, we should still abort the taskset and throw an exception when its tasks are only partially launched.

Consider a case under legacy delay scheduling: a barrier taskset has 2 tasks, one of which prefers executor-0 while the other has no preferred locations. The only resources available in each resourceOffers round are (executor-0, host-0) and (executor-1, host-1). Within each round, the first task can always be scheduled on executor-0, which **resets** the timer and the current locality level to PROCESS_LOCAL. The other task then cannot be scheduled at PROCESS_LOCAL. On the next resourceOffers round we still cannot launch the whole taskset, because we start from PROCESS_LOCAL again. So the barrier taskset would never get a chance to launch.

Non-legacy delay scheduling doesn't have this issue because it only resets when all tasks get launched in a single resourceOffers round. So the locality level keeps relaxing over time (from PROCESS_LOCAL toward ANY) until the taskset launches successfully.
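To make the scenario above concrete, here is a minimal toy simulation in Python. It is not Spark's scheduler code: the `simulate` function, the `wait_rounds` parameter, and counting the locality wait in offer rounds rather than wall-clock time are all assumptions made for illustration. It models what would happen if a partially launched barrier taskset were retried instead of aborted.

```python
# Ordered locality levels, loosely mirroring Spark's TaskLocality (toy model only).
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY"]
NO_PREF = LEVELS.index("NO_PREF")


def simulate(legacy, rounds=10, wait_rounds=2):
    """Return the 1-based offer round in which both barrier tasks launch
    together, or None if that never happens within `rounds`.

    Hypothetical setup: task 0 prefers executor-0, which is offered every
    round; task 1 has no preferred location and is only schedulable once
    the allowed locality level has relaxed to NO_PREF or beyond.
    """
    level = 0   # index into LEVELS; 0 means PROCESS_LOCAL
    waited = 0  # offer rounds elapsed since the last timer reset
    for r in range(1, rounds + 1):
        launched = 1  # task 0 always fits on executor-0 at PROCESS_LOCAL
        if level >= NO_PREF:
            launched += 1  # task 1 is schedulable at this level
        if launched == 2:
            return r  # the whole barrier taskset launched in one round
        if legacy:
            # Legacy behavior (as described above): any single task launch
            # resets the timer and the level back to PROCESS_LOCAL.
            level, waited = 0, 0
        else:
            # Non-legacy behavior: no reset on a partial launch, so the
            # timer keeps running and the level eventually relaxes.
            waited += 1
            if waited >= wait_rounds and level < len(LEVELS) - 1:
                level, waited = level + 1, 0
    return None


print("legacy:", simulate(legacy=True))
print("non-legacy:", simulate(legacy=False))
```

Under these assumptions the legacy run never launches both tasks in the same round, while the non-legacy run does once the level has relaxed to NO_PREF. That is why aborting looks like the safer choice in the legacy case.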
