attilapiros opened a new pull request #31513:
URL: https://github.com/apache/spark/pull/31513


   
   ### What changes were proposed in this pull request?
   
   This PR modifies the executor POD allocator to ask the scheduler backend for the executors it already knows about, and to remove those executors from the allocator's pending and newly created lists so they cannot be deleted as excess pod requests.
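   To make the idea concrete, here is a minimal, hypothetical sketch of the filtering step (not the actual `ExecutorPodsAllocator` code): the allocator asks the scheduler backend which executors have already registered and excludes those from the set of deletable pod requests. The names `KnownExecutorBackend` and `excessToDelete` are illustrative only.
   
   ```scala
   // Minimal, hypothetical sketch (not the real ExecutorPodsAllocator):
   // compute the excess pod requests that are safe to delete by excluding
   // every executor the scheduler backend already knows about.
   object PodAllocatorSketch {
   
     // Stand-in for the scheduler backend's view of registered executors.
     trait KnownExecutorBackend {
       def knownExecutors: Set[Long]
     }
   
     // Only executors the backend has never seen are deletable as excess.
     def excessToDelete(
         newlyCreated: Set[Long],
         pendingFromSnapshot: Set[Long],
         backend: KnownExecutorBackend,
         excessCount: Int): Set[Long] = {
       val candidates =
         (newlyCreated ++ pendingFromSnapshot) -- backend.knownExecutors
       candidates.take(excessCount)
     }
   
     def main(args: Array[String]): Unit = {
       val backend = new KnownExecutorBackend {
         // Executor 312 has already registered and was assigned a task.
         def knownExecutors: Set[Long] = Set(312L)
       }
       val toDelete = excessToDelete(
         newlyCreated = Set(408L),
         pendingFromSnapshot = Set(312L, 307L),
         backend = backend,
         excessCount = 3)
       assert(!toDelete.contains(312L)) // 312 is protected from deletion
       println(s"Deleting excess pod requests: ${toDelete.mkString(",")}")
     }
   }
   ```
   
   Without the `-- backend.knownExecutors` filter, executor 312 from the log below would be among the deleted requests even though it is already running a task.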
   
   ### Why are the changes needed?
   
   Because there is a race between the executor POD allocator and the cluster scheduler backend.
   During several downscaling experiments we saw many freshly started executors killed even though they already had tasks running on them: the allocator picked its excess pods from stale bookkeeping, before learning that those executors had already registered with the scheduler backend.
   
   The log showed the following pattern (note executor 312 and TID 2079):
   
   ```
   21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new 
total is 138)
   ...
   21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 
2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
   21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests 
(408,312,307).
   ...
   21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 
100.100.18.138: The executor with id 312 was deleted by a user or the framework.
   21/02/01 15:12:04 INFO TaskSetManager: Task 2079 failed because while it was 
being computed, its executor exited for a reason unrelated to the task. Not 
counting this failure towards the maximum number of failures for the task.
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   #### Manually
   
   With this change, no executor with a running task on it was lost during downscaling.
   
   #### With unit test
   
   An existing test is modified to check this case.
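   
   For illustration only, here is a hedged sketch of the kind of assertion such a test can make, reusing the hypothetical `PodAllocatorSketch` helper from the sketch above rather than the real `ExecutorPodsAllocatorSuite`:
   
   ```scala
   import org.scalatest.funsuite.AnyFunSuite
   
   // Hypothetical test sketch: executors already known to the scheduler
   // backend must never be selected for deletion as excess pod requests.
   class ExcessDeletionSketchSuite extends AnyFunSuite {
     test("registered executors are not deleted as excess pod requests") {
       val backend = new PodAllocatorSketch.KnownExecutorBackend {
         def knownExecutors: Set[Long] = Set(312L)
       }
       val toDelete = PodAllocatorSketch.excessToDelete(
         newlyCreated = Set(408L),
         pendingFromSnapshot = Set(312L, 307L),
         backend = backend,
         excessCount = 3)
       assert(!toDelete.contains(312L))
       assert(toDelete == Set(408L, 307L))
     }
   }
   ```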

