thinkharderdev opened a new issue, #585:
URL: https://github.com/apache/arrow-ballista/issues/585

   **Describe the bug**
   
   When multiple schedulers run concurrently, the following can happen:
   
   1. Job A gets submitted to scheduler A and is scheduled on all available 
task slots. 
   2. Job B gets submitted to scheduler B and there are no available task slots 
for scheduling.
   3. All task updates from Job A go back to scheduler A, which cannot schedule 
any tasks for Job B (because that job is owned by scheduler B).
   4. Because no task updates land on scheduler B, Job B will never be 
scheduled anywhere. 
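
The stall in steps 1-4 can be reproduced in a toy model. This is only a sketch under stated assumptions: `Cluster`, `Scheduler`, `submit`, `on_own_task_update` and `run_scenario` are hypothetical names and not Ballista's actual types, the cluster has a single task slot, and each job is reduced to a single task.

```rust
/// Shared view of the cluster (hypothetical): just a count of free task slots.
struct Cluster {
    free_slots: usize,
}

/// Toy scheduler (hypothetical): it only tries to schedule when a job is
/// submitted to it or when it receives a task update for a job it owns.
struct Scheduler {
    pending: Vec<&'static str>,
    running: Vec<&'static str>,
}

impl Scheduler {
    fn new() -> Self {
        Scheduler { pending: Vec::new(), running: Vec::new() }
    }

    /// Move pending jobs onto free slots while both exist.
    fn try_schedule(&mut self, cluster: &mut Cluster) {
        while cluster.free_slots > 0 && !self.pending.is_empty() {
            cluster.free_slots -= 1;
            let job = self.pending.remove(0);
            self.running.push(job);
        }
    }

    /// A job is submitted to this scheduler.
    fn submit(&mut self, job: &'static str, cluster: &mut Cluster) {
        self.pending.push(job);
        self.try_schedule(cluster);
    }

    /// A task of a job owned by this scheduler finished and released its slot.
    fn on_own_task_update(&mut self, cluster: &mut Cluster) {
        if !self.running.is_empty() {
            self.running.remove(0);
        }
        cluster.free_slots += 1;
        self.try_schedule(cluster);
    }
}

/// Replays steps 1-4 and returns (free slots, jobs still pending on scheduler B).
fn run_scenario() -> (usize, Vec<&'static str>) {
    let mut cluster = Cluster { free_slots: 1 };
    let mut scheduler_a = Scheduler::new();
    let mut scheduler_b = Scheduler::new();

    scheduler_a.submit("job-a", &mut cluster); // step 1: takes the only slot
    scheduler_b.submit("job-b", &mut cluster); // step 2: no slot, stays pending
    scheduler_a.on_own_task_update(&mut cluster); // step 3: update lands on A only

    // Step 4: nothing ever wakes scheduler B, so job-b is still pending
    // even though a slot is now free.
    (cluster.free_slots, scheduler_b.pending)
}
```

In this model `run_scenario()` ends with a free slot and `job-b` still pending on scheduler B, which is the stall described above.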
   
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   1. Start a cluster with two schedulers
   2. Submit a job to scheduler 1 that consumes all available executor slots
   3. Before any task of job 1 completes, submit a job to scheduler 2
   4. Job 2 will never run
   
   **Expected behavior**
   
   Job 2 should start running as soon as executor task slots become available.
   
   **Additional context**
   
   The fix here is simple. In the event loop, if a job is submitted and there 
are no task slots available, resubmit the job to the event loop (with a small 
delay to prevent excessive CPU consumption). 
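
A minimal sketch of the resubmit-with-delay idea, using only the Rust standard library. The `Event` enum, the `Scheduler` struct and its fields are hypothetical and do not mirror Ballista's actual event loop. `SlotReleased` stands in for slots being freed elsewhere in the cluster, and the blocking `thread::sleep` is an assumption made for brevity; an async event loop would more likely use a non-blocking timer.

```rust
use std::collections::VecDeque;
use std::thread;
use std::time::Duration;

/// Hypothetical event type, not Ballista's actual scheduler event enum.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    /// A job is waiting to be scheduled.
    JobSubmitted { job_id: String },
    /// A task slot became available somewhere in the cluster.
    SlotReleased,
}

/// Toy scheduler state: a local event queue plus a count of free task slots.
struct Scheduler {
    free_slots: usize,
    queue: VecDeque<Event>,
    running: Vec<String>,
    retry_delay: Duration,
}

impl Scheduler {
    fn new(free_slots: usize, retry_delay: Duration) -> Self {
        Scheduler {
            free_slots,
            queue: VecDeque::new(),
            running: Vec::new(),
            retry_delay,
        }
    }

    /// Put an event on the queue.
    fn post(&mut self, event: Event) {
        self.queue.push_back(event);
    }

    /// Handle one event. Returns false when the queue is empty.
    fn step(&mut self) -> bool {
        let event = match self.queue.pop_front() {
            Some(event) => event,
            None => return false,
        };
        match event {
            Event::JobSubmitted { job_id } => {
                if self.free_slots > 0 {
                    // A slot is free: schedule the job.
                    self.free_slots -= 1;
                    self.running.push(job_id);
                } else {
                    // No slot: wait briefly so the loop does not spin, then
                    // put the job back on the queue instead of dropping it.
                    thread::sleep(self.retry_delay);
                    self.queue.push_back(Event::JobSubmitted { job_id });
                }
            }
            Event::SlotReleased => self.free_slots += 1,
        }
        true
    }
}
```

With this shape, a job submitted while no slots are free keeps cycling through the queue until a slot appears, so it no longer depends on a task update reaching the scheduler that owns it.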


