dirrao commented on PR #32249:
URL: https://github.com/apache/airflow/pull/32249#issuecomment-1614196394

   Hi @dstandish,
   
   Problem: 
   
When the scheduler creates a worker pod for a task, it attaches the label
airflow-worker=<scheduler_job_id> to the pod. This label is a unique
identifier indicating which scheduler is tracking that worker pod.
   Each scheduler listens only for events from the pods it started. It watches
pod events through the Kubernetes watch API, filtered by the condition
label_selector: airflow-worker=<my_job_id>, so each scheduler sees only the
events for its own pods.
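   The scoping described above can be sketched as follows. This is an illustrative sketch, not the actual Airflow watcher code; the kwargs shape and `scheduler_job_id` value mirror this description only.

```python
# Hypothetical sketch: each scheduler builds a watch filter scoped to
# pods carrying its own job id in the airflow-worker label.
scheduler_job_id = "1"

# Only pods labeled airflow-worker=1 produce events for this scheduler.
kwargs = {"label_selector": f"airflow-worker={scheduler_job_id}"}

print(kwargs)
```

In the real executor these kwargs are passed to the Kubernetes watch call, so the API server itself drops events for pods owned by other schedulers.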
   
   So now the following happens:
   
   1. Let's say scheduler1 (job id 1) starts a task in a worker pod with label 
`airflow-worker=1`, and its heartbeat gets delayed.
   2. scheduler2 (job id 2) now thinks scheduler1 is dead because its 
heartbeat was not received on time (even though scheduler1 is alive).
   3. scheduler2 therefore adopts scheduler1's tasks. This means scheduler2 
updates the pod's label with its own job id, i.e. the label changes from 
airflow-worker=1 to airflow-worker=2.
   4. On a label update, the Kubernetes watch API sends a DELETED event to 
watchers of airflow-worker=1 and an ADDED event to watchers of 
airflow-worker=2.
   5. As scheduler1 is still alive, it keeps listening for events with label 
airflow-worker=1, so it receives the DELETED event from step 4. scheduler1 
thus believes the pod was deleted while in the running phase and runs the 
task-failure path. One of its steps is pod cleanup, which deletes the pod to 
avoid dangling pods, so it sends a delete request to the Kubernetes API.
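   The race in steps 1-5 can be simulated without a cluster. This is a minimal sketch under the assumption (matching the description above) that a label-scoped watch emits DELETED when a pod's labels stop matching the selector and ADDED when they start matching; `relabel_events` is an illustrative helper, not Airflow code.

```python
def relabel_events(selector: str, old_label: str, new_label: str) -> list[str]:
    """Simulate what a watch scoped to `selector` emits when a pod's
    airflow-worker label changes from old_label to new_label."""
    events = []
    if old_label == selector and new_label != selector:
        events.append("DELETED")  # pod left this watcher's scope
    if new_label == selector and old_label != selector:
        events.append("ADDED")    # pod entered this watcher's scope
    return events

# scheduler2 adopts the pod: the label flips from airflow-worker=1 to =2.
old, new = "airflow-worker=1", "airflow-worker=2"

# scheduler1 (watching airflow-worker=1) sees a spurious DELETED event,
# even though the pod is still running.
print(relabel_events("airflow-worker=1", old, new))  # ['DELETED']

# scheduler2 (watching airflow-worker=2) sees an ADDED event.
print(relabel_events("airflow-worker=2", old, new))  # ['ADDED']
```

The spurious DELETED is what drives scheduler1 into its task-failure path and the pod-cleanup delete request.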
   
   Solution: Change the Airflow Kubernetes watch label-selector filter from 
`kwargs = {"label_selector": f"airflow-worker={scheduler_job_id}"}` to 
`kwargs = {"label_selector": "airflow-worker"}`, and then filter the events 
inside the scheduler by airflow-worker=<my_job_id>.
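   A sketch of the proposed filtering, assuming events shaped like the Kubernetes watch API's dicts (a `type` plus an `object` with pod metadata); `is_mine` and the sample events are illustrative, not the PR's code.

```python
scheduler_job_id = "1"

# Existence selector: watch every pod that carries the airflow-worker
# label, regardless of its value.
kwargs = {"label_selector": "airflow-worker"}

def is_mine(event: dict) -> bool:
    """Keep only events whose pod is labeled with this scheduler's job id."""
    labels = event["object"]["metadata"]["labels"]
    return labels.get("airflow-worker") == scheduler_job_id

# Sample events as a broadened watch might deliver them:
events = [
    {"type": "MODIFIED",  # adopted pod, relabeled to scheduler2
     "object": {"metadata": {"labels": {"airflow-worker": "2"}}}},
    {"type": "ADDED",     # a pod this scheduler started
     "object": {"metadata": {"labels": {"airflow-worker": "1"}}}},
]

mine = [e for e in events if is_mine(e)]
print([e["type"] for e in mine])  # ['ADDED']
```

With the broadened selector, a relabeled pod never leaves the watch's scope, so scheduler1 receives a MODIFIED event (which the in-process filter drops) instead of a misleading DELETED.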
   
   QA: I have updated the test case to cover the existing functionality. 
However, I am not sure how to write test cases for the multi-scheduler case. 
Can you share references for that?
   