nclaeys opened a new issue, #41436:
URL: https://github.com/apache/airflow/issues/41436

   ### Apache Airflow version
   
   2.9.3
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   We had several sensors that failed to be rescheduled because the scheduler 
still thought the worker tasks were running.
   
   The root cause was that the scheduler missed an update event from a worker 
task: the Kubernetes node on which the Airflow worker pod was running got 
deleted soon after the worker finished successfully. This violates the 
assumption in the code that a delete of a worker pod is only ever issued by 
Airflow itself. The problematic code is in `kubernetes_executor_utils.py`:
   
   ```python
   if (
       event["type"] == "DELETED"
       or POD_EXECUTOR_DONE_KEY in pod.metadata.labels
       or pod.metadata.deletion_timestamp
   ):
       self.log.info(
           "Skipping event for Succeeded pod %s - event for this pod already sent to executor",
           pod_name,
       )
       return
   ```
   In the logs we only see "skipping event" messages for these worker pods; 
there is never a preceding event that was actually processed.
   
   Because the sensor was registered for reschedule, the task got requeued. 
From that point on the scheduler never scheduled the task again, as it thought 
an instance was still running, and logged the following:
   
   `INFO - queued but still running;`
   
   The only way to fix it was to restart the scheduler, which brought the 
scheduler's internal state back in sync with Kubernetes.
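   The race described above can be made concrete with a small, self-contained sketch of the guard quoted earlier. The function and the value of `POD_EXECUTOR_DONE_KEY` below are simplified stand-ins for illustration, not the actual Airflow code:
   
   ```python
   # Simplified stand-in for the guard in kubernetes_executor_utils.py.
   # The key's value is assumed here; only its use as a label key matters.
   POD_EXECUTOR_DONE_KEY = "airflow_executor_done"
   
   def should_skip(event_type: str, labels: dict, deletion_timestamp) -> bool:
       """Return True when the watcher would drop the event for a Succeeded pod."""
       return (
           event_type == "DELETED"
           or POD_EXECUTOR_DONE_KEY in labels
           or deletion_timestamp is not None
       )
   
   # Normal flow: the success event arrives first and is processed ...
   assert not should_skip("MODIFIED", {}, None)
   # ... and the later DELETED event, issued by Airflow itself, is safely skipped.
   assert should_skip("DELETED", {POD_EXECUTOR_DONE_KEY: "True"}, "2024-08-13T10:00:00Z")
   
   # Race: the node is deleted just after the pod succeeds, so the only event
   # the watcher ever sees is DELETED -- it is skipped and the success is lost.
   assert should_skip("DELETED", {}, "2024-08-13T10:00:00Z")
   ```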
   
   
   ### What you think should happen instead?
   
   The fundamental problem is that the watcher for Kubernetes pod events 
skipped the success event of the worker.
   
   To make sure we process all events there are 2 options:
   - Remove the skipping of events, as it is wrong under certain race 
conditions. I investigated when it was added and, as far as I can tell, it was 
not added for a functional reason, just as a behind-the-scenes optimisation: 
https://github.com/apache/airflow/pull/30872
   - To handle this reliably in Kubernetes, use finalizers for the worker 
pods. That way you can guarantee that no events are missed, and the finalizer 
can be used to make sure success events are only processed once.
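   The finalizer option could look roughly like the sketch below. This is not the current Airflow implementation; the finalizer name is made up for illustration, and the patch bodies are kept as plain dicts so the mechanism is visible:
   
   ```python
   # Hypothetical sketch of the finalizer-based approach. While a finalizer is
   # set, a pod delete (e.g. from node removal) only sets
   # metadata.deletion_timestamp; the pod object -- and its terminal phase --
   # stays visible to the watcher until the finalizer is cleared.
   FINALIZER = "airflow.apache.org/task-result-recorded"  # name assumed
   
   def add_finalizer_patch() -> dict:
       # Applied when the worker pod is launched, so Kubernetes cannot fully
       # delete the pod before the executor has recorded its terminal state.
       return {"metadata": {"finalizers": [FINALIZER]}}
   
   def release_finalizer_patch() -> dict:
       # Applied exactly once, after the success/failure event has been handed
       # to the executor; clearing the finalizer lets the delete complete.
       return {"metadata": {"finalizers": []}}
   
   # With the kubernetes Python client these patches would be sent via e.g.
   # v1.patch_namespaced_pod(pod_name, namespace, add_finalizer_patch())
   ```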
   
   ### How to reproduce
   
   The issue is difficult to reproduce reliably. We do notice it on our large 
production deployment from time to time.
   It is, however, easy to see that the code is wrong in certain edge cases.
   
   ### Operating System
   
   kubernetes: apache/airflow:slim-2.9.3-python3.11
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==8.0.1
   apache-airflow-providers-common-io==1.3.2
   apache-airflow-providers-common-sql==1.14.2
   apache-airflow-providers-fab==1.2.2
   apache-airflow-providers-ftp==3.10.0
   apache-airflow-providers-http==4.12.0
   apache-airflow-providers-imap==3.6.1
   apache-airflow-providers-opsgenie==4.0.0
   apache-airflow-providers-postgres==5.11.2
   apache-airflow-providers-slack==7.3.2
   apache-airflow-providers-smtp==1.7.1
   apache-airflow-providers-sqlite==3.8.1
   
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   Kubernetes deployment
   
   ### Anything else?
   
   /
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

