[I] KubernetesPodTrigger emits timeout for running pods with unstable Triggerer [airflow]

via GitHub Sun, 02 Jun 2024 21:52:15 -0700


Usiel opened a new issue, #40018:
URL: https://github.com/apache/airflow/issues/40018


   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   8.3.0 (latest main)
   
   ### Apache Airflow version
   
   2.9.1
   
   ### Operating System
   
   -
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   We recently encountered a case where our Triggerer deployments were under 
heavy load and the trigger execution was severely delayed. This caused the 
following behavior to emerge:
   
   1. A deferrable KubernetesPodOperator is started and deferred
   2. `startup_timeout_seconds` (or more) seconds pass
   3. A Triggerer finally gets to `run()` the trigger and emits a `timeout` 
event, even if the pod is running or already completed successfully
   4. Task fails due to the timeout
   
   The same behavior can be provoked by another likely case: assume the 
Triggerer running the trigger crashes or restarts after 
`startup_timeout_seconds` has elapsed and the pod has already left the 
pending state. The new Triggerer process will execute `run()` and emit a 
timeout event because it sees that the timeout has been reached (it doesn't 
check the pod state).
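
The failure mode can be modeled in a few lines. This is a simplified illustration, not the provider's actual code: `time_only_timeout_check` and its parameters are hypothetical names, and the sketch assumes the trigger compares the time elapsed since it was created against `startup_timeout_seconds` without looking at the pod phase.

```python
from datetime import datetime, timedelta, timezone


def time_only_timeout_check(trigger_start_time, now, startup_timeout_seconds):
    """Hypothetical model of the check described above: only elapsed time counts."""
    elapsed = (now - trigger_start_time).total_seconds()
    return elapsed >= startup_timeout_seconds


# The task is deferred at t0, but a Triggerer only reaches run() 45 seconds
# later (heavy load, or a restart of the Triggerer process).
t0 = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
delayed_run = t0 + timedelta(seconds=45)

# The pod may already be running at this point, yet the time-only check
# still reports a timeout because the pod state is never consulted.
timed_out = time_only_timeout_check(t0, delayed_run, startup_timeout_seconds=10)
print(timed_out)
```

With a time-only check, any delay before `run()` that exceeds the startup timeout produces a timeout event, regardless of what the pod is doing.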
   
   ### What you think should happen instead
   
   The pod state should be checked: a timeout event should be emitted only if 
the pod is still pending **and** the timeout has been reached.
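
A minimal sketch of the proposed condition, under stated assumptions: `should_emit_timeout` and its parameters are hypothetical names chosen for illustration (not the provider's API), and the pod phase is assumed to be passed in as one of the standard Kubernetes phase strings (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`).

```python
from datetime import datetime, timedelta, timezone


def should_emit_timeout(pod_phase, trigger_start_time, now, startup_timeout_seconds):
    """Return True only if the pod is still Pending AND the startup timeout elapsed.

    Hypothetical helper: a real trigger would first fetch the pod from the
    Kubernetes API and read its phase before making this decision.
    """
    if pod_phase != "Pending":
        # The pod has started (or already finished), so the startup timeout
        # no longer applies, however late the trigger runs.
        return False
    elapsed = (now - trigger_start_time).total_seconds()
    return elapsed >= startup_timeout_seconds


t0 = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
late = t0 + timedelta(seconds=45)  # the trigger runs well after the 10 s timeout

for phase in ("Pending", "Running", "Succeeded"):
    print(phase, should_emit_timeout(phase, t0, late, startup_timeout_seconds=10))
```

Under this condition, a trigger that starts late (or restarts on a new Triggerer process) would only report a startup timeout for a pod that never left the pending state.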
   
   ### How to reproduce
   
   Use the following DAG on an Airflow deployment with Triggerer:
   
   ```
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
   from airflow.providers.cncf.kubernetes.utils.pod_manager import OnFinishAction
   
   with DAG(
           dag_id="kubernetes_trigger_bug",
           schedule=None,
           start_date=datetime(2021, 1, 1),
   ) as dag:
       k = KubernetesPodOperator(
           task_id="kubernetes_task_async",
           namespace="default",
           image="alpine:latest",
           cmds=["sh", "-c"],
           arguments=["sleep 60"],
           name="airflow-test-pod",
           deferrable=True,
           startup_timeout_seconds=10,
           on_finish_action=OnFinishAction.DELETE_SUCCEEDED_POD,
       )
   ```
   
   Now, you can simulate a Triggerer crash while the DAG is running:
   
   1. Trigger the DAG
   2. Wait at least 10 seconds (but less than 60 seconds)
   3. Kill the Triggerer process that handles the trigger (the new process will 
emit a timeout event)
   4. Wait for the pod to complete
   5. Task will fail due to the timeout
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


