johnhoran opened a new issue, #41867:
URL: https://github.com/apache/airflow/issues/41867

   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==8.4.1
   
   ### Apache Airflow version
   
   2.9.1
   
   ### Operating System
   
   Linux
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   Originally we noticed this in astronomer in a dag using astronomer-cosmos.  
However I have been able to reliably reproduce the issue in airflow running 
through a python debugger in vscode using just the Kubernetes Pod Operator.
   
   ### What happened
   
   When running a DAG we noticed a task that failed according to the logs 
output from DBT.  However airflow still marked the task as successful and then 
proceeded with the next task.  
   
   In the airflow logs we see a line like 
   ```
   [2024-08-28, 13:30:10 UTC] {triggerer_job_runner.py:631} INFO - Trigger 
my_task_group/manual__2024-08-28T11:39:57.188213+00:00/regional_tickets.base_tickets_run/-1/1
 (ID 711) fired: TriggerEvent<{'status': 'running', 'last_log_time': 
DateTime(2024, 8, 28, 13, 29, 1, 820284, tzinfo=Timezone('UTC')), 'namespace': 
'sidereal-protostar-2231', 'name': 'my-dag-run-2p3iotfr'}>
   ```
   Then logs from DBT that show task has failed followed by
   ```
   2024-08-28, 13:30:56 UTC] {taskinstance.py:352} INFO - Marking task as 
SUCCESS. dag_id=my_dag, task_id=regional_tickets.base_tickets_run, 
run_id=manual__2024-08-28T11:39:57.188213+00:00, 
execution_date=20240828T113957, start_date=20240828T123001, 
end_date=20240828T133056
   ```
   
   ### What you think should happen instead
   
   The task should have failed and retried.
   
   ### How to reproduce
   
   I have created a simple dag like
   ```python
   from airflow.providers.cncf.kubernetes.operators.pod import 
KubernetesPodOperator
   import airflow
   from airflow import DAG
   from datetime import timedelta
   
   with DAG(
           'kubernetes_defer_test',
           default_args={
               'start_date': airflow.utils.dates.days_ago(0),
               'retries': 20,
               'retry_delay': timedelta(seconds=5)
           },
           schedule_interval=None,
           catchup=False,
   ) as dag:
   
       task1 = KubernetesPodOperator(
           name="hello-world",
           image="python:3.11-slim",
           cmds=['python', '-c'],
           arguments=["import time; time.sleep(5); print('hello'); 
time.sleep(30); print('world'); import sys; sys.exit(1)"],
           task_id="dummy_run",
           deferrable=True,
           logging_interval=10,
       )
   ```
   
   And then placing a breakpoint on 
https://github.com/apache/airflow/blob/e8888fe055ac8c341e2fa6631ff6ac5f089d54c5/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L427
 
   I leave it wait here until the pod execution fails, and then when it resumes 
the task will be marked as successful.  
   
   ### Anything else
   
   I hope this is a faithful replication of the bug.  I have seen it just by 
running the dag in an airflow deployment, but it is intermittent until the 
timing lines up correctly.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to