karakanb commented on issue #37041:
URL: https://github.com/apache/airflow/issues/37041#issuecomment-2036844725

   I think I see the same behavior on one of our clusters with Airflow v2.8.0. It's pretty hard to debug/reproduce; in our case it happened with the `SnowflakeOperator`. I have `retries: 1`, but there are no logs for the second execution, only for the first one, and our `on_failure_callback` is triggered.
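   
   For reference, the task configuration is roughly along these lines (a simplified sketch only; the DAG/task names, schedule, SQL, connection id and callback body here are illustrative, not our actual code):
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
   
   
   def notify_on_failure(context):
       # In our setup this sends an alert; here it just logs the failed task.
       print(f"on_failure_callback fired for {context['task_instance'].task_id}")
   
   
   with DAG(
       dag_id="my_dag",
       start_date=datetime(2024, 4, 1),
       schedule="0 2 * * *",   # illustrative schedule matching the run_id below
       catchup=False,
   ):
       SnowflakeOperator(
           task_id="my_task",
           snowflake_conn_id="snowflake_default",  # illustrative connection id
           sql="SELECT 1",                          # placeholder query
           retries=1,
           on_failure_callback=notify_on_failure,
       )
   ```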
   
   When I look at the `task_fail` table I see two entries:
   |id   |task_id|dag_id|run_id                              |map_index|start_date                       |end_date                         |duration|
   |-----|-------|------|------------------------------------|---------|---------------------------------|---------------------------------|--------|
   |10146|my_task|my_dag|scheduled__2024-04-03T02:00:00+00:00|-1       |2024-04-04 02:00:15.312803 +00:00|2024-04-04 02:01:01.365676 +00:00|46      |
   |10149|my_task|my_dag|scheduled__2024-04-03T02:00:00+00:00|-1       |2024-04-04 02:00:15.312803 +00:00|2024-04-04 02:01:01.679737 +00:00|46      |
   
   Note that they have the same start date but different end dates.
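   
   In case it helps anyone reproduce the query, the rows above can also be pulled through Airflow's ORM instead of raw SQL, roughly like this (a sketch; the filter values are simply the ones from the table above):
   
   ```python
   from airflow.models.taskfail import TaskFail
   from airflow.utils.session import create_session
   
   # Query the metadata DB for the task_fail rows of this particular run.
   with create_session() as session:
       failures = (
           session.query(TaskFail)
           .filter(
               TaskFail.dag_id == "my_dag",
               TaskFail.task_id == "my_task",
               TaskFail.run_id == "scheduled__2024-04-03T02:00:00+00:00",
           )
           .order_by(TaskFail.end_date)
           .all()
       )
       for tf in failures:
           print(tf.id, tf.start_date, tf.end_date, tf.duration)
   ```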
   
   When I look at the `log` table, I see another interesting situation:
   |id    |dttm                             |dag_id|task_id|map_index|event       |execution_date                   |owner    |extra|owner_display_name|
   |------|---------------------------------|------|-------|---------|------------|---------------------------------|---------|-----|------------------|
   |624522|2024-04-04 02:00:04.150609 +00:00|my_dag|my_task|         |cli_task_run|                                 |airflow  |{"host_name": "my-airflow-worker-5584f657cc-r6d62", "full_command": "['/home/airflow/.local/bin/airflow', 'celery', 'worker']"}|                  |
   |624601|2024-04-04 02:00:15.386536 +00:00|my_dag|my_task|-1       |running     |2024-04-03 02:00:00.000000 +00:00|ownerteam|     |                  |
   |624604|2024-04-04 02:00:15.513710 +00:00|my_dag|my_task|         |cli_task_run|                                 |airflow  |{"host_name": "my-airflow-worker-5584f657cc-r6d62", "full_command": "['/home/airflow/.local/bin/airflow', 'celery', 'worker']"}|                  |
   |624747|2024-04-04 02:01:01.366162 +00:00|my_dag|my_task|-1       |failed      |2024-04-03 02:00:00.000000 +00:00|ownerteam|     |                  |
   |624750|2024-04-04 02:01:01.679905 +00:00|my_dag|my_task|-1       |failed      |2024-04-03 02:00:00.000000 +00:00|ownerteam|     |                  |
   |624756|2024-04-04 02:01:03.810450 +00:00|my_dag|my_task|-1       |failed      |2024-04-03 02:00:00.000000 +00:00|ownerteam|     |                  |
   
   Not sure if I'm interpreting this correctly, but it seems like there is a single `running` event but three `failed` records, each with a slightly different timestamp.
   
   The primary hint I have on my end is that this happened while the worker was under memory pressure and was killed shortly after these logs, which suggests there is something in the worker code that doesn't handle such failures gracefully and causes the scheduler not to schedule the follow-up retry.

