wolfier commented on issue #40435:
URL: https://github.com/apache/airflow/issues/40435#issuecomment-2274296036

   I'll start with the successful task / trigger execution. The task exited 
right away with return code 100 (task deferral), and the trigger fired as 
expected once its conditions were met.
   
   ```
   [2024-08-03, 10:35:20 UTC] {{standard_task_runner.py:60}} INFO - Started 
process 3875 to run task
   ...
   [2024-08-03, 10:35:21 UTC] {{taskinstance.py:2344}} INFO - Pausing task as 
DEFERRED. dag_id=eod_jobs, task_id=task_123, execution_date=20240802T001500, 
start_date=20240803T103519
   [2024-08-03, 10:35:21 UTC] {{local_task_job_runner.py:231}} INFO - Task 
exited with return code 100 (task deferral)
   ...
   [2024-08-03, 10:37:23 UTC] {{triggerer_job_runner.py:602}} INFO - Trigger 
eod_jobs/scheduled__2024-08-02T00:15:00+00:00/task_123/-1/2 (ID 5992) fired: 
TriggerEvent<{'status': 'success', 'job_id': 'JOB_UUID'}>
   ```
   
   The failed task instance did not exit on its own after deferring, so it had 
no return code. Given these logs, the reported behaviour is expected.
   
   ```
   [2024-08-03, 00:46:56 UTC] {{standard_task_runner.py:60}} INFO - Started 
process 26709 to run task
   ...
   [2024-08-03, 00:47:01 UTC] {{taskinstance.py:2344}} INFO - Pausing task as 
DEFERRED. dag_id=eod_jobs, task_id=task_123, execution_date=20240802T001500, 
start_date=20240803T004656
   ...
   [2024-08-03, 00:47:23 UTC] {{process_utils.py:131}} INFO - Sending 15 to 
group 26709. PIDs of all processes in the group: [26709]
   [2024-08-03, 00:47:24 UTC] {{process_utils.py:86}} INFO - Sending the signal 
15 to group 26709
   [2024-08-03, 00:47:44 UTC] {{process_utils.py:79}} INFO - Process 
psutil.Process(pid=26709, status='terminated', exitcode=100, 
started='00:46:56') (26709) terminated with exit code 100
   ```
   
   Surprisingly, the exit code was 100. 
   
   We know that two consecutive calls of `heartbeat_callback` saw a 
`task_runner` return code of `None` while the task instance state was 
`deferred`, starting at `00:47:01`. We do not know what happened during the 
22 seconds before SIGTERM (signal 15) was sent at `00:47:23`.
   
   I suspect one of the following happened:
   * The execute function completed and `return_code` was set to 100, but the 
process did not exit. It may have been stuck at `os._exit(return_code)`. When 
the process was then forced to exit via reaping, which calls 
`os.killpg(process_group_id, sig)`, the `returncode` for the process was 100.
   * The execute function did not complete. I am not sure where it could have 
been stuck.
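
To illustrate why an exit code of 100 after a SIGTERM is surprising, here is a minimal sketch using only the Python standard library. It is not Airflow code, and the child programs and the `run_child` helper are hypothetical stand-ins. It relies on the fact that `subprocess` reports a child killed by a signal's default action as a negative signal number, while a child that calls `os._exit(100)` itself, directly or from a signal handler, reports 100.

```python
import signal
import subprocess
import sys

# Hypothetical stand-ins for a task process; none of this is Airflow code.
EXITS_100 = "import os; os._exit(100)"
NO_HANDLER = "import time; print('ready', flush=True); time.sleep(30)"
HANDLER_EXITS_100 = (
    "import os, signal, time\n"
    "signal.signal(signal.SIGTERM, lambda signum, frame: os._exit(100))\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def run_child(source, send_sigterm):
    """Run `source` in a child interpreter.

    Returns (return code seen just before signalling, final return code).
    The first element is None when no signal is sent.
    """
    proc = subprocess.Popen([sys.executable, "-c", source], stdout=subprocess.PIPE)
    seen_while_running = None
    if send_sigterm:
        proc.stdout.readline()            # block until the child prints "ready"
        seen_while_running = proc.poll()  # child is sleeping, so no return code yet
        proc.send_signal(signal.SIGTERM)
    final = proc.wait(timeout=30)
    proc.stdout.close()
    return seen_while_running, final


if __name__ == "__main__":
    print(run_child(EXITS_100, send_sigterm=False))
    print(run_child(NO_HANDLER, send_sigterm=True))
    print(run_child(HANDLER_EXITS_100, send_sigterm=True))
```

On a POSIX system, the child with no handler should end with `-signal.SIGTERM`, while the other two end with 100. If the same convention applies to the process in the logs above, an exit code of 100 suggests that the process reached an exit call with 100 at some point rather than being killed outright by the signal, which would be consistent with the first possibility but does not prove it.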
   

