wolfier commented on issue #40435:
URL: https://github.com/apache/airflow/issues/40435#issuecomment-2274296036
I'll start with the successful task / trigger execution. The task exited
right away with return code 100 (task deferral), and the trigger fired as
expected once its conditions were met.
```
[2024-08-03, 10:35:20 UTC] {{standard_task_runner.py:60}} INFO - Started
process 3875 to run task
...
[2024-08-03, 10:35:21 UTC] {{taskinstance.py:2344}} INFO - Pausing task as
DEFERRED. dag_id=eod_jobs, task_id=task_123, execution_date=20240802T001500,
start_date=20240803T103519
[2024-08-03, 10:35:21 UTC] {{local_task_job_runner.py:231}} INFO - Task
exited with return code 100 (task deferral)
...
[2024-08-03, 10:37:23 UTC] {{triggerer_job_runner.py:602}} INFO - Trigger
eod_jobs/scheduled__2024-08-02T00:15:00+00:00/task_123/-1/2 (ID 5992) fired:
TriggerEvent<{'status': 'success', 'job_id': 'JOB_UUID'}>
```
The failed task instance did not exit, and therefore did not have a return
code. Given the logs, the reported behaviour is therefore expected.
```
[2024-08-03, 00:46:56 UTC] {{standard_task_runner.py:60}} INFO - Started
process 26709 to run task
...
[2024-08-03, 00:47:01 UTC] {{taskinstance.py:2344}} INFO - Pausing task as
DEFERRED. dag_id=eod_jobs, task_id=task_123, execution_date=20240802T001500,
start_date=20240803T004656
...
[2024-08-03, 00:47:23 UTC] {{process_utils.py:131}} INFO - Sending 15 to
group 26709. PIDs of all processes in the group: [26709]
[2024-08-03, 00:47:24 UTC] {{process_utils.py:86}} INFO - Sending the signal
15 to group 26709
[2024-08-03, 00:47:44 UTC] {{process_utils.py:79}} INFO - Process
psutil.Process(pid=26709, status='terminated', exitcode=100,
started='00:46:56') (26709) terminated with exit code 100
```
Surprisingly, the exit code was 100.
We know that two consecutive calls of `heartbeat_callback` saw that the
`task_runner` return code was `None` while the task instance state was
`deferred`, starting at `00:47:01`. We do not know what happened during the
22 seconds before signal 15 was sent at `00:47:23`.
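As a quick sanity check on the timeline, the gaps can be computed directly from the timestamps in the failed task's log above. This is only arithmetic on the logged times; the variable names are mine and nothing here comes from Airflow itself.

```python
from datetime import datetime

# Timestamp format used in the log lines above (timezone suffix omitted).
FMT = "%Y-%m-%d, %H:%M:%S"

deferred = datetime.strptime("2024-08-03, 00:47:01", FMT)      # "Pausing task as DEFERRED"
sigterm_sent = datetime.strptime("2024-08-03, 00:47:23", FMT)  # "Sending 15 to group 26709"
terminated = datetime.strptime("2024-08-03, 00:47:44", FMT)    # "terminated with exit code 100"

# Time the process stayed alive after the task was marked deferred.
gap_before_signal = (sigterm_sent - deferred).total_seconds()
# Time between the first SIGTERM log line and the process being reported as terminated.
gap_after_signal = (terminated - sigterm_sent).total_seconds()

print(f"deferred -> SIGTERM: {gap_before_signal:.0f}s")
print(f"SIGTERM -> terminated: {gap_after_signal:.0f}s")
```

So the process lingered for 22 seconds after deferral, and then took roughly another 21 seconds to be reported as terminated after the signal was sent.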
I suspect one of the following happened:
* The execute function did complete and `return_code` was set to 100.
However, the process had not exited. It may have gotten stuck at
`os._exit(return_code)`. When the process was forced to exit via reaping, which
calls `os.killpg(process_group_id, sig)`, the `returncode` for the process was
100.
* The execute function did not complete. I am not sure where it could have
been stuck.
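To illustrate why an exit code of 100 is surprising after a SIGTERM, here is a minimal, POSIX-only sketch using plain `subprocess` (not Airflow code). It assumes a child with default signal handling; `run_child` is a hypothetical helper written for this example. Whether the Airflow task process handles SIGTERM itself is not visible in these logs, so this does not settle which of the two cases above applies.

```python
import signal
import subprocess
import sys
import time


def run_child(code: str, send_sigterm: bool) -> int:
    """Start a child Python process and return its exit status (hypothetical helper)."""
    proc = subprocess.Popen([sys.executable, "-c", code])
    if send_sigterm:
        time.sleep(0.5)  # give the child a moment to start sleeping
        proc.send_signal(signal.SIGTERM)
    return proc.wait()


# A child that is killed by SIGTERM while it is still busy.
killed = run_child("import time; time.sleep(30)", send_sigterm=True)

# A child that exits on its own, the way a deferred task is expected to.
exited = run_child("import os; os._exit(100)", send_sigterm=False)

print("killed by SIGTERM:", killed)
print("exited on its own:", exited)
```

With default handling, `subprocess` reports a process killed by a signal as the negative signal number, while a process that reaches `os._exit(100)` reports 100. An exit code of 100 therefore suggests the process got to its own exit path rather than being killed outright by the signal.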