wolfier commented on issue #40435: URL: https://github.com/apache/airflow/issues/40435#issuecomment-2273833594
I don't think a task in the deferred state should be counted as running. Once a task is deferred, it should exit as tasks do for other terminal states like success and failed.

For the [LocalTaskJobRunner](https://github.com/apache/airflow/blob/2.8.1/airflow/jobs/local_task_job_runner.py#L284-L305) to produce the message `external set to deferred`, the task instance must have been in the deferred state for at least [scheduler_zombie_task_threshold](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#scheduler-zombie-task-threshold) seconds (see [source](https://github.com/apache/airflow/blob/2.8.1/airflow/jobs/local_task_job_runner.py#L167-L194)), and the corresponding StandardTaskRunner must not have produced a return code during that time. The source code provides a helpful note that I think describes what happened.

> \# potential race condition, the _run_raw_task commits `success` or other state
> \# but task_runner does not exit right away due to slow process shutdown or any other reasons

The reporter also mentioned in [another comment](https://github.com/apache/airflow/pull/40453#issuecomment-2210569574) that their scheduler may not be well resourced. However, this statement cannot be confirmed without the actual resource utilization graph.

> Also, I have a question related to the underlying issue. I've noticed this happens if a lot of deferred tasks are submitted at once. The server I'm running this on is very small (2 CPU, little memory). I think this means the timing of the heartbeat is messed up by a starved CPU.

Regardless, it would be beneficial to see what happened to the task after it was marked as deferred. The task log should tell us the exact timestamp. Comparing that with the time at which the LocalTaskJobRunner terminated the task would at least give us a clearer view of the task/process lifecycle. I also think the full task log would help the investigation, as the terminal code may provide additional context.
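To make the timestamp comparison concrete, here is a minimal sketch of the arithmetic. It is not Airflow code: the helper names `shutdown_gap` and `exceeds_threshold` are hypothetical, the timestamps are placeholders to be copied by hand from the real task log, and the threshold must be whatever `scheduler_zombie_task_threshold` is set to in the deployment being investigated.

```python
from datetime import datetime, timedelta


def shutdown_gap(deferred_at: str, terminated_at: str) -> timedelta:
    """Return how long the task process outlived its move to deferred.

    Both arguments are ISO 8601 strings taken from the task log
    (hypothetical helper, not part of Airflow). Use the same timezone
    style for both values, since naive and aware datetimes cannot be
    compared.
    """
    start = datetime.fromisoformat(deferred_at)
    end = datetime.fromisoformat(terminated_at)
    if end < start:
        raise ValueError("termination timestamp precedes the deferral timestamp")
    return end - start


def exceeds_threshold(gap: timedelta, threshold_seconds: float) -> bool:
    """Return True if the gap is at least the configured threshold."""
    return gap.total_seconds() >= threshold_seconds


# Placeholder timestamps; replace them with the values from the task log.
gap = shutdown_gap("2024-08-07T10:00:00+00:00", "2024-08-07T10:05:30+00:00")
print(gap.total_seconds(), exceeds_threshold(gap, threshold_seconds=300))
```

If the gap turns out to be well below the configured threshold, the explanation above would need revisiting; if it is above, that would be consistent with the task runner process lingering after the state change.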
I suspect the issue is something to do with the StandardTaskRunner failing to exit for some reason. Considering deferred as a "running" state would probably let the LocalTaskJobRunner and StandardTaskRunner run indefinitely.
