wolfier commented on issue #40435:
URL: https://github.com/apache/airflow/issues/40435#issuecomment-2273833594

   I don't think a task in the deferred state should be counted as running. 
Once a task is deferred, it should exit as tasks do for other terminal states 
like success and failed. For the 
[LocalTaskJobRunner](https://github.com/apache/airflow/blob/2.8.1/airflow/jobs/local_task_job_runner.py#L284-L305)
 to produce the message `external set to deferred`, the task instance must have 
been in the deferred state for at least 
[scheduler_zombie_task_threshold](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#scheduler-zombie-task-threshold)
 seconds (see 
[source](https://github.com/apache/airflow/blob/2.8.1/airflow/jobs/local_task_job_runner.py#L167-L194))
 and the corresponding StandardTaskRunner did not produce a return code during 
that time. The source code includes a helpful note that I think describes what 
happened.
   > \# potential race condition, the _run_raw_task commits `success` or other state
   > \# but task_runner does not exit right away due to slow process shutdown or any other reasons
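
To make the race described in that note concrete, here is a toy model of a heartbeat check that waits one extra heartbeat before terminating. It is only a sketch of the behaviour described in this comment, not the Airflow source; `HeartbeatModel`, its fields, and the plain string states are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatModel:
    """Toy stand-in for a job runner's heartbeat check (not Airflow code)."""

    state_change_checks: int = 0
    terminating: bool = False

    def heartbeat(self, ti_state: str, return_code: Optional[int]) -> bool:
        """Return True if this heartbeat decides to terminate the task process."""
        if ti_state == "running":
            # State matches what the runner expects; nothing to do.
            return False
        if return_code is not None:
            # The task process already exited, so normal exit handling applies.
            return False
        if self.state_change_checks >= 1:
            # The mismatch was already seen on an earlier heartbeat and the
            # process still has not exited: give up waiting and terminate.
            self.terminating = True
            return True
        # First time the mismatch is seen: allow for a slow process shutdown.
        self.state_change_checks += 1
        return False


model = HeartbeatModel()
print(model.heartbeat("deferred", None))  # first mismatch, only counted
print(model.heartbeat("deferred", None))  # still no return code, terminate
```

In this model a task process that exits promptly after the state change is never terminated, which is why the missing return code matters.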
   
   The reporter also mentioned in [another 
comment](https://github.com/apache/airflow/pull/40453#issuecomment-2210569574) 
that their scheduler may not be well resourced. However, this cannot be 
confirmed without the actual resource utilization graphs.
   
   > Also, I have a question related to the underlying issue. I've noticed this 
happens if a lot of deferred tasks are submitted at once. The server I'm 
running this on is very small (2 CPU, little memory). I think this is means the 
timing of the heart beat is messed up by a straved CPU.
   
   Regardless, it would be beneficial to see what happened to the task after it 
was marked as deferred. The task log should give us the exact timestamp. 
Comparing that with when the LocalTaskJobRunner terminated the task would at 
least give us a clearer view of the task/process lifecycle. I also think the 
full task log would help the investigation, as the exit code may provide 
additional context.
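
A minimal sketch of that timestamp comparison, assuming the two timestamps have been copied by hand from the task log. The timestamp values, the format string, and the `threshold_seconds` value are placeholders for illustration, not taken from the reporter's logs or configuration.

```python
from datetime import datetime


def seconds_between(deferred_at: str, terminated_at: str) -> float:
    """Seconds elapsed between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(deferred_at, fmt)
    end = datetime.strptime(terminated_at, fmt)
    return (end - start).total_seconds()


# Placeholder values: substitute the timestamps found in the real task log.
deferred_at = "2024-01-01 12:00:00"    # when the task was marked deferred
terminated_at = "2024-01-01 12:05:30"  # when the runner terminated the task

# Placeholder: substitute the configured scheduler_zombie_task_threshold.
threshold_seconds = 300

gap = seconds_between(deferred_at, terminated_at)
print(f"gap: {gap:.0f}s, at or beyond threshold: {gap >= threshold_seconds}")
```

A gap well beyond the threshold would be consistent with the task process lingering after deferral rather than a brief shutdown delay.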
   
   I suspect the issue has something to do with the StandardTaskRunner failing 
to exit for some reason. Treating deferred as a "running" state would 
probably let the LocalTaskJobRunner and StandardTaskRunner run indefinitely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
