thesuperzapper commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-1563704852

   @ashb @potiuk and others watching, if you experience this issue, __it's because your task is taking more than 6 hours__. In most cases the task will keep running, but you can't see its logs until it finishes or fails.
   
   ---
   
   Now, the question is: why are we losing the logs after 6 hours?  
   I believe it comes down to a slightly tricky combination of Celery configs:
   
   - [`task_acks_late` = 
`True`](https://github.com/apache/airflow/blob/2.6.1/airflow/config_templates/default_celery.py#L51)
   - [`broker_transport_options.visibility_timeout` = 
`21600`](https://github.com/apache/airflow/blob/2.6.1/airflow/config_templates/default_celery.py#L39)
   
   The `visibility_timeout` is the number of seconds Celery will wait for an acknowledgment before it returns a task to the queue (for another worker to pick up). But because `task_acks_late` is `True`, Airflow's Celery workers do not acknowledge a task until AFTER it has finished. Hence, if a task takes more than `visibility_timeout` seconds to finish, Celery will tell another worker to start it.
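
   The interplay can be sketched in a few lines of Python. The values mirror the Airflow 2.6.1 defaults linked above; the redelivery check is a simplification of what the Celery broker actually does, not real Celery code:

```python
# Airflow 2.6.1 Celery defaults (see default_celery.py linked above).
celery_defaults = {
    "task_acks_late": True,
    "broker_transport_options": {"visibility_timeout": 21600},  # 6 hours
}

def broker_redelivers(task_runtime_s: int, config: dict) -> bool:
    """Simplified model: with acks_late, the ack is only sent after the task
    finishes, so the broker redelivers any task whose runtime exceeds
    visibility_timeout."""
    timeout = config["broker_transport_options"]["visibility_timeout"]
    return config["task_acks_late"] and task_runtime_s > timeout

# A 5-hour task is safe; a 7-hour task gets handed to a second worker.
print(broker_redelivers(5 * 3600, celery_defaults))  # False
print(broker_redelivers(7 * 3600, celery_defaults))  # True
```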
   
   When this happens, the new instance of the task will realize it is still running on the old worker (it has not failed, and is even heart-beating the TaskInstance), and will correctly wait (because a task cannot be "running" in two places). But now the Airflow UI only shows the logs for this new "phantom task", which will always be:
   
   ```
   {taskinstance.py:874} INFO - Dependencies not met for <TaskInstance: ... 
[running]>, dependency 'Task Instance Not Running' FAILED: Task is in the 
running state
   {taskinstance.py:874} INFO - Dependencies not met for <TaskInstance: ... 
[running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' 
state which is not a valid state for execution. The task must be cleared in 
order to be run.
   ```
   
   Effectively, this issue is the result of Celery taking matters into its own hands (outside the control of the Airflow scheduler) and telling a new worker to start a task that is still running on the old worker.
   
   In terms of fixes, I am not sure what the implications of changing `task_acks_late` to `False` would be, as it has been set to `True` for most of Airflow's existence ([it was set to `True` in this commit from 2015](https://github.com/apache/airflow/commit/8bd122f7b3fbfdd5b7741ded8813ca0b8c4eccae#diff-ac6d6f745ae19450e4bfbd1087d865b5784294354c885136b97df437460d5f10R22)).
 
   
   But with Airflow's ["heartbeat" checks for zombie tasks](https://airflow.apache.org/docs/apache-airflow/2.6.1/configurations-ref.html#scheduler-zombie-task-threshold), there is probably no need to have Celery re-issuing tasks, and we might benefit from setting `task_acks_late` to `False` and `visibility_timeout` to a much lower number (to detect transport failures in Celery).
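
   For anyone hitting this today, a practical mitigation (not a fix) is to raise the visibility timeout above your longest expected task runtime, using the env var documented in the configuration reference linked below. The 24-hour value here is just an example:

```shell
# Mitigation sketch: redelivery would only trigger after 24 h instead of 6 h.
# Pick a value comfortably larger than your longest-running task.
export AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT=86400
```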
   
   ----
   
   Finally, I want to highlight that Airflow's docs are confusing about what the default `visibility_timeout` is, because [this piece of code will set it to `21600`](https://github.com/apache/airflow/blob/2.6.1/airflow/config_templates/default_celery.py#L39) when [`AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT`](https://airflow.apache.org/docs/apache-airflow/2.6.1/configurations-ref.html#visibility-timeout) is `None`, which is different from Celery's own default of `3600`.
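
   In other words, the fallback logic behaves roughly like this (a paraphrase of the linked code, not the actual Airflow source):

```python
from typing import Optional

CELERY_REDIS_DEFAULT = 3600   # what Celery itself would use for Redis
AIRFLOW_FALLBACK = 21600      # what Airflow substitutes when nothing is set

def effective_visibility_timeout(configured: Optional[int]) -> int:
    """Return the timeout Airflow ends up passing to Celery."""
    if configured is not None:
        return configured
    return AIRFLOW_FALLBACK  # NOT Celery's own 3600-second default

print(effective_visibility_timeout(None))   # 21600
print(effective_visibility_timeout(7200))   # 7200
```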

