rafidka edited a comment on issue #13824:
URL: https://github.com/apache/airflow/issues/13824#issuecomment-1055701303
I did a deep dive into this issue and found the root cause.
**TL;DR** This is related to `watchtower` and was fixed in version 2.0.0.
Airflow 2.2.4 uses that version (see #19907), so if you are on Airflow 2.2.4 or
later, you should be good. If not, you need to force-install `watchtower`
version 2.0.0 or later. Note, however, that `watchtower` changed the
`CloudWatchLogHandler` `__init__` signature, so you will need to update the
relevant code accordingly.
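For reference, here is a hedged sketch of what that signature change looks like. The parameter renames below (`log_group` → `log_group_name`, `stream_name` → `log_stream_name`, and a boto3 client instead of a session) are based on my reading of the watchtower 2.0.0 release notes; verify them against the version you actually install:

```python
# Hypothetical helper illustrating the watchtower 2.x CloudWatchLogHandler
# constructor; the parameter names are assumptions drawn from the 2.0.0
# release notes, not from Airflow's own code.
def make_cloudwatch_handler():
    import boto3
    import watchtower

    return watchtower.CloudWatchLogHandler(
        # watchtower 1.x called this `log_group`
        log_group_name="airflow-task-logs",
        # watchtower 1.x called this `stream_name`
        log_stream_name="my-dag/my-task",
        # watchtower 2.x takes a CloudWatch Logs client rather than a session
        boto3_client=boto3.client("logs"),
    )
```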
---
The issue is caused by a combination of three factors: forking, threading,
and logging. Together they can deadlock the task process while logs are being
flushed after a task finishes execution, leaving the StandardTaskRunner stuck
at [this
line](https://github.com/apache/airflow/blob/10023fdd65fa78033e7125d3d8103b63c127056e/airflow/task/task_runner/standard_task_runner.py#L92).
At that point the task itself has finished (so its state is `success`), but the
process has not exited (and never will, since it is deadlocked), so the
`heartbeat_callback` ends up
[thinking](https://github.com/apache/airflow/blob/10023fdd65fa78033e7125d3d8103b63c127056e/airflow/jobs/local_task_job.py#L186)
that the task state was set externally, issues the following warning, and
then SIGTERMs the task:
```
State of this instance has been externally set to success. Terminating
instance.
```
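The mechanism can be demonstrated with nothing but the standard library. The following is a minimal, hypothetical sketch (POSIX-only), not Airflow's actual code: the lock stands in for the internal lock held by a logging handler's background flushing thread at the moment of the fork.

```python
# Minimal sketch of the fork + threading + logging deadlock (POSIX-only).
# `lock` stands in for the internal lock a log handler's background thread
# holds while flushing; everything here is illustrative.
import os
import threading
import time

lock = threading.Lock()

def background_flush():
    # Simulates a log-flushing thread that is holding the lock mid-operation.
    with lock:
        time.sleep(2)

t = threading.Thread(target=background_flush, daemon=True)
t.start()
time.sleep(0.2)  # make sure the thread has acquired the lock before forking

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child process: only the main thread survives the fork, but the lock was
    # copied in its *locked* state, and the thread that could release it does
    # not exist here. A real handler would block forever; we use a timeout so
    # the sketch terminates.
    acquired = lock.acquire(timeout=1)
    os.write(w, b"1" if acquired else b"0")
    os._exit(0)

os.waitpid(pid, 0)
child_result = os.read(r, 1)
print("child acquired the lock:", child_result == b"1")
```

The child prints `False`: without the timeout, it would hang exactly the way the task process does above.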
Notice that this can cause further issues:
- **Could not read remote logs from log_group**: This error happens when the
necessary log data is missing from CloudWatch. That can easily happen when a
task writes its logs at the end: the SIGTERM interrupts it and no log is
published.
- **Celery command failed on host**: When a SIGTERM is sent, the process
exits with a non-zero code, and Airflow generates this error
[here](https://github.com/apache/airflow/blob/10023fdd65fa78033e7125d3d8103b63c127056e/airflow/executors/celery_executor.py#L98).
- **Tasks executing multiple times**: When the Celery executor is used with
SQS (as in Amazon MWAA, for example), and since Airflow enables Celery's
[`acks_late`
feature](https://docs.celeryproject.org/en/stable/reference/celery.app.task.html#celery.app.task.Task.acks_late)
(see
[here](https://github.com/apache/airflow/blob/10023fdd65fa78033e7125d3d8103b63c127056e/airflow/config_templates/default_celery.py#L43)),
a SIGTERM results in the task message not being deleted from the SQS queue;
after its visibility timeout expires, the message becomes visible again and is
picked up by another worker.
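The redelivery behaviour can be illustrated with a toy in-memory queue. This is a hypothetical simulation of SQS's visibility timeout, not the real SQS or Celery API:

```python
# Toy simulation (no real SQS/Celery) of why acks_late + SIGTERM causes
# duplicate execution: with late acks the message is only deleted after the
# task finishes, so a worker killed mid-task leaves the message to reappear
# once the visibility timeout elapses.
import time

class ToyQueue:
    """Minimal stand-in for an SQS queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}          # id -> body
        self.invisible_until = {}   # id -> timestamp

    def send(self, msg_id, body):
        self.messages[msg_id] = body

    def receive(self):
        now = time.monotonic()
        for msg_id, body in self.messages.items():
            if self.invisible_until.get(msg_id, 0) <= now:
                # Delivered messages become invisible, not deleted.
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        # This is the "ack"; with acks_late it only runs after task success.
        self.messages.pop(msg_id, None)

q = ToyQueue(visibility_timeout=0.1)
q.send("task-1", "run my_dag.my_task")

# Worker A receives the message but is SIGTERMed before acking,
# so q.delete("task-1") never runs:
first = q.receive()

time.sleep(0.15)  # the visibility timeout elapses

# Worker B now sees the same message again -> duplicate execution.
second = q.receive()
print(first, second)
```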
---
## References
- https://github.com/kislyuk/watchtower/pull/139
- https://github.com/kislyuk/watchtower/issues/141
- https://bugs.python.org/issue6721
- https://bugs.python.org/issue40089
- https://bugs.python.org/issue874900