andrew-stein-sp commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2226370708
This is quite the thread...
We really only see the `Airflow Task Timeout Error`s en masse in our
biggest/busiest cluster, and I'm with @potiuk in that I'm not convinced it's
_entirely_ Airflow-core related. But I also think better tracing in Celery
would help a lot of us diagnose and mitigate issues.
And my recent experience backs this up. We run on EKS with the
CeleryKubernetesExecutor and an AWS-managed Redis backend.
About two months ago the "Task Timeout Error" messages had finally ramped up
to a level where someone thought it was worth having me investigate rather than
just retrying their DAGs. The first thing I noticed was an increase in logs like
this:
```
{celery_executor.py:292} INFO - [Try 1 of 3] Task Timeout Error for Task:
(TaskInstanceKey(dag_id='<redacted>', task_id='job_type',
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)).
{celery_executor.py:292} INFO - [Try 2 of 3] Task Timeout Error for Task:
(TaskInstanceKey(dag_id='<redacted>', task_id='job_type',
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)).
{celery_executor.py:292} INFO - [Try 3 of 3] Task Timeout Error for Task:
(TaskInstanceKey(dag_id='<redacted>', task_id='job_type',
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)).
Celery Task ID: TaskInstanceKey(dag_id='<redacted>', task_id='job_type',
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)
{scheduler_job_runner.py:695} INFO - Received executor event with state
failed for task instance TaskInstanceKey(dag_id='<redacted>',
task_id='job_type', run_id='scheduled__2024-04-28T17:00:00+00:00',
try_number=1, map_index=-1)
{scheduler_job_runner.py:732} INFO - TaskInstance Finished:
dag_id=<redacted>, task_id=job_type,
run_id=scheduled__2024-04-28T17:00:00+00:00, map_index=-1, run_start_date=None,
run_end_date=None, run_duration=None, state=queued, executor_state=failed,
try_number=1, max_tries=0, job_id=None, pool=default_pool, queue=default,
priority_weight=30, operator=BranchPythonOperator, queued_dttm=2024-04-28
18:00:01.783823+00:00, queued_by_job_id=57872215, pid=None
{task_context_logger.py:91} ERROR - Executor reports task instance
<TaskInstance: <redacted>.job_type scheduled__2024-04-28T17:00:00+00:00
[queued]> finished (failed) although the task says it's queued. (Info: None)
Was the task killed externally?
```
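For anyone else chasing these, the two knobs that appear to drive that retry loop (based on my reading of `celery_executor.py`, so treat this as an assumption about your version) are `operation_timeout` and `task_publish_max_retries` in the `[celery]` section. A minimal sketch to print what your deployment is actually running with:
```python
# Minimal sketch: print the [celery] settings that (as far as I can tell from
# celery_executor.py) drive the "[Try x of 3] Task Timeout Error" retry loop.
# Run it anywhere Airflow's config is loadable (scheduler pod, worker pod, etc.).
from airflow.configuration import conf

# Seconds the scheduler waits for the broker to acknowledge a publish before
# logging a Task Timeout Error (default 1.0).
print("operation_timeout:", conf.getfloat("celery", "operation_timeout"))

# How many publish attempts are made before the task is reported as failed
# (default 3, which matches the "[Try x of 3]" lines above).
print("task_publish_max_retries:", conf.getint("celery", "task_publish_max_retries"))
```
Raising `operation_timeout` can paper over slow broker round-trips, but it doesn't tell you _why_ they're slow, which is really my point here.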
After quite a bit of digging, I found that our Celery workers had started
hitting the AWS EC2 limit of 1024 packets per second per ENI for DNS queries (we
were running 3 kube-dns pods). However, it wasn't immediately apparent that DNS
queries (even cluster-local ones) were failing on the workers; random Celery
tasks were just timing out for no apparent reason.
(If you need to check if DNS is your problem, go here:
https://repost.aws/knowledge-center/vpc-find-cause-of-failed-dns-queries)
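In case it helps anyone else: the ENA driver exposes per-interface allowance counters through ethtool, and a growing `linklocal_allowance_exceeded` is the tell that the instance is silently dropping packets to link-local services like the VPC DNS resolver. Here's a rough sketch of the kind of check you could run on a worker node (the interface name and the use of Python rather than a one-liner are just illustrative; adapt to your setup):
```python
# Rough sketch: read the ENA *_allowance_exceeded counters from `ethtool -S`.
# Non-zero and growing linklocal_allowance_exceeded / pps_allowance_exceeded
# values mean the instance is dropping packets (DNS included) at the ENI level.
import subprocess

INTERFACE = "eth0"  # primary ENI on many EKS worker AMIs; yours may differ

def read_allowance_counters(interface: str) -> dict[str, int]:
    """Parse `ethtool -S <iface>` and keep only the *_allowance_exceeded counters."""
    output = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in output.splitlines():
        name, _, value = line.strip().partition(":")
        if name.endswith("allowance_exceeded"):
            counters[name.strip()] = int(value.strip())
    return counters

if __name__ == "__main__":
    for name, value in read_allowance_counters(INTERFACE).items():
        print(f"{name}: {value}")
```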
So, I switched us over to `NodeLocalDNS` and these errors have gone way, way
down, but alas, that never completely eliminated them. Maybe Airflow has
childhood trauma...
Anyway, it's that experience that tells me `Airflow Task Timeout Error`s
could be caused by a long list of different things, but it's the lack of good
error logs and tracing that prevents us from knowing the true cause.
On that note, I look forward to the integration of OpenTelemetry tracing
in a future version of Airflow.