andrew-stein-sp commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2226370708

   This is quite the thread...
   
   We really only see the `Airflow Task Timeout Error`s en masse in our biggest/busiest cluster, and I'm with @potiuk in that I'm not convinced it's _entirely_ Airflow-core related. But I also think better tracing in Celery would help a lot of us diagnose and mitigate issues.
   
   And my recent experience backs this up. We run on EKS with the CeleryKubernetesExecutor and an AWS-managed Redis backend.
   About 2 months ago the "Task Timeout Error" messages had finally ramped up to a level where someone thought it was worth me investigating rather than just retrying their DAGs. The first thing I noticed was an increase in logs like this:
   
   ```
   {celery_executor.py:292} INFO - [Try 1 of 3] Task Timeout Error for Task: 
(TaskInstanceKey(dag_id='<redacted>', task_id='job_type', 
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)).
   {celery_executor.py:292} INFO - [Try 2 of 3] Task Timeout Error for Task: 
(TaskInstanceKey(dag_id='<redacted>', task_id='job_type', 
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)).
   {celery_executor.py:292} INFO - [Try 3 of 3] Task Timeout Error for Task: 
(TaskInstanceKey(dag_id='<redacted>', task_id='job_type', 
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)).
   Celery Task ID: TaskInstanceKey(dag_id='<redacted>', task_id='job_type', 
run_id='scheduled__2024-04-28T17:00:00+00:00', try_number=1, map_index=-1)
   {scheduler_job_runner.py:695} INFO - Received executor event with state 
failed for task instance TaskInstanceKey(dag_id='<redacted>', 
task_id='job_type', run_id='scheduled__2024-04-28T17:00:00+00:00', 
try_number=1, map_index=-1)
   {scheduler_job_runner.py:732} INFO - TaskInstance Finished: 
dag_id=<redacted>, task_id=job_type, 
run_id=scheduled__2024-04-28T17:00:00+00:00, map_index=-1, run_start_date=None, 
run_end_date=None, run_duration=None, state=queued, executor_state=failed, 
try_number=1, max_tries=0, job_id=None, pool=default_pool, queue=default, 
priority_weight=30, operator=BranchPythonOperator, queued_dttm=2024-04-28 
18:00:01.783823+00:00, queued_by_job_id=57872215, pid=None
   {task_context_logger.py:91} ERROR - Executor reports task instance 
<TaskInstance: <redacted>.job_type scheduled__2024-04-28T17:00:00+00:00 
[queued]> finished (failed) although the task says it's queued. (Info: None) 
Was the task killed externally?
   ```
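   If I'm reading that code path correctly, the `celery_executor.py` message means the scheduler's attempt to publish the task to the Celery broker timed out, so the task never reached a worker at all, which lines up with `run_start_date=None` and the TaskInstance still sitting in `queued` when the executor reports it failed. The sketch below is only my rough mental model of that publish-with-timeout-and-retry pattern, not Airflow's actual implementation; `publish_to_broker`, `OPERATION_TIMEOUT_S`, and `MAX_PUBLISH_RETRIES` are made-up names standing in for what are, as far as I can tell, `operation_timeout` and `task_publish_max_retries` under `[celery]`.
   
   ```
   import signal
   from typing import Any, Callable

   # Made-up stand-ins for Airflow's [celery] operation_timeout and
   # task_publish_max_retries settings; tune the real ones in airflow.cfg/env.
   OPERATION_TIMEOUT_S = 1
   MAX_PUBLISH_RETRIES = 3


   class PublishTimeout(Exception):
       """Raised when handing a task to the broker takes too long."""


   def _alarm_handler(signum: int, frame: Any) -> None:
       raise PublishTimeout("broker publish timed out")


   def publish_with_retries(publish_to_broker: Callable[[], None]) -> bool:
       """Try to publish a task to the broker, giving up after a few short attempts.

       `publish_to_broker` stands in for the apply_async call; if DNS resolution
       of the broker hostname stalls, every attempt can burn its whole budget.
       """
       for attempt in range(1, MAX_PUBLISH_RETRIES + 1):
           signal.signal(signal.SIGALRM, _alarm_handler)
           signal.alarm(OPERATION_TIMEOUT_S)
           try:
               publish_to_broker()
               return True
           except PublishTimeout:
               print(f"[Try {attempt} of {MAX_PUBLISH_RETRIES}] Task Timeout Error")
           finally:
               signal.alarm(0)
       # The task never reached a worker, which is why the scheduler later sees
       # executor_state=failed while the TaskInstance itself is still queued.
       return False
   ```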
   
   After quite a bit of digging, I found that our Celery workers had started hitting the AWS EC2 limit of 1024 packets per second per ENI for DNS queries (we were running 3 kube-dns pods). However, it wasn't immediately apparent that DNS queries (even cluster-local ones) were failing on the workers; random Celery tasks were just timing out for no apparent reason.
   (If you need to check whether DNS is your problem, go here: https://repost.aws/knowledge-center/vpc-find-cause-of-failed-dns-queries)
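   For anyone who wants a quicker check than the full AWS write-up: if I remember right, the ENA driver exposes per-interface drop counters, and a climbing `linklocal_allowance_exceeded` value means the instance is shedding packets to link-local services (the VPC DNS resolver among them) because it hit that PPS limit. A small sketch, assuming `ethtool` is available on the node and the interface is `eth0` (adjust for your setup):
   
   ```
   import re
   import subprocess

   # Assumes the ENA driver plus the ethtool binary on the node; the interface
   # name is a guess, so adjust it for your instances.
   INTERFACE = "eth0"


   def allowance_exceeded_counters(interface: str = INTERFACE) -> dict[str, int]:
       """Return the ENA 'allowance exceeded' drop counters for an interface."""
       output = subprocess.run(
           ["ethtool", "-S", interface], capture_output=True, text=True, check=True
       ).stdout
       counters = {}
       for line in output.splitlines():
           # Lines look like "     linklocal_allowance_exceeded: 12345"
           match = re.match(r"\s*(\w*allowance_exceeded)\s*:\s*(\d+)", line)
           if match:
               counters[match.group(1)] = int(match.group(2))
       return counters


   if __name__ == "__main__":
       for name, value in allowance_exceeded_counters().items():
           print(f"{name}: {value}")
   ```
   
   Run it on a node hosting your busiest workers; a counter that keeps growing between runs is a pretty strong signal.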
   
   So, I switched us over to `NodeLocalDNS` and these errors have gone way, way down, but alas, that never completely eliminated them. Maybe Airflow has childhood trauma...
   
   Anyway, it's that whole saga that tells me `Airflow Task Timeout Error`s could be caused by a long list of different things, but it's a lack of good error logs and tracing that prevents us from knowing the true cause.
   
   On that note, I look forward to the integration of OpenTelemetry tracing in a future version of Airflow.

