dan-origami edited a comment on issue #16703:
URL: https://github.com/apache/airflow/issues/16703#issuecomment-909191767


    @kaxil @ephraimbuddy Just to give you a bit of an update: I think I have found the actual cause of this.
   
   I noticed that we seem to hit a problem with the limit on Active Tasks on a Celery worker (all our settings here are currently default), so a maximum of 16 per worker and 32 across the whole Airflow setup.
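   
   For reference, those limits correspond to the default `worker_concurrency` (16, in `[celery]`) and `parallelism` (32, in `[core]`) settings. A quick sketch (assuming Airflow 2.x) to confirm what a deployment is actually running with:
   
   ```python
   # Sketch: print the effective concurrency limits from the Airflow config.
   # Assumes Airflow 2.x and a host where airflow.cfg is available.
   from airflow.configuration import conf
   
   print(conf.getint("celery", "worker_concurrency"))  # default 16 per worker
   print(conf.getint("core", "parallelism"))           # default 32 across the setup
   ```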
   
   However, I noticed that when this problem manifests we don't schedule anything, so I started looking into our workers via Flower.
   
   Screenshots are below, but basically we have these fairly big DAGs that run some Spark jobs on a Spark cluster in the same Kubernetes cluster (PySpark, so the driver runs as part of the Airflow worker; we can give more details on the Spark side if you want, but it's BashOperator and not SparkOperator for a number of reasons).
   
   Sometimes the tasks in these DAGs fail, which is picked up by Airflow, and the task is marked as Failed. However, these tasks still sit on the Celery worker as active tasks and are not removed.
   
   We can manually delete them and it works, so the Celery worker itself is still alive and has not crashed. The workers just do not seem to log anything while they are not picking up or running any new tasks. The active PIDs etc. as listed in Flower also seem to match up.
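   
   In case it helps anyone reproduce the cleanup, this is roughly what the manual deletion does. A minimal sketch (assuming Airflow 2.x's Celery app location; the task id is a placeholder) of inspecting and revoking the stuck tasks programmatically instead of through Flower:
   
   ```python
   # Sketch: list active Celery tasks per worker and revoke a stuck one,
   # mirroring what deleting it in Flower does. Assumes Airflow 2.x, where
   # the Celery app lives at airflow.executors.celery_executor.app.
   from airflow.executors.celery_executor import app
   
   inspector = app.control.inspect()
   for worker, tasks in (inspector.active() or {}).items():
       for task in tasks:
           print(worker, task["id"], task["name"])
   
   # "celery-task-id" is a placeholder; terminate=True also kills the process
   # currently executing the task, not just the queue entry.
   app.control.revoke("celery-task-id", terminate=True)
   ```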
   
   It's not clear why the task failed, but we have the logs of it being picked up by the worker (I've removed a few bits; see below).
   
   It also explains why I went down the memory/resource-issue rabbit hole, as these tasks sit around on the worker(s).
   
   There are some parameters we can tune, I think, to add timeouts on the tasks and on the Celery side (a sketch follows below). Do you know of any known issues with this disconnect, where a task is marked Failed in Airflow but is not removed from the Celery worker?
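   
   For concreteness, the timeouts I mean are Airflow's per-task `execution_timeout` and Celery's task time limits. A hedged sketch (task names and values are illustrative, and the Celery dict assumes it gets wired up via `[celery] celery_config_options` in airflow.cfg):
   
   ```python
   # Sketch of the two timeout layers; names and values are illustrative.
   from datetime import timedelta
   from airflow.operators.bash import BashOperator
   
   spark_job = BashOperator(
       task_id="timeseries_api",
       bash_command="spark-submit ...",  # placeholder command
       # Airflow-side: fail (and kill) the task if it runs longer than this.
       execution_timeout=timedelta(hours=2),
   )
   
   # Celery-side: soft/hard time limits, set through a custom config dict that
   # [celery] celery_config_options points at (assumes the default config as a base).
   from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
   
   CELERY_CONFIG = {
       **DEFAULT_CELERY_CONFIG,
       "task_soft_time_limit": 2 * 60 * 60,  # raises SoftTimeLimitExceeded in the task
       "task_time_limit": 3 * 60 * 60,       # hard kill of the worker child process
   }
   ```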
   
   The worker was not rebooted/crashed at any point during this time.
   
   ```
   [2021-08-28 06:54:43,865: INFO/ForkPoolWorker-8] Executing command in Celery: ['airflow', 'tasks', 'run', 'dagname', 'timeseries_api', '2021-08-28T06:18:00+00:00', '--local', '--pool', 'default_pool', '--subdir', 'dagfolder']
   [2021-08-28 06:54:44,122: WARNING/ForkPoolWorker-8] Running <TaskInstance:dagname.timeseries_api 2021-08-28T06:18:00+00:00 [queued]> on host airflow-worker-2.airflow-worker.airflow.svc.cluster.local
   ```
   
   <img width="579" alt="Screenshot 2021-08-31 at 13 16 40" 
src="https://user-images.githubusercontent.com/51330721/131501535-c74321ef-04a1-46ba-b260-032ac52089f0.png";>
   
   <img width="1666" alt="Screenshot 2021-08-31 at 13 17 18" 
src="https://user-images.githubusercontent.com/51330721/131503343-f463ccea-f8f4-45cb-a50d-d9f02b3ba2dc.png";>
   
   <img width="1101" alt="Screenshot 2021-08-31 at 13 18 21" 
src="https://user-images.githubusercontent.com/51330721/131503356-2aafb596-e90e-4a66-b0a4-aaf73073c91d.png";>
   
   <img width="1666" alt="Screenshot 2021-08-31 at 13 18 03" 
src="https://user-images.githubusercontent.com/51330721/131503371-ca565fe7-9bb2-43d2-89fc-2929f89c71d4.png";>
   
   
   

