benrifkind commented on issue #28206:
URL: https://github.com/apache/airflow/issues/28206#issuecomment-1343447057

   Hi @uranusjr. Thanks for your response.
   
   I believe this is an issue with the CeleryExecutor, so I have not tested it 
with any other executors.
   
   I checked and this doesn't seem to be an issue specifically with 
KubernetesPodOperator. I was able to replicate it with the BashOperator.
   
   As for the number of DAGs and tasks: I was able to replicate this with one 
DAG with many tasks. I think the issue occurs when a Celery worker goes down 
unexpectedly while it is still responsible for running tasks. So with one 
Celery worker, running one DAG with a lot of tasks and high concurrency creates 
the problem. Basically, the Celery worker dies or is killed, and once it comes 
back up the scheduler still thinks the tasks are running, so it can't rerun them 
on this restarted worker. Of course, I'm not sure that is what is happening, but 
it's my best guess. I am not sure why restarting the scheduler after clearing 
the tasks fixes the issue.
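
To make that guess concrete, here is a toy model in plain Python. It is not
Airflow or Celery code and says nothing about how either is implemented;
`ToyWorker`, `ToyScheduler` and `simulate` are made-up names, and the idea that
a scheduler restart discards a stale in-memory view is purely an assumption.

```python
# Toy model of the guess above. NOT Airflow or Celery code.
# ToyWorker, ToyScheduler and simulate are hypothetical names.


class ToyWorker:
    def __init__(self):
        # Tasks held only in memory, so they are lost if the worker dies.
        self.running = set()

    def accept(self, task_id):
        self.running.add(task_id)

    def crash_and_restart(self):
        # The restarted worker comes back with no memory of its old tasks.
        self.running = set()


class ToyScheduler:
    def __init__(self, worker):
        self.worker = worker
        # The scheduler's own view of which tasks are currently running.
        self.believed_running = set()

    def schedule(self, task_id):
        # Refuse to send a task the scheduler already believes is running.
        if task_id in self.believed_running:
            return False
        self.believed_running.add(task_id)
        self.worker.accept(task_id)
        return True

    def restart(self):
        # Assumption: restarting throws away the stale view.
        self.believed_running = set()


def simulate(n_tasks=5):
    worker = ToyWorker()
    scheduler = ToyScheduler(worker)
    tasks = [f"task_{i}" for i in range(n_tasks)]

    for task_id in tasks:
        scheduler.schedule(task_id)

    # The worker dies mid-run and comes back empty.
    worker.crash_and_restart()

    # The scheduler still believes every task is running, so nothing is resent.
    resent_before = [t for t in tasks if scheduler.schedule(t)]

    # After a scheduler restart the same tasks can be sent again.
    scheduler.restart()
    resent_after = [t for t in tasks if scheduler.schedule(t)]
    return resent_before, resent_after
```

In this model `simulate()` resends nothing until the scheduler is restarted,
which matches what I see, but it only illustrates the guess and proves nothing
about the real cause.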
   
   Thanks for your help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
