jens-scheffler-bosch commented on issue #33282:
URL: https://github.com/apache/airflow/issues/33282#issuecomment-1673809563

   From the timestamps in the logs themselves I can confirm that parallel execution occurred and is not desired. But with this information alone it is not possible to determine a root cause. As I was also debugging deep in the scheduler code, I strongly suspect some side effect is causing this. Without understanding the root cause it is hard to reason about how to prevent such concurrency in a distributed system w/o a central lock (which is a design feature, not a flaw).
   
   I have some more questions; the answers might help find the root cause:
   - How many schedulers do you run?
   - How many workers do you run?
   - Which setup do you use, Redis+Celery? Which executor?
   - Are you sure that no DB backup, and especially no DB restore, was made around the time of the error?
   - Were any components (scheduler, worker, Redis queue, database) restarted in the timeframe of the error?
   - Is it possible to reproduce this?
   - Did it happen once or multiple times?
   - Would it be possible to share the logs of all workers and schedulers for that time period? Searching them for errors would take a while, but it might be the only conclusive evidence.
   - What are your [scheduler] orphaned_tasks_check_interval and scheduler_zombie_task_threshold settings?
   - How long are the affected tasks running (start/end)?
   - Have you configured any specific [celery_broker_transport_options] visibility_timeout?
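   
   For reference, the settings asked about above live in airflow.cfg roughly as sketched below. This is an illustrative fragment only; the values shown are the usual defaults, not a recommendation:
   
   ```ini
   # airflow.cfg -- illustrative sketch, example values only
   
   [scheduler]
   # How often (in seconds) the scheduler checks for orphaned tasks
   orphaned_tasks_check_interval = 300
   # Tasks without a heartbeat for longer than this (seconds) are treated as zombies
   scheduler_zombie_task_threshold = 300
   
   [celery_broker_transport_options]
   # How long (in seconds) the broker waits before redelivering an unacknowledged task;
   # a value shorter than the longest-running task can cause duplicate execution
   visibility_timeout = 21600
   ```
   
   In particular, a visibility_timeout shorter than the runtime of the affected tasks would be one known way to end up with a task being delivered to two workers.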


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
