jens-scheffler-bosch commented on issue #33282: URL: https://github.com/apache/airflow/issues/33282#issuecomment-1673809563
From the timestamps in the logs I can confirm that the parallel execution happened and is not desired. But this information alone is not enough to discover a root cause. Having also debugged deep in the scheduler code, I strongly suspect some side effect is causing this. Without understanding the root cause, it is hard to reason about how to prevent such concurrency in a distributed system without a central lock (which is a design feature, not a flaw).

I have a few more questions that might help to find the root cause:

- How many schedulers do you run?
- How many workers do you run?
- Which setup: Redis + Celery? Which executors?
- Are you sure that no DB backup, and especially no DB restore, was made while the error occurred?
- Were any components (scheduler, worker, Redis queue, database) restarted in the timeframe of the error?
- Is it possible to reproduce this?
- Did it happen once or multiple times?
- Would it be possible to share the logs of all workers and schedulers for that time period? It would mean a longer search for errors, but it might be the only conclusive evidence.
- What are your `[scheduler]` `orphaned_tasks_check_interval` and `scheduler_zombie_task_threshold` settings?
- How long are the affected tasks running (start/end)?
- Have you configured any specific `[celery_broker_transport_options]` `visibility_timeout`?
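For context on the `visibility_timeout` question: with Celery on a Redis broker, a task that runs longer than the broker's visibility timeout can be redelivered to a second worker while the first is still running, which looks exactly like duplicate execution of the same task instance. A minimal sketch of that condition follows; the 21600 s value mirrors Airflow's documented Redis default, and the function name and numbers are illustrative only, not Airflow or Celery API:

```python
# Sketch: Celery on Redis redelivers a reserved-but-unacknowledged task
# once visibility_timeout elapses. For long-running, late-acked tasks,
# a runtime above the timeout means another worker can pick up the same
# task while the original is still executing.

DEFAULT_VISIBILITY_TIMEOUT_S = 21600  # illustrative; 6 h, Airflow's documented Redis default

def may_be_redelivered(task_runtime_s: float,
                       visibility_timeout_s: float = DEFAULT_VISIBILITY_TIMEOUT_S) -> bool:
    """True if the broker could re-queue the task while it is still running."""
    return task_runtime_s > visibility_timeout_s

# A 7-hour task against a 6-hour timeout is at risk of duplicate execution:
print(may_be_redelivered(7 * 3600))   # → True
print(may_be_redelivered(30 * 60))    # → False
```

That is why the task durations (start/end) and any custom `visibility_timeout` are worth comparing side by side.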
