JCoder01 commented on issue #13099: URL: https://github.com/apache/airflow/issues/13099#issuecomment-748034037
I'm still trying to work through this and am slowly moving dags over from the original environment where the error occurred to the clean environment. While I can't seem to get the database into the troubled state on its own, I can force it into that state by stopping the scheduler and running the statement below. When you then start the scheduler, you get the pk violation.

```sql
update dag
set next_dagrun = (select max(execution_date) from dag_run where dag_id = 'example_task_group')
where dag_id = 'example_task_group2'
```

I think what is happening is this: if you have a very small number of dags, then in the time it takes the scheduler to throw the error, one of the parser processes updates the backend db with the correct `next_dagrun`, and starting the scheduler again works fine. As the number of dags grows, the chance that the problematic dag gets updated before the scheduler shuts down due to the pk violation decreases, so the error persists until you are lucky enough to get the problematic dag parsed.

So while it's not clear how the database gets _into_ this state, would it make sense to add some "self healing" to scheduler startup that re-parses all the dags? Or, rather than bailing, could the scheduler add some error handling so that if a pk violation does arise, the dag is re-parsed and scheduling is retried?
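To make the "catch the pk violation and retry" idea concrete, here is a minimal, self-contained sketch using an in-memory SQLite database. It is **not** Airflow's actual scheduler code — the table shapes, the fixed one-day interval, and the `reparse_dag` helper are all simplifications standing in for what the dag-file processor does when it recomputes `next_dagrun`:

```python
import sqlite3
from datetime import datetime, timedelta

# Simplified stand-ins for Airflow's dag / dag_run tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT, "
    "PRIMARY KEY (dag_id, execution_date))"
)
conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY, next_dagrun TEXT)")

INTERVAL = timedelta(days=1)  # assumed schedule interval for the sketch


def reparse_dag(dag_id):
    """'Re-parse' the dag: recompute next_dagrun from the latest existing
    run, the way the dag-file processor would after parsing the file."""
    (latest,) = conn.execute(
        "SELECT MAX(execution_date) FROM dag_run WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    fixed = (datetime.fromisoformat(latest) + INTERVAL).isoformat()
    conn.execute("UPDATE dag SET next_dagrun = ? WHERE dag_id = ?", (fixed, dag_id))
    return fixed


def schedule(dag_id):
    """Try to create the next dag_run; on a pk violation, self-heal by
    re-deriving next_dagrun and retrying once instead of crashing."""
    (next_dagrun,) = conn.execute(
        "SELECT next_dagrun FROM dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    try:
        conn.execute(
            "INSERT INTO dag_run (dag_id, execution_date) VALUES (?, ?)",
            (dag_id, next_dagrun),
        )
    except sqlite3.IntegrityError:
        # Self-healing path: the run already exists, so the stored
        # next_dagrun is stale. Recompute it and try once more.
        next_dagrun = reparse_dag(dag_id)
        conn.execute(
            "INSERT INTO dag_run (dag_id, execution_date) VALUES (?, ?)",
            (dag_id, next_dagrun),
        )
    return next_dagrun


# Reproduce the broken state: next_dagrun points at a run that already exists.
conn.execute("INSERT INTO dag_run VALUES ('example_task_group2', '2020-12-01T00:00:00')")
conn.execute("INSERT INTO dag VALUES ('example_task_group2', '2020-12-01T00:00:00')")

scheduled = schedule("example_task_group2")
print(scheduled)  # -> 2020-12-02T00:00:00, with no scheduler crash
```

The point of the sketch is only the control flow: the scheduler converts the fatal `IntegrityError` into a one-shot "refresh and retry", which is the self-healing behavior proposed above.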
