JCoder01 commented on issue #13099:
URL: https://github.com/apache/airflow/issues/13099#issuecomment-748034037


   I'm still trying to work through this and am slowly moving dags over from 
the original environment where the error occurred to the clean environment. 
While I can't seem to get the database into the troubled state on its 
own, I can force it into that state by stopping the scheduler and running the 
SQL below. When you then start the scheduler, you get the pk violation. 
   ```
   update dag set next_dagrun = (select max(execution_date) from dag_run where 
dag_id = 'example_task_group')
   where dag_id = 'example_task_group2' 
   ```
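   To illustrate the mechanism (not Airflow's actual code), here is a toy sketch against an in-memory sqlite db with an assumed, heavily simplified schema: `dag_run` keyed on `(dag_id, execution_date)` and a `dag` table with a `next_dagrun` column. It uses a single dag_id so the clash is guaranteed, then does what the scheduler effectively does on startup: insert a run at `next_dagrun`.
   ```python
   import sqlite3

   # Toy stand-in for the Airflow metadata db. Assumed simplified schema:
   # dag_run has a composite primary key on (dag_id, execution_date).
   conn = sqlite3.connect(":memory:")
   conn.execute(
       "CREATE TABLE dag_run (dag_id TEXT, execution_date INTEGER,"
       " PRIMARY KEY (dag_id, execution_date))"
   )
   conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY, next_dagrun INTEGER)")
   conn.execute("INSERT INTO dag VALUES ('example_task_group', NULL)")
   conn.execute("INSERT INTO dag_run VALUES ('example_task_group', 100)")

   # The forced corruption: point next_dagrun at an execution_date that
   # already has a dag_run row for the same dag.
   conn.execute(
       "UPDATE dag SET next_dagrun = (SELECT MAX(execution_date) FROM dag_run"
       " WHERE dag_id = 'example_task_group')"
       " WHERE dag_id = 'example_task_group'"
   )

   # What the scheduler does on startup: create a run at next_dagrun.
   (next_dagrun,) = conn.execute(
       "SELECT next_dagrun FROM dag WHERE dag_id = 'example_task_group'"
   ).fetchone()
   try:
       conn.execute(
           "INSERT INTO dag_run VALUES ('example_task_group', ?)", (next_dagrun,)
       )
       violated = False
   except sqlite3.IntegrityError:
       violated = True

   print(violated)  # True: the startup insert hits the pk violation
   ```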
   I think what is happening is this: with a very small number of dags, in the 
time it takes the scheduler to throw the error, one of the parser processes 
updates the backend db with the correct `next_dagrun`, so starting the 
scheduler again works fine. As the number of dags grows, the chance that the 
problematic dag gets re-parsed before the scheduler shuts down due to the pk 
violation decreases, so the error persists until you are lucky enough to 
get the problematic dag parsed.
   So while it's not clear how the database gets _into_ this state, would it 
make sense to add some "self healing" to scheduler startup that re-parses all 
the dags? Or, rather than bailing, the scheduler could handle a pk violation 
by re-parsing the affected dag and attempting to schedule it again.
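
   A rough sketch of that second idea, again against a toy in-memory sqlite db rather than Airflow's real scheduler (`reparse_and_fix_next_dagrun` is a hypothetical stand-in for whatever the dag parser would write back):
   ```python
   import sqlite3

   def create_dag_run(conn, dag_id, execution_date):
       """Insert a dag_run row; raises sqlite3.IntegrityError on a pk clash."""
       conn.execute(
           "INSERT INTO dag_run (dag_id, execution_date) VALUES (?, ?)",
           (dag_id, execution_date),
       )

   def reparse_and_fix_next_dagrun(conn, dag_id):
       """Hypothetical 'reparse' step: recompute next_dagrun as one past the
       latest existing run, mimicking what the parser would write back."""
       (latest,) = conn.execute(
           "SELECT MAX(execution_date) FROM dag_run WHERE dag_id = ?", (dag_id,)
       ).fetchone()
       return latest + 1

   def schedule_with_self_healing(conn, dag_id, next_dagrun):
       """Proposed handling: on a pk violation, re-derive next_dagrun and
       retry once instead of crashing the scheduler."""
       try:
           create_dag_run(conn, dag_id, next_dagrun)
       except sqlite3.IntegrityError:
           next_dagrun = reparse_and_fix_next_dagrun(conn, dag_id)
           create_dag_run(conn, dag_id, next_dagrun)
       return next_dagrun

   conn = sqlite3.connect(":memory:")
   conn.execute(
       "CREATE TABLE dag_run (dag_id TEXT, execution_date INTEGER,"
       " PRIMARY KEY (dag_id, execution_date))"
   )
   create_dag_run(conn, "example_task_group", 100)  # an existing run
   # next_dagrun was corrupted to point at that existing execution_date:
   scheduled = schedule_with_self_healing(conn, "example_task_group", 100)
   print(scheduled)  # 101: the retry lands on a fresh execution_date
   ```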


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

