ejstembler opened a new issue, #27300: URL: https://github.com/apache/airflow/issues/27300
### Apache Airflow version

Other Airflow 2 version (please specify below)

### What happened

Airflow version: `v2.3.3+astro.2`.

We've encountered this issue twice this year. Something causes the Scheduler to get stuck in an endless loop, yet it is reported as healthy even though nothing is being processed.

The most recent occurrence was this week, when the Scheduler encountered a database update error:

```
sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'dag' expected to update 1 row(s); 0 were matched.
```

As a result, the Scheduler logs show that it is stuck in an endless loop, with the same messages repeating over and over. Because of this, nothing runs, and the entire Airflow instance is considered down.

In this particular case, the issue was resolved by manually deleting the duplicate row in the `dag` table. When we encountered a similar case earlier in the year, the root cause was different and required a different solution (upsizing workers).

### What you think should happen instead

The Scheduler should not crash or get stuck in an endless loop. It should handle exceptional cases gracefully. It should not be reported as healthy if it is crashing continuously or stuck in an endless loop.

Some strategies for handling this, off the top of my head:

* The Scheduler should have stricter error handling: when an error is encountered, it should log the error and continue on to the next scheduled DAG.
* The Scheduler itself should not be allowed to get into an endless loop.
  * Check the logs for repeating message patterns?
  * Keep a count to make sure DAGs are being run?
  * Use logarithmic or exponential backoff when retrying?

### How to reproduce

Enter a duplicate row in the `dag` table. There are probably other ways. Earlier in the year we encountered this same issue when Workers were not properly upsized.
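To make the strategies suggested under "What you think should happen instead" more concrete, here is a minimal Python sketch. It is not Airflow's scheduler code: `schedule_one`, `run_scheduling_pass`, `scheduler_loop`, and `max_idle_passes` are hypothetical names used only for illustration. It also assumes that a pass in which no DAG was scheduled successfully counts as "no progress", which a real scheduler would need to distinguish from "nothing was due".

```python
import logging
import time

log = logging.getLogger("scheduler_sketch")


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff: base * 2**attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def run_scheduling_pass(dag_ids, schedule_one, failures):
    """Try every DAG once; a failing DAG is logged and skipped.

    `schedule_one` is a hypothetical callable that schedules one DAG.
    `failures` maps dag_id -> number of consecutive failures.
    Returns how many DAGs were scheduled successfully in this pass.
    """
    succeeded = 0
    for dag_id in dag_ids:
        try:
            schedule_one(dag_id)
        except Exception:
            failures[dag_id] = failures.get(dag_id, 0) + 1
            log.exception("Scheduling %s failed (%d in a row)",
                          dag_id, failures[dag_id])
            continue  # move on to the next DAG instead of aborting the pass
        failures.pop(dag_id, None)
        succeeded += 1
    return succeeded


def scheduler_loop(dag_ids, schedule_one, max_idle_passes=3,
                   max_passes=None, sleep=time.sleep):
    """Run passes until `max_passes` is reached or no progress is made.

    Returns False (unhealthy) after `max_idle_passes` consecutive passes
    without a single success, backing off exponentially between them.
    Returns True if `max_passes` passes complete without hitting that limit.
    """
    failures = {}
    idle_passes = 0
    passes = 0
    while max_passes is None or passes < max_passes:
        passes += 1
        if run_scheduling_pass(dag_ids, schedule_one, failures) > 0:
            idle_passes = 0
            continue
        idle_passes += 1
        if idle_passes >= max_idle_passes:
            log.error("No DAG scheduled in %d passes; reporting unhealthy",
                      idle_passes)
            return False
        sleep(backoff_delay(idle_passes - 1))
    return True
```

The point of the sketch is that health is derived from progress (a count of successfully scheduled DAGs) rather than from the process merely being alive, and that repeated failures slow the loop down instead of repeating the same log messages at full speed.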
### Operating System

Debian GNU/Linux 11 (bullseye)

### Versions of Apache Airflow Providers

[apache-airflow-providers-http](https://pypi.python.org/pypi/apache-airflow-providers-http)==2.0.1
[apache-airflow-providers-jdbc](https://pypi.python.org/pypi/apache-airflow-providers-jdbc)==2.0.1
[simple-salesforce](https://pypi.python.org/pypi/simple-salesforce)==1.1.0
[csvvalidator](https://pypi.python.org/pypi/csvvalidator)==1.2
[pandas](https://pypi.python.org/pypi/pandas)==1.3.5
[pre-commit](https://pypi.python.org/pypi/pre-commit)
[pylint](https://pypi.python.org/pypi/pylint)==2.15
[pytest](https://pypi.python.org/pypi/pytest)==6.2.5
[pyspark](https://pypi.python.org/pypi/pyspark)==3.3.0
[apache-airflow-providers-google](https://pypi.python.org/pypi/apache-airflow-providers-google)==6.4.0

### Deployment

Astronomer

### Deployment details

Astronomer

### Anything else

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
