ejstembler opened a new issue, #27300:
URL: https://github.com/apache/airflow/issues/27300

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   Airflow version: `v2.3.3+astro.2`.
   
   We've encountered this issue twice this year.  Something causes the Scheduler 
to get stuck in an endless loop, yet it shows as healthy even though nothing is 
being processed.
   
   The most recent occurrence was this week. The Scheduler encountered 
a database update error:
   
   ```
   sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'dag' expected 
to update 1 row(s); 0 were matched.
   ```
   
   As a result, the Scheduler logs show it's stuck in an endless loop; the 
same messages repeat over and over.
   
   
![Screen_Shot_2022-10-24_at_10_25_21_AM](https://user-images.githubusercontent.com/45985338/198082454-7afce5e9-81c6-4f0a-9509-f99d591ede3e.png)
   
   Because of this, nothing runs, and the entire Airflow instance is considered 
down.
   
   In this particular case, the issue was resolved by manually deleting the 
duplicate row in the `dag` table.
   
   When we encountered a similar case earlier in the year, the root cause was 
different and required a different solution (upsizing the Workers).
   
   ### What you think should happen instead
   
   The Scheduler should not crash or get stuck in an endless loop.  It should 
handle exceptional cases gracefully. It should not be reported as healthy if it 
is crashing continuously or stuck in an endless loop.
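
   To illustrate the last point, here is a minimal sketch of a health check that requires actual progress, not just a heartbeat. It is not Airflow code: `ProgressHealth`, its method names, and both thresholds are hypothetical, and it assumes the loop can call `record_progress()` whenever it really schedules something.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProgressHealth:
    """Hypothetical health tracker: healthy only if the loop is both
    alive (recent heartbeat) and doing useful work (recent progress)."""

    max_heartbeat_age: float = 30.0   # seconds; assumed threshold
    max_progress_age: float = 300.0   # seconds; assumed threshold
    clock: Callable[[], float] = time.monotonic
    last_heartbeat: float = field(init=False)
    last_progress: float = field(init=False)

    def __post_init__(self) -> None:
        start = self.clock()
        self.last_heartbeat = start
        self.last_progress = start

    def heartbeat(self) -> None:
        """Called on every loop iteration, even if nothing was scheduled."""
        self.last_heartbeat = self.clock()

    def record_progress(self) -> None:
        """Called only when work was actually scheduled."""
        self.last_progress = self.clock()

    def is_healthy(self) -> bool:
        now = self.clock()
        alive = now - self.last_heartbeat <= self.max_heartbeat_age
        progressing = now - self.last_progress <= self.max_progress_age
        return alive and progressing
```

   A loop that keeps heartbeating while repeating the same failure would be reported as unhealthy once `max_progress_age` elapses. The trade-off is that a genuinely idle instance with nothing due would also trip the check, so a real implementation would need to distinguish "nothing to do" from "stuck".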
   
   Some strategies for handling this, off the top of my head:
   
   * The Scheduler should have stricter error handling: when an error is 
encountered, it should log the error and continue on to the next scheduled DAG.
   * The Scheduler itself should not be allowed to get into an endless loop.
     * Check the logs for repeating message patterns?
     * Keep a count to make sure DAGs are being run?
     * Use logarithmic or exponential backoff when retrying?
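
   The strategies above can be sketched roughly as follows. This is an illustration only, not how the Airflow Scheduler is implemented: `run_scheduling_pass`, `schedule_one`, and `backoff_delay` are hypothetical names, and the base delay and cap are arbitrary. Each DAG is attempted in isolation, a failure is logged and retried later with exponential backoff, and the returned count can be used to verify that DAGs are actually being scheduled.

```python
import logging
import time
from typing import Callable, Dict, Iterable

log = logging.getLogger("scheduler_sketch")


def backoff_delay(failures: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**(failures - 1), capped at `cap` seconds."""
    if failures <= 0:
        return 0.0
    return min(cap, base * 2 ** (failures - 1))


def run_scheduling_pass(
    dag_ids: Iterable[str],
    schedule_one: Callable[[str], None],  # hypothetical per-DAG scheduling step
    failures: Dict[str, int],             # consecutive failure count per DAG
    next_attempt: Dict[str, float],       # earliest retry time per DAG
    now: Callable[[], float] = time.monotonic,
) -> int:
    """Try each DAG once; log failures and move on. Returns how many succeeded."""
    scheduled = 0
    for dag_id in dag_ids:
        if now() < next_attempt.get(dag_id, 0.0):
            continue  # still backing off after an earlier failure
        try:
            schedule_one(dag_id)
        except Exception:
            # One bad DAG must not block the rest of the pass.
            failures[dag_id] = failures.get(dag_id, 0) + 1
            delay = backoff_delay(failures[dag_id])
            next_attempt[dag_id] = now() + delay
            log.exception("Scheduling %s failed; retrying in %.0fs", dag_id, delay)
            continue
        # Success: clear any backoff state for this DAG.
        failures.pop(dag_id, None)
        next_attempt.pop(dag_id, None)
        scheduled += 1
    return scheduled
```

   With this shape, a `StaleDataError` on one DAG would be logged and retried with growing delays while the other DAGs keep running, instead of the same messages repeating in a tight loop.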
   
   
   ### How to reproduce
   
   Insert a duplicate row into the `dag` table.  There are probably other ways.  
Earlier in the year we encountered this same issue when Workers were not 
properly upsized.
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   
   [apache-airflow-providers-http](https://pypi.python.org/pypi/apache-airflow-providers-http)==2.0.1
   [apache-airflow-providers-jdbc](https://pypi.python.org/pypi/apache-airflow-providers-jdbc)==2.0.1
   [simple-salesforce](https://pypi.python.org/pypi/simple-salesforce)==1.1.0
   [csvvalidator](https://pypi.python.org/pypi/csvvalidator)==1.2
   [pandas](https://pypi.python.org/pypi/pandas)==1.3.5
   [pre-commit](https://pypi.python.org/pypi/pre-commit)
   [pylint](https://pypi.python.org/pypi/pylint)==2.15
   [pytest](https://pypi.python.org/pypi/pytest)==6.2.5
   [pyspark](https://pypi.python.org/pypi/pyspark)==3.3.0
   [apache-airflow-providers-google](https://pypi.python.org/pypi/apache-airflow-providers-google)==6.4.0
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   Astronomer
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

