ericpollmann opened a new issue #21082:
URL: https://github.com/apache/airflow/issues/21082


   ### Apache Airflow version
   
   2.2.3 (latest released)
   
   ### What happened
   
   DAGs running on Airflow 2.2.3 run without issue in normal conditions, but 
when scheduler gets heavily loaded, it hangs with the following error:
   
   <img width="1345" alt="Screen Shot 2022-01-24 at 7 57 43 PM" 
src="https://user-images.githubusercontent.com/7079390/150908258-2a60b98f-8d35-4a71-9d96-06d8efa5b3f7.png";>
   
   In this condition the scheduler hung and did not schedule or run any more 
tasks, causing scheduled pipelines to back up for hours until detected by 
monitoring and resolved by human intervention (restarting the scheduler seems 
to work).
   
   We were able to reproduce the error locally and dump op.params - it was 
equal to the config that the DAG run was triggered with - a standard python 
dictionary with string keys and values.
   
   ### What you expected to happen
   
   No DAG serialization error and scheduler does not hang.
   
   ### How to reproduce
   
   Unfortunately this was challenging: there were no errors or hangs during low 
load or normal conditions, they only appeared when the scheduler was very 
heavily loaded (i.e. many thousands of DAG runs per hour)
   
   This was reproducible under Kubernetes (Debian GNU/Linux 10 (buster)) and 
locally (MacOS)
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Single scheduler instance, 4-8 cores the issue was easily reproducible with 
our load (thousands of DAG runs per hour) but with increased scheduler 
resources (3 instances of similar size) the issue was not easy to reproduct 
(error only flickers in briefly, doesn't hang).
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to