ashb opened a new pull request #15112:
URL: https://github.com/apache/airflow/pull/15112
Closes #7935, #15037 (I hope!)
There have been long-standing issues where the scheduler would "stop
responding" that we haven't been able to track down.
Someone was able to catch the scheduler in this state in 2.0.1 and inspect
it with py-spy (thanks, MatthewRBruce!)
The stack traces (slightly shortened) were:
```
Process 6: /usr/local/bin/python /usr/local/bin/airflow scheduler
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:411)
    send (multiprocessing/connection.py:206)
    send_callback_to_execute (airflow/utils/dag_processing.py:283)
    _send_dag_callbacks_to_processor (airflow/jobs/scheduler_job.py:1795)
    _schedule_dag_run (airflow/jobs/scheduler_job.py:1762)

Process 77: airflow scheduler -- DagFileProcessorManager
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:405)
    send (multiprocessing/connection.py:206)
    _run_parsing_loop (airflow/utils/dag_processing.py:698)
    start (airflow/utils/dag_processing.py:596)
```
What this shows is that both processes are stuck trying to send data to
each other, and neither can proceed: both send buffers are full, and since
each side is blocked in its own send, neither will ever read and free up
space in its buffer. A classic deadlock!
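To make the failure mode concrete, here is a minimal, self-contained
reproduction (a hypothetical demo, not code from this PR or from Airflow):
two processes flood both ends of a `multiprocessing.Pipe` without ever
reading, and both end up blocked in `send()`.

```python
# demo_deadlock.py -- hypothetical reproduction, not Airflow code.
import multiprocessing
import time


def flood(conn, label):
    """Send forever without ever reading, like both sides in the traces above."""
    payload = b"x" * 65536  # large chunks so the pipe buffer fills quickly
    n = 0
    while True:
        conn.send(payload)  # blocks for good once the peer's buffer is full
        n += 1
        print(f"{label}: sent {n} chunks")


if __name__ == "__main__":
    end_a, end_b = multiprocessing.Pipe()  # duplex: both ends can send
    procs = [
        multiprocessing.Process(target=flood, args=(end_a, "scheduler"), daemon=True),
        multiprocessing.Process(target=flood, args=(end_b, "manager"), daemon=True),
    ]
    for p in procs:
        p.start()
    time.sleep(5)
    # Output stops after a handful of chunks, yet both processes are still
    # alive: each is stuck inside send() waiting for the other to read.
    print("still alive:", [p.is_alive() for p in procs])
    for p in procs:
        p.terminate()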
The fix for this is twofold:
1) Enable non-blocking IO on the DagFileProcessorManager side.
   The only thing the Manager now (as of 2.0) sends back up the pipe is the
   DagParsingStat object, and the scheduler will happily continue without
   receiving these. So when a send would block, it is better to simply
   ignore the error, continue the loop, and try sending again later (see
   the first sketch after this list).
2) Reduce the size of DagParsingStat.
   With a large number of DAG files, we included the full path of each and
   every one in _each_ parsing stat. The scheduler did nothing with this
   field, so the stat was far larger than it needed to be, and such a large
   object greatly increases the likelihood of hitting this send-buffer-full
   deadlock (see the second sketch after this list).
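For (1), the approach is to put the manager's end of the pipe into
non-blocking mode and treat a full buffer as "skip this stat". A minimal
sketch of the pattern on POSIX, where `signal_conn` and `parsing_stat` are
illustrative names, not the exact identifiers in this PR:

```python
import os
from multiprocessing.connection import Connection


def setup_nonblocking(signal_conn: Connection) -> None:
    # Once the underlying fd is non-blocking, send() raises
    # BlockingIOError instead of hanging when the buffer is full.
    os.set_blocking(signal_conn.fileno(), False)


def send_stat_best_effort(signal_conn: Connection, parsing_stat) -> None:
    try:
        signal_conn.send(parsing_stat)
    except BlockingIOError:
        # The scheduler isn't draining the pipe right now. The stat is
        # purely advisory, so drop it and retry on the next loop
        # iteration instead of deadlocking here.
        pass
```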
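For (2), the idea is to stop shipping every file path in every stat and
send only small aggregate values. An illustrative before/after (the exact
fields of the real `DagParsingStat` may differ):

```python
from typing import List, NamedTuple


class DagParsingStatBefore(NamedTuple):
    file_paths: List[str]      # every DAG file path, re-sent with every stat
    done: bool
    all_files_processed: bool


class DagParsingStatAfter(NamedTuple):
    num_file_paths: int        # one integer instead of thousands of paths
    done: bool
    all_files_processed: bool
```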