ashb opened a new pull request #15112:
URL: https://github.com/apache/airflow/pull/15112


   Closes #7935, #15037 (I hope!) 
   
    There have been long-standing issues where the scheduler would "stop
    responding" that we haven't been able to track down.
   
    Someone was able to catch the scheduler in this state in 2.0.1 and inspect
    it with py-spy (thanks, MatthewRBruce!)
   
    The stack traces (slightly shortened) were:
   
    ```
    Process 6: /usr/local/bin/python /usr/local/bin/airflow scheduler
    Python v3.8.7 (/usr/local/bin/python3.8)
    Thread 0x7FF5C09C8740 (active): "MainThread"
       _send (multiprocessing/connection.py:368)
       _send_bytes (multiprocessing/connection.py:411)
       send (multiprocessing/connection.py:206)
       send_callback_to_execute (airflow/utils/dag_processing.py:283)
       _send_dag_callbacks_to_processor (airflow/jobs/scheduler_job.py:1795)
       _schedule_dag_run (airflow/jobs/scheduler_job.py:1762)

    Process 77: airflow scheduler -- DagFileProcessorManager
    Python v3.8.7 (/usr/local/bin/python3.8)
    Thread 0x7FF5C09C8740 (active): "MainThread"
       _send (multiprocessing/connection.py:368)
       _send_bytes (multiprocessing/connection.py:405)
       send (multiprocessing/connection.py:206)
       _run_parsing_loop (airflow/utils/dag_processing.py:698)
       start (airflow/utils/dag_processing.py:596)
    ```
   
    What this shows is that both processes are stuck trying to send data
    to each other, and neither can proceed: both pipe buffers are full,
    and since each side is blocked in a send, neither will ever read and
    free up space. A classic deadlock!
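
    For anyone who wants to see this failure mode in isolation, here is a
    minimal sketch (toy code, not Airflow's) that reproduces the two-sided
    pipe deadlock with `multiprocessing.Pipe` -- two processes that only
    send and never receive:

    ```python
    import multiprocessing


    def writer(conn, payload):
        # Send forever and never read -- what each side of the
        # scheduler <-> DagFileProcessorManager pipe was effectively doing.
        while True:
            conn.send(payload)


    if __name__ == "__main__":
        scheduler_end, manager_end = multiprocessing.Pipe()  # duplex by default
        payload = b"x" * (64 * 1024)  # larger than a typical pipe buffer

        procs = [
            multiprocessing.Process(target=writer, args=(scheduler_end, payload)),
            multiprocessing.Process(target=writer, args=(manager_end, payload)),
        ]
        for p in procs:
            p.start()

        # Both children block inside send() once the kernel buffers fill;
        # since neither ever reads, neither can make progress.
        for p in procs:
            p.join(timeout=2)
        print("deadlocked:", all(p.is_alive() for p in procs))
        for p in procs:
            p.terminate()
    ```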
   
    The fix for this is twofold:
   
    1) Enable non-blocking IO on the DagFileProcessorManager side.
   
       The only thing the Manager sends back up the pipe (now, as of 2.0)
       is the DagParsingStat object, and the scheduler will happily
       continue without receiving these, so if a send would block it is
       simply better to ignore the error, continue the loop, and try
       sending again later (see the first sketch after this list).
   
    2) Reduce the size of DagParsingStat
   
       With a large number of DAG files we included the full path of each
       and every one in _each_ parsing stat. The scheduler did nothing
       with this field, so the object was larger than it needed to be,
       and such a large object also increases the likelihood of hitting
       this send-buffer-full deadlock case (see the second sketch below).
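
    Here is a minimal sketch of approach 1 (the helper names are assumed
    for illustration; the actual patch in `airflow/utils/dag_processing.py`
    may differ in its details): the pipe's fd is switched to non-blocking
    mode, and a full buffer is treated as "skip this stat and retry on the
    next loop iteration":

    ```python
    import os


    def make_nonblocking(conn):
        # Hypothetical helper: switch the pipe's file descriptor to
        # non-blocking mode so send() raises BlockingIOError instead of
        # hanging when the receiver's buffer is full.
        os.set_blocking(conn.fileno(), False)


    def try_send_stat(conn, parsing_stat):
        # The scheduler tolerates missing DagParsingStat updates, so a
        # full buffer is safe to skip here.
        try:
            conn.send(parsing_stat)
        except BlockingIOError:
            pass  # pipe full: drop this stat, the next loop will retry
    ```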
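
    And a hypothetical before/after for point 2 (field names are for
    illustration only, not the exact Airflow definitions):

    ```python
    from typing import List, NamedTuple


    # Before: the full path of every DAG file travels in every stat,
    # so each pickled message grows with the number of DAG files.
    class DagParsingStatBefore(NamedTuple):
        file_paths: List[str]  # unused by the scheduler
        done: bool
        all_files_processed: bool


    # After: only the count is sent, keeping each message a few bytes.
    class DagParsingStatAfter(NamedTuple):
        num_file_paths: int
        done: bool
        all_files_processed: bool
    ```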
   
   