SakshamKapoor2911 commented on issue #67870:
URL: https://github.com/apache/airflow/issues/67870#issuecomment-4597288328

   Taking this on.
   
   This appears to be a Python `multiprocessing` queue deadlock. Because 
`end()` calls `proc.join()` before draining the remaining items in 
`result_queue`, workers completing tasks block indefinitely on `put()` once the 
OS-level pipe buffer fills up, while the parent scheduler process blocks on 
`join()`.
   
   My proposed fix:
   1. **Draining during join:** Prevent the deadlock in `end()` by continuously 
calling `_read_results()` in a loop with a bounded `proc.join(timeout=0.5)` to 
ensure the pipe buffer is kept clear while waiting for workers to exit.
   2. **Resilience:** Add resilience to `_read_results()` to catch and log 
`OSError`/`EOFError` instead of letting a broken worker pipe propagate an 
unhandled exception, ensuring that the normal reaping mechanism 
(`_check_workers()`) handles the worker exit gracefully.
   3. **Forced Terminate:** Implement `terminate()` to forcefully stop workers 
in case of immediate shutdown.
   
   Will be working on this immediately as I see it's labeled `priority: high` 
until I've reproduced the pipe-fill deadlock and confirmed the fix.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to