SakshamKapoor2911 commented on issue #67870: URL: https://github.com/apache/airflow/issues/67870#issuecomment-4597288328
Taking this on. This appears to be a Python `multiprocessing` queue deadlock. Because `end()` calls `proc.join()` before draining the remaining items in `result_queue`, workers completing tasks block indefinitely on `put()` once the OS-level pipe buffer fills up, while the parent scheduler process blocks on `join()`. My proposed fix: 1. **Draining during join:** Prevent the deadlock in `end()` by continuously calling `_read_results()` in a loop with a bounded `proc.join(timeout=0.5)` to ensure the pipe buffer is kept clear while waiting for workers to exit. 2. **Resilience:** Add resilience to `_read_results()` to catch and log `OSError`/`EOFError` instead of letting a broken worker pipe propagate an unhandled exception, ensuring that the normal reaping mechanism (`_check_workers()`) handles the worker exit gracefully. 3. **Forced Terminate:** Implement `terminate()` to forcefully stop workers in case of immediate shutdown. Will be working on this immediately as I see it's labeled `priority: high` until I've reproduced the pipe-fill deadlock and confirmed the fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
