AutomationDev85 opened a new pull request, #67115:
URL: https://github.com/apache/airflow/pull/67115

   # Overview
   
   Some Edge workers intermittently fail with the following error:
   
   `2026-05-17T09:39:34.839980Z [error    ] Task execution failed [airflow.providers.edge3.cli.worker] loc=worker.py:226
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/edge3/cli/worker.py", line 213, in _run_job_via_supervisor
       supervise(
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 2107, in supervise
       exit_code = process.wait()
                   ^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 1062, in wait
       self._monitor_subprocess()
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 1127, in _monitor_subprocess
       alive = self._service_subprocess(max_wait_time=max_wait_time) is None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 791, in _service_subprocess
       events = self.selector.select(timeout=timeout)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/python/lib/python3.12/selectors.py", line 468, in select
       fd_event_list = self._selector.poll(timeout, max_ev)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   ValueError: I/O operation on closed epoll object`
   
   Analysis showed that when the socket cleanup timeout fires and `_cleanup_open_sockets()` closes the selector, the monitor loop does not exit. On the next iteration it calls `selector.select()` on the already-closed epoll object, which raises the `ValueError`.
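
The failure mode can be reproduced outside Airflow with the standard library alone. The snippet below is a minimal sketch, not the supervisor code: the function name `select_after_close` is hypothetical, and `sel.close()` merely stands in for the forced cleanup closing the selector. It assumes a platform where `selectors.DefaultSelector` is backed by epoll (Linux) or kqueue.

```python
from __future__ import annotations

import selectors
import socket


def select_after_close() -> str | None:
    """Return the message of the error raised by select() on a closed selector."""
    sel = selectors.DefaultSelector()  # EpollSelector on Linux
    left, right = socket.socketpair()
    try:
        sel.register(left, selectors.EVENT_READ)
        sel.close()  # stands in for the forced cleanup closing the selector
        try:
            # This mirrors the "next iteration" of a loop that was not exited.
            sel.select(timeout=0)
        except ValueError as exc:
            return str(exc)
        return None
    finally:
        left.close()
        right.close()


if __name__ == "__main__":
    print(select_after_close())
```

On Linux the returned message should match the last line of the traceback above.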
   
   # Details of change
   
   - Add a `break` to exit the monitor loop immediately after the forced socket cleanup, preventing any further calls to `selector.select()` on the closed selector.
   - Call `_open_sockets.clear()` inside `_cleanup_open_sockets()` to keep the socket registry consistent with the selector state after cleanup.
   - Adapt the unit test accordingly.
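
The shape of the fix can be sketched as follows. This is an illustrative stand-in, not the actual change in `supervisor.py`: the class `MonitorSketch`, its `add_socket` and `monitor` methods, the `cleanup_timeout` parameter, and `run_demo` are all hypothetical names. Only `_cleanup_open_sockets`, `_open_sockets`, and `selector` echo identifiers from the description above, and their real implementations may differ.

```python
import selectors
import socket
import time


class MonitorSketch:
    """Hypothetical stand-in for a supervisor monitor loop (not Airflow code)."""

    def __init__(self, cleanup_timeout: float = 0.05) -> None:
        self.selector = selectors.DefaultSelector()
        self._open_sockets: dict[socket.socket, str] = {}
        self.cleanup_timeout = cleanup_timeout

    def add_socket(self, sock: socket.socket, name: str) -> None:
        self.selector.register(sock, selectors.EVENT_READ)
        self._open_sockets[sock] = name

    def _cleanup_open_sockets(self) -> None:
        # Close the selector and every tracked socket, then empty the
        # registry so it agrees with the (now closed) selector.
        self.selector.close()
        for sock in self._open_sockets:
            sock.close()
        self._open_sockets.clear()

    def monitor(self) -> None:
        deadline = time.monotonic() + self.cleanup_timeout
        while True:
            self.selector.select(timeout=0.01)  # events ignored in this sketch
            if time.monotonic() >= deadline:
                self._cleanup_open_sockets()
                # Without this break, the next select() would run on a
                # closed selector and raise ValueError.
                break


def run_demo() -> tuple[int, bool]:
    """Run the loop once; return (remaining sockets, whether the socket closed)."""
    left, right = socket.socketpair()
    sketch = MonitorSketch()
    sketch.add_socket(left, "stdout")
    try:
        sketch.monitor()
    finally:
        right.close()
    return len(sketch._open_sockets), left.fileno() == -1


if __name__ == "__main__":
    print(run_demo())
```

The loop condition is deliberately `while True` here so that the `break` is the only exit path, which makes the role of the first bullet explicit.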
   

