abhipalsingh commented on PR #68708:
URL: https://github.com/apache/airflow/pull/68708#issuecomment-4741775521

   Adding a data point from the API server side, since this fork behavior
   isn't only a scheduler concern.
   
   On Airflow 3.2.1 with apache-airflow-providers-openlineage==2.14.0
   (pre-#65677), manual task-instance state changes via the REST API
   (mark success/failed/skipped, clear) fire on_task_instance_* on the
   api-server, which is a multithreaded async process (FastAPI/uvicorn
   under gunicorn). The listener's _fork_execute calls os.fork() from that
   multithreaded worker, and a fraction of the children deadlock
   immediately on an inherited lock: py-spy showed them parked in
   futex_wait_queue, never reaching the post-fork setproctitle. Those
   children are never reaped, each holds ~350–400 MB, and they accumulate
   without bound until the api-server hits OOM.
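
   The failure mode above can be reproduced in miniature. The sketch below
   is not the provider's code: fork_while_lock_held is a hypothetical
   helper, and a plain threading.Lock stands in for whatever lock the real
   children inherited. It forks while a second thread holds the lock and
   reports whether the child was unable to acquire it. The child uses a
   timeout so the demo terminates; without one it would block forever.
   POSIX only.

```python
import os
import threading
import warnings


def fork_while_lock_held(timeout: float = 1.0) -> bool:
    """Fork while another thread holds a lock.

    Returns True if the child could NOT acquire the inherited lock.
    Hypothetical helper for illustration only.
    """
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder() -> None:
        # Take the lock and keep it until the parent says otherwise.
        with lock:
            held.set()
            release.wait()

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()  # the lock is now owned by the holder thread

    with warnings.catch_warnings():
        # Python 3.12+ warns when forking a multithreaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: only the forking thread was copied. The holder thread
        # does not exist here, so nothing will ever release the lock.
        code = 1
        try:
            code = 0 if lock.acquire(timeout=timeout) else 1
        finally:
            os._exit(code)

    # Parent: reap the child, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.waitstatus_to_exitcode(status) == 1
```

   Calling fork_while_lock_held() in the parent returns True: the child
   inherits the lock in its locked state and has no thread left to unlock
   it.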
   
   #65677 helps the manual-state-change path (routes it through the
   ProcessPoolExecutor instead of a raw fork), but (a) that still forks a
   pool from the multithreaded async server, and (b) the natural-lifecycle
   handlers still use use_fork=True. So thread-based emission (this issue)
   is the cleaner fit for async contexts like the api-server, where
   os.fork() is fundamentally unsafe.
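
   For comparison, a minimal sketch of what thread-based emission could
   look like. This is an assumption for illustration, not the actual
   execute_in_thread implementation: emit_in_thread, _EMIT_POOL and
   _log_failure are hypothetical names, and the pool size is arbitrary.
   The point is only that the work runs in the same process, so no lock
   state is copied into a child.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

log = logging.getLogger(__name__)

# Hypothetical shared pool; a small bound keeps emission from piling up.
_EMIT_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="ol-emit")


def _log_failure(future: Future) -> None:
    # Runs once the future is done; never raises into the caller.
    if future.cancelled():
        return
    exc = future.exception()
    if exc is not None:
        log.warning("lineage emission failed: %r", exc)


def emit_in_thread(emit: Callable[..., Any], *args: Any, **kwargs: Any) -> Future:
    """Run `emit` on a worker thread instead of in a forked child."""
    future = _EMIT_POOL.submit(emit, *args, **kwargs)
    future.add_done_callback(_log_failure)
    return future
```

   A caller that wants to bound the wait can use
   emit_in_thread(...).result(timeout=...); a caller that does not care
   can drop the returned future, and failures are only logged.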
   
   We worked around it by disabling OpenLineage on the api-server (no
   transport is configured there anyway), but big +1 for execute_in_thread
   as the general fix.

