wjddn279 commented on PR #65943:
URL: https://github.com/apache/airflow/pull/65943#issuecomment-4442215679

   @diogosilva30 
   Thanks for reply!
   
   > On the "gradually decreases" question, I see where you're coming from, but 
the problem here isn't plugin imports in user code. The warning fires on every 
single fork throughout the worker's lifetime because supervisor.py itself keeps 
a thread pool alive for async log pushing (aiofiles/anyio). Add GCS credential 
refreshes and Secret Manager calls on every task completion and you've got live 
threads on basically every fork. The race doesn't go away after warmup.
   
   Fair point. I was only thinking about the import-lock case, but it sounds 
like the locking issues span much more broadly across threads than just imports.
   
   > What about multiprocessing.forkserver?
   
   After looking into how forkserver actually works, **it looks like a better 
fit than a fork pool for a multi-threaded environment like the edge worker**. 
There is no need to write many code to apply it.
   
   Spinning up the server before the async cycle starts and forking from there 
seems like the right shape.
   
   > Leaks, stale connections, open fds all accumulate across task runs. Right 
now every task gets a clean process and that isolation earns its keep.
   
   The forked process doesn't do anything beyond running supervise_task. Any 
connections or fds it creates are explicitly closed. (There were a few leaks in 
practice, but I've fixed most of them by now.)
   
   > Plugin hot reload via git-sync breaks until a worker recycles since 
long-lived workers hold onto their imported code.
   
   Same here — by design the worker doesn't import user code directly. All user 
code is loaded only inside the short-lived process forked by supervise_task.
   
   > It touches the core execution model (dispatcher, queueing, lifecycle, 
supervision). Way bigger blast radius than this PR warrants.
   
   Agreed. There's reference code we could lean on, but the diff is still 
substantial — you'd need a Queue for IPC and liveness checks layered on top.
   
   > If a pool worker dies mid-task, you have to detect it, restart it, and 
reconcile with the Edge API. Per-task processes just... handle that.
   
   This was actually my biggest concern with the pool approach. If a worker 
dies you have to restart it, but there's no guarantee the restarted process is 
lock-safe either. **That's exactly where forkserver looks like the strong 
option**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to