potiuk commented on PR #54089:
URL: https://github.com/apache/airflow/pull/54089#issuecomment-3154213860

   Looks good. But while we are at it, a suggestion (we do not have to do it now, or maybe even at all, but it may be worth considering since you are moving timeout to task-sdk): maybe we should implement an additional fork here to handle timeouts for long-running, badly written C code.
   
   This only really affects the "task" running - so since we now have a separate "task-only" implementation of Timeout, we could implement it here.
   
   We discussed this before in https://github.com/apache/airflow/issues/53337 (and there were earlier discussions on that): the best option would be for the timeout to be handled in the supervisor, but we really "can't" do that easily, because it would effectively require the supervisor to parse the DAG file (catch-22).
   
   But running the alarm in the task interpreter has two drawbacks:
   
   * less important -> it does not allow running code that implements its own SIGALRM handling (for example, some timeout handlers in pytest use SIGALRM for test timeouts, and I can imagine other libraries doing it internally)
   
   * more important -> badly written C code that does not return control to the interpreter periodically might cause the task to run forever, without respecting the timeout. SIGALRM is fired, but the Python-level handler only runs once the long-running C code returns control to the interpreter - which may be never (see the sketch below). This has caused a number of issues raised to us in the past.
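   
   For context, the alarm-based approach looks roughly like this (a hedged sketch of the general pattern, not the actual task-sdk code; all names here are made up for illustration):
   
   ```python
   import signal
   from contextlib import contextmanager
   
   @contextmanager
   def alarm_timeout(seconds: int):
       # illustrative alarm-based timeout context manager
       def handler(signum, frame):
           raise TimeoutError(f"timed out after {seconds}s")
   
       old_handler = signal.signal(signal.SIGALRM, handler)
       signal.alarm(seconds)
       try:
           yield
       finally:
           signal.alarm(0)  # cancel any pending alarm
           signal.signal(signal.SIGALRM, old_handler)
   
   # The catch: the Python-level handler only runs when the interpreter
   # regains control. If the body is one long C call:
   #
   #     with alarm_timeout(5):
   #         some_c_extension.crunch_forever()
   #
   # SIGALRM is delivered, but the TimeoutError is raised only after
   # crunch_forever() returns - possibly never.
   ```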
   
   Now - the supervisor route might not be viable, but we could have an additional fork here to **just** handle the timeout and nothing else, i.e. (pseudocode):
   
   ```
   child_pid = fork()
   if child_pid == 0:
       # child: run the actual task
       run_task()
   else:
       # parent: pure timeout watchdog
       setproctitle("timeout ...")
       join(child_pid, timeout)
       if child_pid still running:
           kill_child_pid_possibly_escalating_SIGTERM_SIGKILL
   ```
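   
   Fleshed out a little, that could look like the following (a minimal sketch; `run_task` is a hypothetical stand-in for the real task entrypoint, and the polling/escalation details are illustrative, not the actual task-sdk API):
   
   ```python
   import os
   import signal
   import time
   
   from setproctitle import setproctitle
   
   
   def run_task() -> None:
       time.sleep(60)  # placeholder for the real task entrypoint
   
   
   def _wait(child_pid: int, timeout: float):
       """Poll for child exit; return its wait status, or None on timeout."""
       deadline = time.monotonic() + timeout
       while time.monotonic() < deadline:
           pid, status = os.waitpid(child_pid, os.WNOHANG)
           if pid != 0:
               return status
           time.sleep(0.2)
       return None
   
   
   def run_with_timeout(timeout: float, grace: float = 5.0) -> int:
       child_pid = os.fork()
       if child_pid == 0:
           # child: run the actual task, then exit without atexit hooks
           run_task()
           os._exit(0)
   
       # parent: pure watchdog - it allocates nothing after the fork, so
       # its pages stay shared with the child (no copy-on-write from here)
       setproctitle("airflow: task timeout watchdog")
       status = _wait(child_pid, timeout)
       if status is None:
           # timed out: ask nicely first, then force
           os.kill(child_pid, signal.SIGTERM)
           status = _wait(child_pid, grace)
           if status is None:
               os.kill(child_pid, signal.SIGKILL)  # cannot be ignored
               _, status = os.waitpid(child_pid, 0)  # reap, avoid a zombie
       return os.waitstatus_to_exitcode(status)
   ```
   
   Because the watchdog never touches Python objects after forking, the extra process costs almost no physical memory - which is the copy-on-write argument below.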
   
   That is quite a bit simpler to implement (no context manager, no alarm/signal handling AT ALL) and 100% resilient against long-running C code.
   
   There is a LITTLE memory overhead. While Python is not "fork-friendly" due to reference counting, in this case it would work entirely, because the "timeout" process would do literally NOTHING (except setting the proctitle) after it forks the child - which means that the memory used by the "timeout" process would not actually be copied on write, because no write would ever occur to any of the memory used by Python objects in that process.
   
   Effectively, every task that needs a timeout would be three processes:
   
   ```
   supervisor -> timeout -> task 
   ```
   
   I think that is cleaner and more resilient. The alarm handling is also somewhat obscure; such a fork/join approach is way cleaner and more straightforward - less code, less boilerplate.
   

