potiuk commented on PR #54089: URL: https://github.com/apache/airflow/pull/54089#issuecomment-3154213860
Lookgs good. But while we are at it. Maybe (suggestion - we do not have to do it now or maybe even at all, but maybe worth considering since you are moving timeout to task-sdk. Maybe we should implement additional fork here to handle timeout for long running badly written C code. This only. really affect the "task" running - so while we have now separate 'task-only" implementaiton of Timeout, we could implement it here. We discussed it before https://github.com/apache/airflow/issues/53337 (and there were earlier discussions on that) - that the best option would be if timeout is handled in supervisor, but we really "can't" do it easily because it would effectively require the supervisor to parse the DAG file (catch-22). But running alarm in the task interpreter has two drawbacks: * less important -> it does not allow to run code that implements SIGALARM handling on their own (for example some timout handlers in Pytest use SIGALRM for test timeouts, but I can imagine other libraries doing it internally * more important -> badly written C code that does not check for alarms periodically might cause the task to run forever - without respecting the timeout. SIGALRM is fired, passed to the long-running C-code and it waits until the c code checks for it. This caused a number of issues raised to us in the past. Now - supervisor route might be not viable, but, we could have additional fork here to **just** handle timeout and nothing else. I.e. (pseudocode): ``` child_pid = fork() if child: run_task() else: setproctitle("timeut....") join(chid_pid, timeout) if child_pid still running: kill_child_pid_possibly_escalating_SIGTERM_SIGKILL ``` That is quite a bit simpler in implementation (no context manager, no alarm/signal handling AT ALLl, 100% resiliance against long running C code. There is a LITTLE memory overhead. While Python is not "fork-friendly" due to reference counting, in this case it would entirely work because the "timeout" process would do literally NOTHING (except setting proctitle) after it forks the child - which means that the memory used by the "timeout" process would not be "copied on write" because no write would ever occur to any of the memory used by python objects. Effectively every task that needs timeout would be three processes: ``` supervisor -> timeout -> task ``` I think that is a cleaner and more resilient. The alarm handling is also somewhat obscure. such if/child/join is way cleaner and straightforward, less code, less boilerplate. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
