potiuk commented on issue #57174: URL: https://github.com/apache/airflow/issues/57174#issuecomment-3446710082
Correct - currently the timeout is handled in the task itself via SIGALRM, and if another signal (e.g. SIGSEGV) arrives, the task can end up in a hanging state. This usually happens when you have native code running in a long tight loop that does not handle signals in Python.

We've discussed this in the past, and the solution is to handle the timeout in the supervisor (which runs in another, parent process - tasks run in forked processes). To communicate this, we need a new task -> supervisor API that sends the timeout information after the DAG is parsed, because the supervisor does not have that information as it does not parse the task.

This would generally handle all the possible cases where a task hangs, including signal escalation to kill such forked tasks (SIGTERM followed by SIGKILL after a short additional timeout if the task does not exit).

That's all possible, and if someone would like to take on this task, it's not even that difficult. I marked it as a good-first issue.

cc: @ashb @amoghrajesh if you have something to add.
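To make the escalation idea concrete, here is a minimal sketch of parent-side timeout enforcement. This is illustrative plain Python, not Airflow's actual supervisor code or API: the parent runs the "task" in a child process, waits up to a timeout, then sends SIGTERM and, after a short grace period, SIGKILL - which even native code spinning in a tight loop cannot ignore.

```python
# Minimal sketch (POSIX): supervisor-side timeout with SIGTERM -> SIGKILL
# escalation. Illustrative only - not Airflow's actual supervisor code.
import signal
import time
import multiprocessing


def stuck_task():
    # Simulate a task that ignores polite termination requests,
    # the way native code in a tight loop effectively does.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    time.sleep(60)


def supervise(target, timeout: float, grace: float) -> str:
    """Run `target` in a child process, enforcing `timeout` from the parent."""
    child = multiprocessing.Process(target=target)
    child.start()

    child.join(timeout)
    if not child.is_alive():
        return "finished"

    child.terminate()          # SIGTERM: ask the task to exit
    child.join(grace)
    if child.is_alive():
        child.kill()           # SIGKILL: cannot be caught or ignored
        child.join()
        return "sigkilled"
    return "sigtermed"


if __name__ == "__main__":
    print(supervise(stuck_task, timeout=0.5, grace=0.5))
```

Because the enforcement lives in the parent, it works regardless of what the child is doing - unlike SIGALRM, which relies on the child's Python interpreter getting a chance to run the signal handler.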
