kaxil commented on PR #55767: URL: https://github.com/apache/airflow/pull/55767#issuecomment-3304266244
>When tasks are killed by system signals (SIGKILL for OOM, SIGTERM for worker restarts), they immediately go to FAILED state instead of respecting the task retries set and going to UP_FOR_RETRY state. This creates unexpected behavior where exception based failures respect retries but signal based failures don't. How common is this scenario (excluding manually killing task process)? Since the supervisor and task processes are running in the same container, wouldn't an OOM condition typically kill the entire container rather than just the individual task process? In the more common case where the entire container gets OOM-killed: 1) The supervisor process would also die 2) Heartbeat to the scheduler would fail 3) Scheduler would receive a FAILED executor event and handle retries through the normal `process_executor_events()` → `handle_failure()` path -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
