I love it. "heartbeat timeout" is obvious and has meaning in software beyond Airflow, so it makes sense to stick with this verbiage and use it to replace "zombie" in docs, configs, logs, and code IMO.
On Tue, Feb 11, 2025 at 4:15 PM Karen Braganza <karenbraganz...@gmail.com> wrote:

> Hi,
>
> I have been working on this PR
> <https://github.com/apache/airflow/pull/46257> to update our documentation
> on zombie tasks to reflect the terminology used in the user-facing event
> logs in Airflow 2.10+. The event logs use the terminology "heartbeat
> timeout" whereas the documentation uses the terminology "zombie tasks". I
> would like to update the documentation to focus on the "heartbeat timeout"
> terminology so that users are able to find and understand this
> documentation easily when they see a "heartbeat timeout" in the event logs.
>
> In the same vein, I think other user-facing configurations should also be
> updated to use the same terminology. I am proposing that we make the
> following changes to Airflow configuration variables:
>
> scheduler_zombie_task_threshold --> scheduler_task_heartbeat_timeout_threshold
> zombie_detection_interval --> task_heartbeat_timeout_detection_interval
>
> In addition to this, I propose that we also change the logs emitted by the
> scheduler to use the "task heartbeat timeout" terminology.
>
> For example, the below logs
> <https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040>:
>
>     self.log.error(
>         "Detected zombie job: %s "
>         "(See https://airflow.apache.org/docs/apache-airflow/"
>         "stable/core-concepts/tasks.html#zombie-tasks)",
>         request,
>     )
>
> should become:
>
>     self.log.error(
>         "Detected task heartbeat timeout: %s "
>         "(See https://airflow.apache.org/docs/apache-airflow/"
>         "stable/core-concepts/tasks.html#zombie-tasks)",
>         request,
>     )
>
> I wanted to start this discussion to get everyone's thoughts on my
> proposal. Do you agree (or disagree) that at least all user-facing elements
> of Airflow should use the "task heartbeat timeout" terminology instead of
> "zombie tasks" for uniformity?
>
> I can add all of these changes to my PR.
>
> Best,
> Karen Braganza
>
> <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval>
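
To make the proposed config rename concrete, here is a minimal sketch of how the new option names could keep honoring the old ones during a deprecation period. It is not Airflow's actual implementation: it uses only the standard-library configparser, the helper names (RENAMED_OPTIONS, get_scheduler_option, load_config) are hypothetical, and it assumes both options live in the [scheduler] section.

```python
import configparser
import warnings

# Hypothetical mapping from each proposed option name to the name it replaces.
# Assumption: both options live in the [scheduler] section.
RENAMED_OPTIONS = {
    "scheduler_task_heartbeat_timeout_threshold": "scheduler_zombie_task_threshold",
    "task_heartbeat_timeout_detection_interval": "zombie_detection_interval",
}


def get_scheduler_option(parser, name, fallback=None):
    """Read a [scheduler] option, falling back to its pre-rename name."""
    # Prefer the new name whenever it is set.
    if parser.has_option("scheduler", name):
        return parser.get("scheduler", name)

    # Otherwise accept the old name, but tell the user to migrate.
    old_name = RENAMED_OPTIONS.get(name)
    if old_name is not None and parser.has_option("scheduler", old_name):
        warnings.warn(
            f"[scheduler] {old_name} is deprecated; use {name} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return parser.get("scheduler", old_name)

    return fallback


def load_config(text):
    """Parse INI-style config text into a ConfigParser (illustration only)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser
```

With something along these lines, an existing airflow.cfg that still sets scheduler_zombie_task_threshold would keep working and emit a DeprecationWarning, while new deployments would set only the "heartbeat timeout" names.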