+1 to the idea and to restrict the change to Airflow 3 only

________________________________
From: Wei Lee <weilee...@gmail.com>
Sent: 12 February 2025 17:01
To: dev@airflow.apache.org <dev@airflow.apache.org>
Subject: Re: Updating "zombie task" terminology to "task heartbeat timeout"

I like this idea as well. But not sure whether it would affect monitoring. 🤔
If we’re to introduce it, we’d better make it Airflow 3 only and make sure we add a migration rule as we’re changing the configuration.

Best,
Wei

> On Feb 12, 2025, at 6:10 AM, Ryan Hatter <ryan.hat...@astronomer.io.invalid> wrote:
>
> I love it. "heartbeat timeout" is obvious and has meaning in software
> beyond Airflow, so it makes sense to stick with this verbiage and use it to
> replace "zombie" in docs, configs, logs, and code IMO.
>
> On Tue, Feb 11, 2025 at 4:15 PM Karen Braganza <karenbraganz...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have been working on this PR
>> <https://github.com/apache/airflow/pull/46257> to update our documentation
>> on zombie tasks to reflect the terminology used in the user-facing event
>> logs in Airflow 2.10+. The event logs use the terminology "heartbeat
>> timeout" whereas the documentation uses the terminology "zombie tasks". I
>> would like to update the documentation to focus on the "heartbeat timeout"
>> terminology so that users are able to find and understand this
>> documentation easily when they see a "heartbeat timeout" in the event logs.
>>
>> In the same vein, I think other user-facing configurations should also be
>> updated to use the same terminology. I am proposing that we make the
>> following changes to Airflow configuration variables:
>>
>> scheduler_zombie_task_threshold --> scheduler_task_heartbeat_timeout_threshold
>> zombie_detection_interval --> task_heartbeat_timeout_detection_interval
>>
>> In addition to this, I propose that we also change the logs emitted by the
>> scheduler to use the "task heartbeat timeout" terminology.
>>
>> For example, the below logs
>> <https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040>:
>>
>>     self.log.error(
>>         "Detected zombie job: %s "
>>         "(See https://airflow.apache.org/docs/apache-airflow/"
>>         "stable/core-concepts/tasks.html#zombie-tasks)",
>>         request,
>>     )
>>
>> should become:
>>
>>     self.log.error(
>>         "Detected task heartbeat timeout: %s "
>>         "(See https://airflow.apache.org/docs/apache-airflow/"
>>         "stable/core-concepts/tasks.html#zombie-tasks)",
>>         request,
>>     )
>>
>> I wanted to start this discussion to get everyone's thoughts on my
>> proposal. Do you agree (or disagree) that at least all user-facing elements
>> of Airflow should use the "task heartbeat timeout" terminology instead of
>> "zombie tasks" for uniformity?
>>
>> I can add all of these changes to my PR.
>>
>> Best,
>> Karen Braganza
>>
>> <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval>
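
For illustration only, here is a minimal sketch of what a migration rule for the two proposed renames could look like. It assumes the options arrive as a plain dictionary. `RENAMED_OPTIONS` and `migrate_options` are hypothetical names invented for this example, not Airflow APIs, and Airflow's actual deprecation handling may work differently.

```python
import warnings

# Old name -> new name, taken from the renames proposed in the thread.
RENAMED_OPTIONS = {
    "scheduler_zombie_task_threshold": "scheduler_task_heartbeat_timeout_threshold",
    "zombie_detection_interval": "task_heartbeat_timeout_detection_interval",
}


def migrate_options(options):
    """Return a copy of `options` with deprecated names replaced by new ones.

    Hypothetical helper for illustration; not part of Airflow.
    """
    migrated = dict(options)
    for old, new in RENAMED_OPTIONS.items():
        if old not in migrated:
            continue
        value = migrated.pop(old)
        warnings.warn(
            f"Option {old!r} is deprecated; use {new!r} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # If the new name is already set explicitly, it wins over the old one.
        migrated.setdefault(new, value)
    return migrated
```

Calling `migrate_options({"zombie_detection_interval": 10})` would return a dictionary keyed by `task_heartbeat_timeout_detection_interval` and emit one `DeprecationWarning`. The value 10 is arbitrary. Letting the new name take precedence when both are present is one possible design choice, so that a configuration already updated by the user is not silently overridden by a leftover old entry.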