I like this idea as well, but I’m not sure whether it would affect monitoring. 🤔
If we introduce it, we’d better make it Airflow 3 only and make sure we add
a migration rule, since we’re changing the configuration.
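
To make the migration-rule idea concrete, here is a minimal sketch of an
old-name to new-name mapping. The option names are the ones from the proposal
quoted below; RENAMED_OPTIONS and migrate_options are hypothetical names used
only for illustration, and this is not how Airflow actually implements its
config migrations.

```python
import warnings

# Old option name -> proposed new name, taken from the proposal quoted below.
RENAMED_OPTIONS = {
    "scheduler_zombie_task_threshold": "scheduler_task_heartbeat_timeout_threshold",
    "zombie_detection_interval": "task_heartbeat_timeout_detection_interval",
}


def migrate_options(options: dict[str, str]) -> dict[str, str]:
    """Return a copy of ``options`` with renamed options moved to their new names.

    Hypothetical helper for illustration only; not part of Airflow.
    """
    migrated = {}
    for name, value in options.items():
        new_name = RENAMED_OPTIONS.get(name)
        if new_name is None:
            # Not a renamed option (or already a new name): keep it unchanged.
            migrated[name] = value
            continue
        warnings.warn(
            f"Option {name!r} has been renamed to {new_name!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        # If the new name is also set explicitly, that value wins.
        if new_name not in options:
            migrated[new_name] = value
    return migrated
```

Something along these lines would let existing deployments keep working while
warning users about the rename, which should also make any impact on
monitoring setups easier to spot.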

Best,
Wei

> On Feb 12, 2025, at 6:10 AM, Ryan Hatter <ryan.hat...@astronomer.io.invalid> 
> wrote:
> 
> I love it. "heartbeat timeout" is obvious and has meaning in software
> beyond Airflow, so it makes sense to stick with this verbiage and use it to
> replace "zombie" in docs, configs, logs, and code IMO.
> 
> On Tue, Feb 11, 2025 at 4:15 PM Karen Braganza <karenbraganz...@gmail.com>
> wrote:
> 
>> Hi,
>> 
>> I have been working on this PR
>> <https://github.com/apache/airflow/pull/46257> to update our documentation
>> on zombie tasks to reflect the terminology used in the user-facing event
>> logs in Airflow 2.10+. The event logs use the terminology "heartbeat
>> timeout" whereas the documentation uses the terminology "zombie tasks". I
>> would like to update the documentation to focus on the "heartbeat timeout"
>> terminology so that users are able to find and understand this
>> documentation easily when they see a "heartbeat timeout" in the event logs.
>> 
>> In the same vein, I think other user-facing configurations should also be
>> updated to use the same terminology. I am proposing that we make the
>> following changes to Airflow configuration variables:
>> 
>> scheduler_zombie_task_threshold --> scheduler_task_heartbeat_timeout_threshold
>> zombie_detection_interval --> task_heartbeat_timeout_detection_interval
>> 
>> In addition to this, I propose that we also change the logs emitted by the
>> scheduler to use the "task heartbeat timeout" terminology.
>> 
>> For example, the below logs
>> <https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040>:
>> 
>> self.log.error(
>>     "Detected zombie job: %s "
>>     "(See https://airflow.apache.org/docs/apache-airflow/"
>>     "stable/core-concepts/tasks.html#zombie-tasks)",
>>     request,
>> )
>> 
>> should become:
>> 
>> self.log.error(
>>     "Detected task heartbeat timeout: %s "
>>     "(See https://airflow.apache.org/docs/apache-airflow/"
>>     "stable/core-concepts/tasks.html#zombie-tasks)",
>>     request,
>> )
>> 
>> I wanted to start this discussion to get everyone's thoughts on my
>> proposal. Do you agree (or disagree) that at least all user-facing elements
>> of Airflow should use the "task heartbeat timeout" terminology instead of
>> "zombie tasks" for uniformity?
>> 
>> I can add all of these changes to my PR.
>> 
>> Best,
>> Karen Braganza
>> 
>> 
>> <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval>
>> 

