>
> Tbh "heartbeat" itself is an overused term/concept in Airflow. I think we
> already have 6 configurations with "heartbeat" in it, and they're different
> types of heartbeats.



Anyways, I am against this name change:
> scheduler_zombie_task_threshold  -->
> scheduler_task_heartbeat_timeout_threshold


I think "heartbeat" is quite accurate here: the worker periodically emits a
heartbeat to the database that the scheduler uses to determine the health
of the worker. What about *task_instance_heartbeat_timeout*?

I also think we should combine "local_task_job_heartbeat_sec" with
> "scheduler_zombie_task_threshold". That configurations description says
> that it already defaults to zombie task threshold when set to 0. I haven't
> dug into the code to see why they are different, but I really hope our
> configuration documentation doesn't read like below in the future:
>


"local_task_job_heartbeat_sec: The frequency (in seconds) at which the
> LocalTaskJob should send heartbeat signals to the scheduler to notify it’s
> still alive. If this value is set to 0, the heartbeat interval will default
> to the value of [scheduler] scheduler_task_heartbeat_timeout_threshold."


+1

On Thu, Feb 13, 2025 at 3:37 AM Mehta, Shubham <shu...@amazon.com.invalid>
wrote:

> I also agree with the idea that we should go for a name that's more
> accurate and easier to understand. Also, +1 to starting with Airflow 3.
>
> Tbh "heartbeat" itself is an overused term/concept in Airflow. I think we
> already have 6 configurations with "heartbeat" in it, and they're different
> types of heartbeats.
>
> Anyways, I am against this name change:
> scheduler_zombie_task_threshold  -->
> scheduler_task_heartbeat_timeout_threshold
>
> We already have scheduler heartbeat, and let's drop the scheduler word
> from this, so that users know that this is Task Instance heartbeat, not
> scheduler.
>
> I also think we should combine "local_task_job_heartbeat_sec" with
> "scheduler_zombie_task_threshold". That configurations description says
> that it already defaults to zombie task threshold when set to 0. I haven't
> dug into the code to see why they are different, but I really hope our
> configuration documentation doesn't read like below in the future:
> "local_task_job_heartbeat_sec: The frequency (in seconds) at which the
> LocalTaskJob should send heartbeat signals to the scheduler to notify it’s
> still alive. If this value is set to 0, the heartbeat interval will default
> to the value of [scheduler] scheduler_task_heartbeat_timeout_threshold."
>
> Thanks
> Shubham
>
> On 2025-02-11, 1:15 PM, "Karen Braganza" <karenbraganz...@gmail.com
> <mailto:karenbraganz...@gmail.com>> wrote:
>
>
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
>
>
>
> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe.
> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez
> pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que
> le contenu ne présente aucun risque.
>
>
>
>
>
>
> Hi,
>
>
> I have been working on this PR
> <https://github.com/apache/airflow/pull/46257> <
> https://github.com/apache/airflow/pull/46257&gt;> to update our
> documentation
> on zombie tasks to reflect the terminology used in the user-facing event
> logs in Airflow 2.10+. The event logs use the terminology "heartbeat
> timeout" whereas the documentation uses the terminology "zombie tasks". I
> would like to update the documentation to focus on the "heartbeat timeout"
> terminology so that users are able to find and understand this
> documentation easily when they see a "heartbeat timeout" in the event logs.
>
>
> In the same vein, I think other user-facing configurations should also be
> updated to use the same terminology. I am proposing that we make the
> following changes to Airflow configuration variables:
>
>
> scheduler_zombie_task_threshold --> scheduler_task_heartbeat_
> timeout_threshold
> zombie_detection_interval --> task_heartbeat_timeout_detection_interval
>
>
> In addition to this, I propose that we also change the logs emitted by the
> scheduler to use the "task heartbeat timeout" terminology.
>
>
> For example, the below logs
> <
> https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040>
> <
> https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040&gt
> ;>
> :
> self.log.error(
> "Detected zombie job: %s "
> "(See https://airflow.apache.org/docs/apache-airflow/"; <
> https://airflow.apache.org/docs/apache-airflow/&quot;>
> "stable/core-concepts/tasks.html#zombie-tasks)",
> request,
> )
>
>
> should become:
>
>
> self.log.error(
> "Detected task heartbeat timeout: %s "
> "(See https://airflow.apache.org/docs/apache-airflow/"; <
> https://airflow.apache.org/docs/apache-airflow/&quot;>
> "stable/core-concepts/tasks.html#zombie-tasks)",
> request,
> )
>
>
> I wanted to start this discussion to get everyone's thoughts on my
> proposal. Do you agree (or disagree) that at least all user-facing elements
> of Airflow should use the "task heartbeat timeout" terminology instead of
> "zombie tasks" for uniformity?
>
>
> I can add all of these changes to my PR.
>
>
> Best,
> Karen Braganza
>
>
>
>
> <
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval>
> <
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval&gt
> ;>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>

Reply via email to