> > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think we > already have 6 configurations with "heartbeat" in it, and they're different > types of heartbeats.
Anyways, I am against this name change: > scheduler_zombie_task_threshold --> > scheduler_task_heartbeat_timeout_threshold I think "heartbeat" is quite accurate here: the worker periodically emits a heartbeat to the database that the scheduler uses to determine the health of the worker. What about *task_instance_heartbeat_timeout*? I also think we should combine "local_task_job_heartbeat_sec" with > "scheduler_zombie_task_threshold". That configurations description says > that it already defaults to zombie task threshold when set to 0. I haven't > dug into the code to see why they are different, but I really hope our > configuration documentation doesn't read like below in the future: > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the > LocalTaskJob should send heartbeat signals to the scheduler to notify it’s > still alive. If this value is set to 0, the heartbeat interval will default > to the value of [scheduler] scheduler_task_heartbeat_timeout_threshold." +1 On Thu, Feb 13, 2025 at 3:37 AM Mehta, Shubham <shu...@amazon.com.invalid> wrote: > I also agree with the idea that we should go for a name that's more > accurate and easier to understand. Also, +1 to starting with Airflow 3. > > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think we > already have 6 configurations with "heartbeat" in it, and they're different > types of heartbeats. > > Anyways, I am against this name change: > scheduler_zombie_task_threshold --> > scheduler_task_heartbeat_timeout_threshold > > We already have scheduler heartbeat, and let's drop the scheduler word > from this, so that users know that this is Task Instance heartbeat, not > scheduler. > > I also think we should combine "local_task_job_heartbeat_sec" with > "scheduler_zombie_task_threshold". That configurations description says > that it already defaults to zombie task threshold when set to 0. I haven't > dug into the code to see why they are different, but I really hope our > configuration documentation doesn't read like below in the future: > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the > LocalTaskJob should send heartbeat signals to the scheduler to notify it’s > still alive. If this value is set to 0, the heartbeat interval will default > to the value of [scheduler] scheduler_task_heartbeat_timeout_threshold." > > Thanks > Shubham > > On 2025-02-11, 1:15 PM, "Karen Braganza" <karenbraganz...@gmail.com > <mailto:karenbraganz...@gmail.com>> wrote: > > > CAUTION: This email originated from outside of the organization. Do not > click links or open attachments unless you can confirm the sender and know > the content is safe. > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que > le contenu ne présente aucun risque. > > > > > > > Hi, > > > I have been working on this PR > <https://github.com/apache/airflow/pull/46257> < > https://github.com/apache/airflow/pull/46257>> to update our > documentation > on zombie tasks to reflect the terminology used in the user-facing event > logs in Airflow 2.10+. The event logs use the terminology "heartbeat > timeout" whereas the documentation uses the terminology "zombie tasks". I > would like to update the documentation to focus on the "heartbeat timeout" > terminology so that users are able to find and understand this > documentation easily when they see a "heartbeat timeout" in the event logs. > > > In the same vein, I think other user-facing configurations should also be > updated to use the same terminology. I am proposing that we make the > following changes to Airflow configuration variables: > > > scheduler_zombie_task_threshold --> scheduler_task_heartbeat_ > timeout_threshold > zombie_detection_interval --> task_heartbeat_timeout_detection_interval > > > In addition to this, I propose that we also change the logs emitted by the > scheduler to use the "task heartbeat timeout" terminology. > > > For example, the below logs > < > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040> > < > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040> > ;> > : > self.log.error( > "Detected zombie job: %s " > "(See https://airflow.apache.org/docs/apache-airflow/" < > https://airflow.apache.org/docs/apache-airflow/"> > "stable/core-concepts/tasks.html#zombie-tasks)", > request, > ) > > > should become: > > > self.log.error( > "Detected task heartbeat timeout: %s " > "(See https://airflow.apache.org/docs/apache-airflow/" < > https://airflow.apache.org/docs/apache-airflow/"> > "stable/core-concepts/tasks.html#zombie-tasks)", > request, > ) > > > I wanted to start this discussion to get everyone's thoughts on my > proposal. Do you agree (or disagree) that at least all user-facing elements > of Airflow should use the "task heartbeat timeout" terminology instead of > "zombie tasks" for uniformity? > > > I can add all of these changes to my PR. > > > Best, > Karen Braganza > > > > > < > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval> > < > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval> > ;> > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > For additional commands, e-mail: dev-h...@airflow.apache.org >