Yes, I understand that heartbeat is applicable here, which is why I was only against 1 of the name change. "task_instance_heartbeat_timeout*" is definitely easier to follow and contextualize as compared to keeping scheduler prefix.
Shubham On 2025-02-13, 9:18 AM, "Ryan Hatter" <ryan.hat...@astronomer.io.inva <mailto:ryan.hat...@astronomer.io.inva>LID> wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que le contenu ne présente aucun risque. > > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think we > already have 6 configurations with "heartbeat" in it, and they're different > types of heartbeats. Anyways, I am against this name change: > scheduler_zombie_task_threshold --> > scheduler_task_heartbeat_timeout_threshold I think "heartbeat" is quite accurate here: the worker periodically emits a heartbeat to the database that the scheduler uses to determine the health of the worker. What about *task_instance_heartbeat_timeout*? I also think we should combine "local_task_job_heartbeat_sec" with > "scheduler_zombie_task_threshold". That configurations description says > that it already defaults to zombie task threshold when set to 0. I haven't > dug into the code to see why they are different, but I really hope our > configuration documentation doesn't read like below in the future: > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the > LocalTaskJob should send heartbeat signals to the scheduler to notify it’s > still alive. If this value is set to 0, the heartbeat interval will default > to the value of [scheduler] scheduler_task_heartbeat_timeout_threshold." +1 On Thu, Feb 13, 2025 at 3:37 AM Mehta, Shubham <shu...@amazon.com.inva <mailto:shu...@amazon.com.inva>lid> wrote: > I also agree with the idea that we should go for a name that's more > accurate and easier to understand. Also, +1 to starting with Airflow 3. > > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think we > already have 6 configurations with "heartbeat" in it, and they're different > types of heartbeats. > > Anyways, I am against this name change: > scheduler_zombie_task_threshold --> > scheduler_task_heartbeat_timeout_threshold > > We already have scheduler heartbeat, and let's drop the scheduler word > from this, so that users know that this is Task Instance heartbeat, not > scheduler. > > I also think we should combine "local_task_job_heartbeat_sec" with > "scheduler_zombie_task_threshold". That configurations description says > that it already defaults to zombie task threshold when set to 0. I haven't > dug into the code to see why they are different, but I really hope our > configuration documentation doesn't read like below in the future: > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the > LocalTaskJob should send heartbeat signals to the scheduler to notify it’s > still alive. If this value is set to 0, the heartbeat interval will default > to the value of [scheduler] scheduler_task_heartbeat_timeout_threshold." > > Thanks > Shubham > > On 2025-02-11, 1:15 PM, "Karen Braganza" <karenbraganz...@gmail.com > <mailto:karenbraganz...@gmail.com> > <mailto:karenbraganz...@gmail.com <mailto:karenbraganz...@gmail.com>>> wrote: > > > CAUTION: This email originated from outside of the organization. Do not > click links or open attachments unless you can confirm the sender and know > the content is safe. > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que > le contenu ne présente aucun risque. > > > > > > > Hi, > > > I have been working on this PR > <https://github.com/apache/airflow/pull/46257> > <https://github.com/apache/airflow/pull/46257>> < > https://github.com/apache/airflow/pull/46257>> > <https://github.com/apache/airflow/pull/46257&gt;>> to update our > documentation > on zombie tasks to reflect the terminology used in the user-facing event > logs in Airflow 2.10+. The event logs use the terminology "heartbeat > timeout" whereas the documentation uses the terminology "zombie tasks". I > would like to update the documentation to focus on the "heartbeat timeout" > terminology so that users are able to find and understand this > documentation easily when they see a "heartbeat timeout" in the event logs. > > > In the same vein, I think other user-facing configurations should also be > updated to use the same terminology. I am proposing that we make the > following changes to Airflow configuration variables: > > > scheduler_zombie_task_threshold --> scheduler_task_heartbeat_ > timeout_threshold > zombie_detection_interval --> task_heartbeat_timeout_detection_interval > > > In addition to this, I propose that we also change the logs emitted by the > scheduler to use the "task heartbeat timeout" terminology. > > > For example, the below logs > < > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040> > > <https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040>> > < > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040> > > <https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040&gt> > ;> > : > self.log.error( > "Detected zombie job: %s " > "(See https://airflow.apache.org/docs/apache-airflow/" > <https://airflow.apache.org/docs/apache-airflow/"> < > https://airflow.apache.org/docs/apache-airflow/"> > <https://airflow.apache.org/docs/apache-airflow/&quot;>> > "stable/core-concepts/tasks.html#zombie-tasks)", > request, > ) > > > should become: > > > self.log.error( > "Detected task heartbeat timeout: %s " > "(See https://airflow.apache.org/docs/apache-airflow/" > <https://airflow.apache.org/docs/apache-airflow/"> < > https://airflow.apache.org/docs/apache-airflow/"> > <https://airflow.apache.org/docs/apache-airflow/&quot;>> > "stable/core-concepts/tasks.html#zombie-tasks)", > request, > ) > > > I wanted to start this discussion to get everyone's thoughts on my > proposal. Do you agree (or disagree) that at least all user-facing elements > of Airflow should use the "task heartbeat timeout" terminology instead of > "zombie tasks" for uniformity? > > > I can add all of these changes to my PR. > > > Best, > Karen Braganza > > > > > < > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval> > > <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval>> > < > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval> > > <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval&gt> > ;> > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > <mailto:dev-unsubscr...@airflow.apache.org> > For additional commands, e-mail: dev-h...@airflow.apache.org > <mailto:dev-h...@airflow.apache.org> > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org For additional commands, e-mail: dev-h...@airflow.apache.org