+1 for -
* No zombies
* Change from Airflow 3
* Shubham's proposal to drop scheduler from the proposed new variable name:
scheduler_task_heartbeat_timeout_threshold

Thanks,
Kunal Bhattacharya


On Fri, Feb 14, 2025 at 11:20 AM Amogh Desai <amoghdesai....@gmail.com>
wrote:

> Thanks for sparking this discussion, Karen.
>
> +1 to the idea of nuking "zombies". It is a term that requires some deep
> understanding and also
> is a term that can be misunderstood too: zombie tasks vs zombie processes.
>
> However, this would affect users who are currently using this config, so as
> others mentioned, we should
> add it for AF v3 and also add some migration rules for them.
>
> Yeah, adding "scheduler" to the config seems like overloading the config. I
> would also be for a shorter config,
> like: `task_instance_heartbeat_timeout`.
>
> Thanks & Regards,
> Amogh Desai
>
>
> On Thu, Feb 13, 2025 at 11:23 PM Mehta, Shubham <shu...@amazon.com.invalid
> >
> wrote:
>
> > Yes, I understand that heartbeat is applicable here, which is why I was
> > only against 1 of the name change.
> > "task_instance_heartbeat_timeout*" is definitely easier to follow and
> > contextualize as compared to keeping scheduler prefix.
> >
> >
> > Shubham
> >
> > On 2025-02-13, 9:18 AM, "Ryan Hatter" <ryan.hat...@astronomer.io.inva
> > <mailto:ryan.hat...@astronomer.io.inva>LID> wrote:
> >
> >
> > CAUTION: This email originated from outside of the organization. Do not
> > click links or open attachments unless you can confirm the sender and
> know
> > the content is safe.
> >
> >
> >
> >
> >
> >
> > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe.
> > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne
> pouvez
> > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain
> que
> > le contenu ne présente aucun risque.
> >
> >
> >
> >
> >
> >
> > >
> > > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think
> we
> > > already have 6 configurations with "heartbeat" in it, and they're
> > different
> > > types of heartbeats.
> >
> >
> >
> >
> >
> >
> > Anyways, I am against this name change:
> > > scheduler_zombie_task_threshold -->
> > > scheduler_task_heartbeat_timeout_threshold
> >
> >
> >
> >
> > I think "heartbeat" is quite accurate here: the worker periodically
> emits a
> > heartbeat to the database that the scheduler uses to determine the health
> > of the worker. What about *task_instance_heartbeat_timeout*?
> >
> >
> > I also think we should combine "local_task_job_heartbeat_sec" with
> > > "scheduler_zombie_task_threshold". That configurations description says
> > > that it already defaults to zombie task threshold when set to 0. I
> > haven't
> > > dug into the code to see why they are different, but I really hope our
> > > configuration documentation doesn't read like below in the future:
> > >
> >
> >
> >
> >
> > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the
> > > LocalTaskJob should send heartbeat signals to the scheduler to notify
> > it’s
> > > still alive. If this value is set to 0, the heartbeat interval will
> > default
> > > to the value of [scheduler]
> scheduler_task_heartbeat_timeout_threshold."
> >
> >
> >
> >
> > +1
> >
> >
> > On Thu, Feb 13, 2025 at 3:37 AM Mehta, Shubham <shu...@amazon.com.inva
> > <mailto:shu...@amazon.com.inva>lid>
> > wrote:
> >
> >
> > > I also agree with the idea that we should go for a name that's more
> > > accurate and easier to understand. Also, +1 to starting with Airflow 3.
> > >
> > > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think
> we
> > > already have 6 configurations with "heartbeat" in it, and they're
> > different
> > > types of heartbeats.
> > >
> > > Anyways, I am against this name change:
> > > scheduler_zombie_task_threshold -->
> > > scheduler_task_heartbeat_timeout_threshold
> > >
> > > We already have scheduler heartbeat, and let's drop the scheduler word
> > > from this, so that users know that this is Task Instance heartbeat, not
> > > scheduler.
> > >
> > > I also think we should combine "local_task_job_heartbeat_sec" with
> > > "scheduler_zombie_task_threshold". That configurations description says
> > > that it already defaults to zombie task threshold when set to 0. I
> > haven't
> > > dug into the code to see why they are different, but I really hope our
> > > configuration documentation doesn't read like below in the future:
> > > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the
> > > LocalTaskJob should send heartbeat signals to the scheduler to notify
> > it’s
> > > still alive. If this value is set to 0, the heartbeat interval will
> > default
> > > to the value of [scheduler]
> scheduler_task_heartbeat_timeout_threshold."
> > >
> > > Thanks
> > > Shubham
> > >
> > > On 2025-02-11, 1:15 PM, "Karen Braganza" <karenbraganz...@gmail.com
> > <mailto:karenbraganz...@gmail.com>
> > > <mailto:karenbraganz...@gmail.com <mailto:karenbraganz...@gmail.com>>>
> > wrote:
> > >
> > >
> > > CAUTION: This email originated from outside of the organization. Do not
> > > click links or open attachments unless you can confirm the sender and
> > know
> > > the content is safe.
> > >
> > >
> > >
> > >
> > >
> > >
> > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur
> externe.
> > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne
> > pouvez
> > > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain
> > que
> > > le contenu ne présente aucun risque.
> > >
> > >
> > >
> > >
> > >
> > >
> > > Hi,
> > >
> > >
> > > I have been working on this PR
> > > <https://github.com/apache/airflow/pull/46257> <
> > https://github.com/apache/airflow/pull/46257&gt;> <
> > > https://github.com/apache/airflow/pull/46257&gt;> <
> > https://github.com/apache/airflow/pull/46257&amp;gt;&gt;> to update our
> > > documentation
> > > on zombie tasks to reflect the terminology used in the user-facing
> event
> > > logs in Airflow 2.10+. The event logs use the terminology "heartbeat
> > > timeout" whereas the documentation uses the terminology "zombie
> tasks". I
> > > would like to update the documentation to focus on the "heartbeat
> > timeout"
> > > terminology so that users are able to find and understand this
> > > documentation easily when they see a "heartbeat timeout" in the event
> > logs.
> > >
> > >
> > > In the same vein, I think other user-facing configurations should also
> be
> > > updated to use the same terminology. I am proposing that we make the
> > > following changes to Airflow configuration variables:
> > >
> > >
> > > scheduler_zombie_task_threshold --> scheduler_task_heartbeat_
> > > timeout_threshold
> > > zombie_detection_interval --> task_heartbeat_timeout_detection_interval
> > >
> > >
> > > In addition to this, I propose that we also change the logs emitted by
> > the
> > > scheduler to use the "task heartbeat timeout" terminology.
> > >
> > >
> > > For example, the below logs
> > > <
> > >
> >
> https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040
> >
> > <
> >
> https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040&gt
> > ;>
> > > <
> > >
> >
> https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040&gt
> > <
> >
> https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040&amp;gt
> > >
> > > ;>
> > > :
> > > self.log.error(
> > > "Detected zombie job: %s "
> > > "(See https://airflow.apache.org/docs/apache-airflow/"; <
> > https://airflow.apache.org/docs/apache-airflow/&quot;> <
> > > https://airflow.apache.org/docs/apache-airflow/&quot;> <
> > https://airflow.apache.org/docs/apache-airflow/&amp;quot;&gt;>
> > > "stable/core-concepts/tasks.html#zombie-tasks)",
> > > request,
> > > )
> > >
> > >
> > > should become:
> > >
> > >
> > > self.log.error(
> > > "Detected task heartbeat timeout: %s "
> > > "(See https://airflow.apache.org/docs/apache-airflow/"; <
> > https://airflow.apache.org/docs/apache-airflow/&quot;> <
> > > https://airflow.apache.org/docs/apache-airflow/&quot;> <
> > https://airflow.apache.org/docs/apache-airflow/&amp;quot;&gt;>
> > > "stable/core-concepts/tasks.html#zombie-tasks)",
> > > request,
> > > )
> > >
> > >
> > > I wanted to start this discussion to get everyone's thoughts on my
> > > proposal. Do you agree (or disagree) that at least all user-facing
> > elements
> > > of Airflow should use the "task heartbeat timeout" terminology instead
> of
> > > "zombie tasks" for uniformity?
> > >
> > >
> > > I can add all of these changes to my PR.
> > >
> > >
> > > Best,
> > > Karen Braganza
> > >
> > >
> > >
> > >
> > > <
> > >
> >
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval
> >
> > <
> >
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval&gt
> > ;>
> > > <
> > >
> >
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval&gt
> > <
> >
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval&amp;gt
> > >
> > > ;>
> > >
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org <mailto:
> > dev-unsubscr...@airflow.apache.org>
> > > For additional commands, e-mail: dev-h...@airflow.apache.org <mailto:
> > dev-h...@airflow.apache.org>
> > >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > For additional commands, e-mail: dev-h...@airflow.apache.org
> >
>

Reply via email to