+1 for - * No zombies * Change from Airflow 3 * Shubham's proposal to drop scheduler from the proposed new variable name: scheduler_task_heartbeat_timeout_threshold
Thanks, Kunal Bhattacharya On Fri, Feb 14, 2025 at 11:20 AM Amogh Desai <amoghdesai....@gmail.com> wrote: > Thanks for sparking this discussion, Karen. > > +1 to the idea of nuking "zombies". It is a term that requires some deep > understanding and also > is a term that can be misunderstood too: zombie tasks vs zombie processes. > > However, this would affect users who are currently using this config, so as > others mentioned, we should > add it for AF v3 and also add some migration rules for them. > > Yeah, adding "scheduler" to the config seems like overloading the config. I > would also be for a shorter config, > like: `task_instance_heartbeat_timeout`. > > Thanks & Regards, > Amogh Desai > > > On Thu, Feb 13, 2025 at 11:23 PM Mehta, Shubham <shu...@amazon.com.invalid > > > wrote: > > > Yes, I understand that heartbeat is applicable here, which is why I was > > only against 1 of the name change. > > "task_instance_heartbeat_timeout*" is definitely easier to follow and > > contextualize as compared to keeping scheduler prefix. > > > > > > Shubham > > > > On 2025-02-13, 9:18 AM, "Ryan Hatter" <ryan.hat...@astronomer.io.inva > > <mailto:ryan.hat...@astronomer.io.inva>LID> wrote: > > > > > > CAUTION: This email originated from outside of the organization. Do not > > click links or open attachments unless you can confirm the sender and > know > > the content is safe. > > > > > > > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne > pouvez > > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain > que > > le contenu ne présente aucun risque. > > > > > > > > > > > > > > > > > > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think > we > > > already have 6 configurations with "heartbeat" in it, and they're > > different > > > types of heartbeats. > > > > > > > > > > > > > > Anyways, I am against this name change: > > > scheduler_zombie_task_threshold --> > > > scheduler_task_heartbeat_timeout_threshold > > > > > > > > > > I think "heartbeat" is quite accurate here: the worker periodically > emits a > > heartbeat to the database that the scheduler uses to determine the health > > of the worker. What about *task_instance_heartbeat_timeout*? > > > > > > I also think we should combine "local_task_job_heartbeat_sec" with > > > "scheduler_zombie_task_threshold". That configurations description says > > > that it already defaults to zombie task threshold when set to 0. I > > haven't > > > dug into the code to see why they are different, but I really hope our > > > configuration documentation doesn't read like below in the future: > > > > > > > > > > > > > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the > > > LocalTaskJob should send heartbeat signals to the scheduler to notify > > it’s > > > still alive. If this value is set to 0, the heartbeat interval will > > default > > > to the value of [scheduler] > scheduler_task_heartbeat_timeout_threshold." > > > > > > > > > > +1 > > > > > > On Thu, Feb 13, 2025 at 3:37 AM Mehta, Shubham <shu...@amazon.com.inva > > <mailto:shu...@amazon.com.inva>lid> > > wrote: > > > > > > > I also agree with the idea that we should go for a name that's more > > > accurate and easier to understand. Also, +1 to starting with Airflow 3. > > > > > > Tbh "heartbeat" itself is an overused term/concept in Airflow. I think > we > > > already have 6 configurations with "heartbeat" in it, and they're > > different > > > types of heartbeats. > > > > > > Anyways, I am against this name change: > > > scheduler_zombie_task_threshold --> > > > scheduler_task_heartbeat_timeout_threshold > > > > > > We already have scheduler heartbeat, and let's drop the scheduler word > > > from this, so that users know that this is Task Instance heartbeat, not > > > scheduler. > > > > > > I also think we should combine "local_task_job_heartbeat_sec" with > > > "scheduler_zombie_task_threshold". That configurations description says > > > that it already defaults to zombie task threshold when set to 0. I > > haven't > > > dug into the code to see why they are different, but I really hope our > > > configuration documentation doesn't read like below in the future: > > > "local_task_job_heartbeat_sec: The frequency (in seconds) at which the > > > LocalTaskJob should send heartbeat signals to the scheduler to notify > > it’s > > > still alive. If this value is set to 0, the heartbeat interval will > > default > > > to the value of [scheduler] > scheduler_task_heartbeat_timeout_threshold." > > > > > > Thanks > > > Shubham > > > > > > On 2025-02-11, 1:15 PM, "Karen Braganza" <karenbraganz...@gmail.com > > <mailto:karenbraganz...@gmail.com> > > > <mailto:karenbraganz...@gmail.com <mailto:karenbraganz...@gmail.com>>> > > wrote: > > > > > > > > > CAUTION: This email originated from outside of the organization. Do not > > > click links or open attachments unless you can confirm the sender and > > know > > > the content is safe. > > > > > > > > > > > > > > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur > externe. > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne > > pouvez > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain > > que > > > le contenu ne présente aucun risque. > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > I have been working on this PR > > > <https://github.com/apache/airflow/pull/46257> < > > https://github.com/apache/airflow/pull/46257>> < > > > https://github.com/apache/airflow/pull/46257>> < > > https://github.com/apache/airflow/pull/46257&gt;>> to update our > > > documentation > > > on zombie tasks to reflect the terminology used in the user-facing > event > > > logs in Airflow 2.10+. The event logs use the terminology "heartbeat > > > timeout" whereas the documentation uses the terminology "zombie > tasks". I > > > would like to update the documentation to focus on the "heartbeat > > timeout" > > > terminology so that users are able to find and understand this > > > documentation easily when they see a "heartbeat timeout" in the event > > logs. > > > > > > > > > In the same vein, I think other user-facing configurations should also > be > > > updated to use the same terminology. I am proposing that we make the > > > following changes to Airflow configuration variables: > > > > > > > > > scheduler_zombie_task_threshold --> scheduler_task_heartbeat_ > > > timeout_threshold > > > zombie_detection_interval --> task_heartbeat_timeout_detection_interval > > > > > > > > > In addition to this, I propose that we also change the logs emitted by > > the > > > scheduler to use the "task heartbeat timeout" terminology. > > > > > > > > > For example, the below logs > > > < > > > > > > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040 > > > > < > > > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040> > > ;> > > > < > > > > > > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040> > > < > > > https://github.com/apache/airflow/blob/dea2cc9afc61caf49621c3b1923bcf90e96e17e9/airflow/jobs/scheduler_job_runner.py#L2040&gt > > > > > > ;> > > > : > > > self.log.error( > > > "Detected zombie job: %s " > > > "(See https://airflow.apache.org/docs/apache-airflow/" < > > https://airflow.apache.org/docs/apache-airflow/"> < > > > https://airflow.apache.org/docs/apache-airflow/"> < > > https://airflow.apache.org/docs/apache-airflow/&quot;>> > > > "stable/core-concepts/tasks.html#zombie-tasks)", > > > request, > > > ) > > > > > > > > > should become: > > > > > > > > > self.log.error( > > > "Detected task heartbeat timeout: %s " > > > "(See https://airflow.apache.org/docs/apache-airflow/" < > > https://airflow.apache.org/docs/apache-airflow/"> < > > > https://airflow.apache.org/docs/apache-airflow/"> < > > https://airflow.apache.org/docs/apache-airflow/&quot;>> > > > "stable/core-concepts/tasks.html#zombie-tasks)", > > > request, > > > ) > > > > > > > > > I wanted to start this discussion to get everyone's thoughts on my > > > proposal. Do you agree (or disagree) that at least all user-facing > > elements > > > of Airflow should use the "task heartbeat timeout" terminology instead > of > > > "zombie tasks" for uniformity? > > > > > > > > > I can add all of these changes to my PR. > > > > > > > > > Best, > > > Karen Braganza > > > > > > > > > > > > > > > < > > > > > > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval > > > > < > > > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval> > > ;> > > > < > > > > > > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval> > > < > > > https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#zombie-detection-interval&gt > > > > > > ;> > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org <mailto: > > dev-unsubscr...@airflow.apache.org> > > > For additional commands, e-mail: dev-h...@airflow.apache.org <mailto: > > dev-h...@airflow.apache.org> > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > > For additional commands, e-mail: dev-h...@airflow.apache.org > > >