repl-chris commented on code in PR #23432:
URL: https://github.com/apache/airflow/pull/23432#discussion_r867250315
##########
airflow/config_templates/config.yml:
##########
@@ -1788,6 +1788,13 @@
type: string
example: ~
default: "False"
+ - name: stuck_queued_task_check_interval
Review Comment:
the orphan/adoption stuff doesn't really have an appropriate setting to use
instead afaik - the "adopted task timeout" runs on every heartbeat, which is
fine because it's very cheap unless a task has actually timed out....in
contrast, this check will (minimally) issue a database query every time it
runs, which we probably don't want to do on every heartbeat.
Alternatively, instead of issuing the query to find these tasks we could
internally track "sent to celery time" and "last known state" by task....then
we wouldn't need to issue the query and we could run on every heartbeat
instead...I kinda like this option, but if we're talking about separating
"lost" from "just in a long queue" then it starts to get sketchy (see separate
comment below)
In theory this feature also largely makes
`_check_for_stalled_adopted_tasks()` unnecessary, as "stalled adopted tasks"
and "lost tasks" are effectively the same thing. I didn't remove
`_check_for_stalled_adopted_tasks()` though as I figured it added unnecessary
risk, and I was trying to stay true to the design of the original PR by
ephraimbuddy. A stalled adopted task, I guess, also has a slight behavioural
difference - it fails the task rather than re-scheduling it (which I'm not
entirely convinced is really correct)
##########
airflow/config_templates/config.yml:
##########
@@ -1788,6 +1788,13 @@
type: string
example: ~
default: "False"
+ - name: stuck_queued_task_check_interval
Review Comment:
the orphan/adoption stuff doesn't really have an appropriate setting to use
instead afaik - the "adopted task timeout" runs on every heartbeat, which is
fine because it's very cheap unless a task has actually timed out....in
contrast, this check will (minimally) issue a database query every time it
runs, which we probably don't want to do on every heartbeat.
Alternatively, instead of issuing the query to find these tasks we could
internally track "sent to celery time" and "last known state" by task....then
we wouldn't need to issue the query and we could run on every heartbeat
instead...I kinda like this option, but if we're talking about separating
"lost" from "just in a long queue" then it starts to get sketchy (see separate
comment below)
In theory this feature also largely makes
`_check_for_stalled_adopted_tasks()` unnecessary, as "stalled adopted tasks"
and "lost tasks" are effectively the same thing. I didn't remove
`_check_for_stalled_adopted_tasks()` though as I figured it added unnecessary
risk, and I was trying to stay true to the design of the original PR by
ephraimbuddy. A stalled adopted task, I guess, also has a slight behavioural
difference - it fails the task rather than re-scheduling it (which I'm not
entirely convinced is really correct)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]