repl-chris commented on code in PR #23432:
URL: https://github.com/apache/airflow/pull/23432#discussion_r867250315


##########
airflow/config_templates/config.yml:
##########
@@ -1788,6 +1788,13 @@
       type: string
       example: ~
       default: "False"
+    - name: stuck_queued_task_check_interval

Review Comment:
   the orphan/adoption stuff doesn't really have an appropriate setting to use 
instead afaik - the "adopted task timeout" runs on every heartbeat, which is 
fine because it's very cheap unless a task has actually timed out....in 
contrast, this check will (minimally) issue a database query every time it 
runs, which we probably don't want to do on every heartbeat.
   
   Alternatively, instead of issuing the query to find these tasks we could 
internally track "sent to celery time" and "last known state" by task....then 
we wouldn't need to issue the query and we could run on every heartbeat 
instead...I kinda like this option, but if we're talking about separating 
"lost" from "just in a long queue" then it starts to get sketchy (see separate 
comment below)
   
   In theory this feature also largely makes 
`_check_for_stalled_adopted_tasks()` unnecessary, as "stalled adopted tasks" 
and "lost tasks" are effectively the same thing. I didn't remove 
`_check_for_stalled_adopted_tasks()` though as I figured it added unnecessary 
risk, and I was trying to stay true to the design of the original PR by 
ephraimbuddy. A stalled adopted task, I guess, also has a slight behavioural 
difference - it fails the task rather than re-scheduling it (which I'm not 
entirely convinced is really correct)



##########
airflow/config_templates/config.yml:
##########
@@ -1788,6 +1788,13 @@
       type: string
       example: ~
       default: "False"
+    - name: stuck_queued_task_check_interval

Review Comment:
   the orphan/adoption stuff doesn't really have an appropriate setting to use 
instead afaik - the "adopted task timeout" runs on every heartbeat, which is 
fine because it's very cheap unless a task has actually timed out....in 
contrast, this check will (minimally) issue a database query every time it 
runs, which we probably don't want to do on every heartbeat.
   
   Alternatively, instead of issuing the query to find these tasks we could 
internally track "sent to celery time" and "last known state" by task....then 
we wouldn't need to issue the query and we could run on every heartbeat 
instead...I kinda like this option, but if we're talking about separating 
"lost" from "just in a long queue" then it starts to get sketchy (see separate 
comment below)
   
   In theory this feature also largely makes 
`_check_for_stalled_adopted_tasks()` unnecessary, as "stalled adopted tasks" 
and "lost tasks" are effectively the same thing. I didn't remove 
`_check_for_stalled_adopted_tasks()` though as I figured it added unnecessary 
risk, and I was trying to stay true to the design of the original PR by 
ephraimbuddy. A stalled adopted task, I guess, also has a slight behavioural 
difference - it fails the task rather than re-scheduling it (which I'm not 
entirely convinced is really correct)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to