I think a lot of this has been discussed in dev calls but maybe never got 
documented anywhere for folks who aren't/weren't on the call.  I haven't done a 
good job of keeping the AIP up to date since it was approved, or of keeping 
community decisions and discussions easily discoverable, and that's on 
me.  Now that folks can see the feature as-implemented, Amogh raised some good 
questions that are worth addressing here so the rationale is captured somewhere 
persistent. I also want to flag a couple of ideas that came out of that 
discussion as potential follow-up improvements.

The plan has been that we treat a deadline callback as "finishing old work" and 
therefore all callbacks get priority over new tasks.  The core design principle 
here is that deadline callbacks need to be timely — if a deadline fires, the 
user expects to know about it promptly. A deadline callback that gets stuck 
behind the same bottleneck it's trying to alert you about isn't useful.

To that end, what I implemented was effectively two queues:  task_instances get 
queued by priority_weight (as they "always" have), and callbacks get queued in 
FIFO order.  When the scheduler looks for new work it will always fill slots 
from the callback queue first.  Parallelism is honored, but pools and 
max_active_tasks_per_dag are ignored.  The reason for that is twofold.  Let's 
say a user has a Dag with a 10-minute deadline.  There are two cases where that 
deadline can be triggered:

Case 1) This Dag is still running at the deadline, the callback triggers - 
maybe a Slack message that the report is still being generated and may be 
delayed today - while the Dag still runs.

In this case, `pools` and `tasks_per_dag` limits may seem relevant since the 
Dag is still active, but I'd argue that the deadline callback should "break 
through" those pool/task limits and execute regardless.  For example, if the 
tasks stalled because the worker is hitting resource constraints, then tacking 
the deadline callback on behind that roadblock and waiting for the tasks to 
finish before alerting the user that there is a problem defeats the intent of 
the Deadline.  The callback is still bound by the executor-level parallelism, 
so it isn't allowed to run rampant, but it isn't bound by the Dag-level 
constraints.

Case 2) The Dag failed at the 2-minute mark and the callback is triggered at 
that point.  There is no `pool` or `max_tasks` to consider here: the task 
instances should have released their slots and are no longer counted toward the 
active tasks.  Even if other Dags have started up and are claiming pool slots, 
this callback isn't part of those Dags and shouldn't be lumped in with their 
tasks.
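To make the two-queue behavior above concrete, here's a toy sketch in Python.  
This is NOT the actual scheduler code; names like `fill_slots`, 
`callback_queue`, and `PARALLELISM` are illustrative only, not real Airflow 
internals:

```python
from collections import deque
import heapq

# Toy model of the slot-filling described above -- a sketch of the
# priority rules, not the real scheduler.  All names are illustrative.

PARALLELISM = 4  # executor-level cap; callbacks still respect this

callback_queue = deque(["cb_dag_a", "cb_dag_b"])  # strict FIFO

# Task instances ordered by priority_weight (higher runs first),
# so push negated weights onto Python's min-heap.
task_heap = []
for name, weight in [("t_low", 1), ("t_high", 10), ("t_mid", 5)]:
    heapq.heappush(task_heap, (-weight, name))

def fill_slots():
    """Pick work for one scheduling pass: callbacks first, then tasks."""
    chosen = []
    while len(chosen) < PARALLELISM:
        if callback_queue:
            # Callbacks always win a free slot, ignoring pools and
            # max_active_tasks_per_dag (but not PARALLELISM).
            chosen.append(callback_queue.popleft())
        elif task_heap:
            # Remaining slots go to tasks by priority_weight.
            # (Pool and per-dag checks would apply here; omitted.)
            chosen.append(heapq.heappop(task_heap)[1])
        else:
            break
    return chosen

result = fill_slots()
print(result)  # ['cb_dag_a', 'cb_dag_b', 't_high', 't_mid']
```

The point of the sketch is just the ordering: both callbacks land before any 
task, the leftover slots go to the highest-weight tasks, and the lowest-weight 
task waits for the next pass.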

A `max_executor_callbacks` setting that parallels max_executor_tasks is one 
idea that has come up.  It's not a bad idea; I guess if the whole building is 
burning down, you don't need to be told that each floor is burning.  It feels 
a bit against the intent of the callbacks getting prioritized over tasks, but 
if it's a user-defined option then that's on the user.  It might be worth 
considering as a follow-up improvement if anyone thinks it's something we 
really need.

The other decision that seems to be contentious is that callbacks are FIFO 
instead of implementing a `callback-priority-weight`.  FIFO seems fine for a 
callback queue and for how I envision the feature being used, but maybe once 
the feature gets into the hands of users they'll find a need for it.  We as a 
community have been saying for a while that there are way too many user-facing 
knobs in the settings and this felt like a logical place for us to be 
opinionated in the code.  With FIFO, the feature's behavior is straightforward 
and predictable, and it's easier to add the weight later if there is demand 
than it would be to remove it later if we decide to prune the config options in 
a future version.
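To illustrate that last point, here's a small sketch of FIFO ordering next to 
the weighted variant that could be layered on later.  To be clear, 
`callback_priority_weight` is purely hypothetical and none of these names are 
real Airflow APIs:

```python
from collections import deque
import heapq

# Sketch of the FIFO decision above, with a HYPOTHETICAL
# callback priority weight shown for contrast.  Illustrative only.

arrivals = ["cb_first", "cb_second", "cb_third"]

# Current behavior: plain FIFO -- run order equals arrival order.
fifo = deque(arrivals)
fifo_order = [fifo.popleft() for _ in range(len(arrivals))]

# If real demand appears later, a weight could be layered on top.
# Ties break on arrival order, so anyone using the default weight
# would still see FIFO behavior -- which is why adding it later is
# cheap, while removing a shipped knob is a breaking change.
weights = {"cb_first": 0, "cb_second": 5, "cb_third": 0}  # hypothetical
heap = [(-weights[c], i, c) for i, c in enumerate(arrivals)]
heapq.heapify(heap)
weighted_order = [heapq.heappop(heap)[2] for _ in range(len(arrivals))]

print(fifo_order)      # ['cb_first', 'cb_second', 'cb_third']
print(weighted_order)  # ['cb_second', 'cb_first', 'cb_third']
```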

For now, I think we're in a good place to ship with the current behavior and we 
can iterate based on real-world usage.  I'll cut some Issues to make sure the 
follow-up ideas from here and the PR are tracked so they don't get lost.  If 
anyone has concerns about the current behavior that should be addressed before 
launch, let me know.


ferruzzi