potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125629546
##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
return DagRunInfo.exact(run_after)
+class ContinuousTimetable(_TrivialTimetable):
+ """Timetable that schedules continually, while still respecting start_date
and end_date
+
+ This corresponds to ``schedule="@continuous"``.
+ """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"
Review Comment:
> I thought about this but ultimately thought it would be weird to impose an artificial limitation like this. I think in some cases a user might want to have multiple runs executing at all times (for example, a job with multiple stages which uses depends_on_past for continuous pipelined execution). Additionally, a similar hazard already exists with schedule_interval="* * * * *" which could create many jobs quite quickly.
I think this is a quite different case, and it is better handled with dynamic tasks (especially once we get "depth-first execution" working) rather than by spawning multiple DAG runs. The difference vs. `* * * * *` is that there we know the speed of creating DagRuns will be 1/minute. Full stop. With "continuous" allowing multiple DAG runs, what is the speed of creating them? Can we limit it? In the `* * * * *` case, every DAG run has its own "per minute" logical date, and displaying it in the UI makes perfect sense. In the "continuous" case we could of course enforce a monotonic logical date somehow, but having 100s of DagRuns differing by microseconds sounds strange.
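Just to make the "differing by microseconds" point concrete, here is a toy sketch (plain Python, not Airflow code; the function name is illustrative) of what enforcing a monotonic logical date for back-to-back continuous runs would look like:

```python
from datetime import datetime, timedelta, timezone


def next_logical_date(previous, now=None):
    """Toy illustration: return a strictly increasing logical date.

    When runs are created faster than the clock resolution, consecutive
    logical dates end up differing by as little as one microsecond.
    """
    now = now or datetime.now(timezone.utc)
    if previous is None or now > previous:
        return now
    # Clock has not advanced past the previous run: bump by 1 microsecond.
    return previous + timedelta(microseconds=1)
```

With hundreds of runs per second, the UI would show rows of logical dates distinguishable only in their microsecond digits, which is exactly the usability problem described above.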
I think when you try to describe this in the docs, you will have to explain to users the intricacies of an approach with `max_active_runs` greater than one, and what they can expect, in detail. For example, you will have to explain that the ramp-up speed of such running DagRuns and the maximum number of parallel runs will depend not only on scheduler parameters (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#max-dagruns-per-loop-to-schedule) but also on other DAGs being scheduled at the same time (because they will be competing for the same DAG run limit), the speed of the scheduler, and the number of schedulers, all connected in a non-trivial way and next to impossible for the user to control, as it will change dynamically. Contrast this with dynamic task mapping, where you have complete control not only over the maximum number of task instances but hopefully in the near future also, independently for each mapped task group, over how many of those tasks can run in parallel: https://github.com/apache/airflow/pull/29094.
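To illustrate the kind of control I mean, here is a toy model (plain Python, not Airflow code) of a per-group parallelism cap: a hypothetical limit of 2 bounds how many "mapped tasks" run concurrently, regardless of how many workers are available:

```python
import concurrent.futures
import threading

# Toy model of a per-mapped-group concurrency limit: at most 2 tasks
# from this group run at the same time, even with 8 workers available.
group_limit = threading.Semaphore(2)


def mapped_task(i):
    with group_limit:
        return i * i


with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(mapped_task, range(6)))
```

The point is that the ceiling is a single, explicit number set by the DAG author, not an emergent property of scheduler speed and cluster-wide contention.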
Also, those concerns pale in comparison with the very bad UI experience people will have. If people start using continuous to spawn a massive number of parallel DagRuns, our UI is (in contrast to dynamic task mapping) absolutely not ready for that. Set max_active_runs to 1000 and try to visualize those DagRuns in the current UI. That won't work: no pagination, no way to limit the results and navigate between them, no way to really distinguish one DAG run from another. This is, again, in stark contrast with all the investment @bbovenzi made into making the dynamic task UI navigable and usable when you have 100s and 1000s of dynamically mapped tasks (Task Groups soon, I guess).
I'd say the only really good use case for continuous is with max_active_runs = 1, and we should limit it to only that.
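Something along these lines could enforce the restriction at DAG validation time (a minimal sketch; the function name and call site are illustrative, not the actual Airflow API):

```python
def validate_continuous_schedule(schedule, max_active_runs):
    """Hypothetical guard: reject @continuous DAGs that allow more
    than one active run at a time.

    Other schedules are left untouched.
    """
    if schedule == "@continuous" and max_active_runs != 1:
        raise ValueError(
            "schedule='@continuous' requires max_active_runs=1, "
            f"got {max_active_runs}"
        )
```

That way the user gets a clear import-time error instead of discovering the scheduler/UI behavior described above the hard way.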
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]