potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125629546
##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
return DagRunInfo.exact(run_after)
+class ContinuousTimetable(_TrivialTimetable):
+ """Timetable that schedules continually, while still respecting start_date
and end_date
+
+ This corresponds to ``schedule="@continuous"``.
+ """
+
+ description: str = "As frequently as possible while still obeying
max_active_runs"
Review Comment:
> I thought about this but ultimately thought it would be weird to impose an artificial limitation like this. I think in some cases a user might want to have multiple runs executing at all times (for example, a job with multiple stages which uses depends_on_past for continuous pipelined execution). Additionally, a similar hazard already exists with schedule_interval="* * * * *" which could create many jobs quite quickly.
I think this is a quite different case, and it is better handled with dynamic tasks (especially once we get "depth-first execution" working) rather than by spawning multiple dag runs. The difference vs. `* * * * *` is that there we know the speed of creating DagRuns will be 1/minute. Full stop. With "continuous" allowing multiple dag runs - what is the speed of creating those? Can we limit it? In the `* * * * *` case, every dag run has its own "per minute" logical date, and displaying it in the UI makes perfect sense. In the case of "continuous" we can of course enforce a monotonic logical date somehow, but having 100s of dagruns differing by microseconds sounds strange (and we will very likely have a race problem with multiple schedulers trying to create dag runs with the exact same logical date at the same time).
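To make the rate argument concrete, here is a toy, plain-Python simulation (not Airflow code; the function names and loop-speed parameter are made up for illustration): with a cron schedule the run-creation rate is fixed and every run gets a distinct logical date, while a "continuous" timetable with multiple active runs creates runs as fast as the scheduler loop spins - a rate the user does not control.

```python
from datetime import datetime, timedelta

def cron_runs(start: datetime, minutes: int) -> list[datetime]:
    # "* * * * *": exactly one run per minute, each with a distinct
    # per-minute logical date.
    return [start + timedelta(minutes=i) for i in range(minutes)]

def continuous_runs(start: datetime, loop_iterations: int,
                    seconds_per_loop: float) -> list[datetime]:
    # "@continuous" with max_active_runs > 1: a new run could be created
    # on every scheduler loop, so the rate depends on loop speed, DB
    # latency, and competing DAGs - logical dates end up fractions of a
    # second apart.
    return [start + timedelta(seconds=i * seconds_per_loop)
            for i in range(loop_iterations)]

runs_cron = cron_runs(datetime(2023, 3, 1), 10)             # 10 runs in 10 minutes
runs_cont = continuous_runs(datetime(2023, 3, 1), 10, 0.5)  # 10 runs in 5 seconds
```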
I think when you try to describe it in the docs in a way that explains to users the intricacies of such an approach - where max_active_runs is greater than one - and spell out in detail what they can expect, you will see the problem clearly. For example, you will have to explain that the speed of ramping up such running dagruns, and the maximum number of parallel runs, will depend not only on scheduler parameters (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#max-dagruns-per-loop-to-schedule) but also on other DAGs being scheduled at the same time (because they will be competing for the same dag run limit), the speed of the DB, the scheduler, and the number of schedulers you run - all connected in a non-trivial way and next to impossible for the user to control, as it will change dynamically. Contrast this with dynamic task mapping, where you have complete control not only over the maximum number of task instances but hopefully, in the near future, also over how many tasks in each mapped task group can run in parallel, independently:
https://github.com/apache/airflow/pull/29094.
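The kind of control in question can be sketched in plain Python (a toy stand-in, not Airflow's implementation): 100 "mapped" work items, but a hard, user-chosen cap on how many run in parallel - analogous to capping mapped task instances, and unlike the emergent, uncontrollable rate of many continuous dag runs.

```python
from concurrent.futures import ThreadPoolExecutor

def process(item: int) -> int:
    # Stand-in for a mapped task's work.
    return item * 2

# 100 work items, but at most 4 executing at any moment: the cap is an
# explicit parameter the user sets, independent of scheduler speed or
# other workloads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, range(100)))
```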
Also, all of this pales in comparison with the very bad UI experience people will have. If people start using continuous to spawn a massive number of parallel dagruns - our UI is (in contrast to dynamic task mapping) absolutely not ready for that. Set max_active_runs to 1000 and try to visualize those dagruns in the current UI. That won't work. No pagination, no way to limit the results and navigate between them, no way to really distinguish one dag run from another. This is - again - in stark contrast to all the investment that @bbovenzi made into making the dynamic task UI navigable and usable when you have 100s and 1000s of dynamically mapped tasks (Task Groups soon, I hope).
I'd say the only really good use case for continuous is max_active_runs = 1, and we should limit it to only that.
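For reference, the usage being argued for would look roughly like this (a sketch assuming this PR's `@continuous` preset lands as proposed; the DAG and task names are invented for illustration):

```python
# Sketch: a DAG that re-runs back to back, with max_active_runs=1 so
# exactly one run exists at a time - the mode argued for above.
import pendulum
from airflow.decorators import dag, task

@dag(
    schedule="@continuous",  # this PR's ContinuousTimetable preset
    start_date=pendulum.datetime(2023, 3, 1, tz="UTC"),
    max_active_runs=1,       # one run at a time; the next starts when it finishes
    catchup=False,
)
def continuous_listener():
    @task
    def poll():
        ...  # e.g. poll an external queue; the DAG re-runs as soon as this ends

    poll()

continuous_listener()
```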
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]