potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125629546
##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
return DagRunInfo.exact(run_after)
+class ContinuousTimetable(_TrivialTimetable):
+ """Timetable that schedules continually, while still respecting start_date
and end_date
+
+ This corresponds to ``schedule="@continuous"``.
+ """
+
+ description: str = "As frequently as possible while still obeying
max_active_runs"
Review Comment:
> I thought about this but ultimately thought it would be weird to impose an artificial limitation like this. I think in some cases a user might want to have multiple runs executing at all times (for example, a job with multiple stages which uses depends_on_past for continuous pipelined execution). Additionally, a similar hazard already exists with schedule_interval="* * * * *" which could create many jobs quite quickly.
I think this is a quite different case, and it is better handled with dynamic tasks (especially once we get "depth-first execution" working) rather than by spawning multiple dag runs. The difference vs. `* * * * *` is that there we know the speed of creating DagRuns will be 1/minute. Full stop. With "continuous" allowing multiple dag runs - what is the speed of creating those? Can we limit it? In the `* * * * *` case, every dag run has its own "per minute" logical date, and displaying it in the UI makes perfect sense. In the case of "continuous" we can of course enforce a monotonic logical date somehow, but having 100s of dagruns differing by microseconds sounds strange (and we will very likely have a race problem with multiple schedulers trying to create dag runs with the exact same logical date at the same time).
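To make the rate argument concrete, here is a toy, plain-Python simulation (not Airflow code; the function names and loop-speed parameter are made up for illustration): with a cron schedule the run-creation rate is fixed and every run gets a distinct logical date, while a "continuous" timetable with multiple active runs creates runs as fast as the scheduler loop spins - a rate the user does not control.

```python
from datetime import datetime, timedelta

def cron_runs(start: datetime, minutes: int) -> list[datetime]:
    # "* * * * *": exactly one run per minute, each with a distinct
    # per-minute logical date.
    return [start + timedelta(minutes=i) for i in range(minutes)]

def continuous_runs(start: datetime, loop_iterations: int,
                    seconds_per_loop: float) -> list[datetime]:
    # "@continuous" with max_active_runs > 1: a new run could be created
    # on every scheduler loop, so the rate depends on loop speed, DB
    # latency, and competing DAGs - logical dates end up fractions of a
    # second apart.
    return [start + timedelta(seconds=i * seconds_per_loop)
            for i in range(loop_iterations)]

runs_cron = cron_runs(datetime(2023, 3, 1), 10)             # 10 runs in 10 minutes
runs_cont = continuous_runs(datetime(2023, 3, 1), 10, 0.5)  # 10 runs in 5 seconds
```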
I think when you try to describe it in the docs in a way that explains to users the intricacies of such an approach - where max_active_runs is greater than one - and spell out in detail what they can expect, you will see the problem clearly. For example, you will have to explain that the speed of ramping up such running dagruns, and the maximum number of parallel runs, will depend not only on scheduler parameters (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#max-dagruns-per-loop-to-schedule) but also on other DAGs being scheduled at the same time (because they will be competing for the same dag run limit), the speed of the DB, the scheduler, and the number of schedulers you run - all connected in a non-trivial way and next to impossible for the user to control, as it will change dynamically. Contrast this with dynamic task mapping, where you have complete control not only over the maximum number of task instances but hopefully, in the near future, also over how many tasks in each mapped task group can run in parallel, independently:
https://github.com/apache/airflow/pull/29094.
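The kind of control in question can be sketched in plain Python (a toy stand-in, not Airflow's implementation): 100 "mapped" work items, but a hard, user-chosen cap on how many run in parallel - analogous to capping mapped task instances, and unlike the emergent, uncontrollable rate of many continuous dag runs.

```python
from concurrent.futures import ThreadPoolExecutor

def process(item: int) -> int:
    # Stand-in for a mapped task's work.
    return item * 2

# 100 work items, but at most 4 executing at any moment: the cap is an
# explicit parameter the user sets, independent of scheduler speed or
# other workloads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, range(100)))
```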
Also, all of this pales in comparison with the very bad UI experience people will have. If people start using continuous to spawn a massive number of parallel dagruns - our UI is (in contrast to dynamic task mapping) absolutely not ready for that. Set max_active_runs to 1000 and try to visualize those dagruns in the current UI. That won't work. No pagination, no way to limit the results and navigate between them, no way to really distinguish one dag run from another. This is - again - in stark contrast to all the investment that @bbovenzi made into making the dynamic task UI navigable and usable when you have 100s and 1000s of dynamically mapped tasks (Task Groups soon, I hope).
I'd say the only really good use case for continuous is max_active_runs = 1, and we should limit it to only that.
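For reference, the usage being argued for would look roughly like this (a sketch assuming this PR's `@continuous` preset lands as proposed; the DAG and task names are invented for illustration):

```python
# Sketch: a DAG that re-runs back to back, with max_active_runs=1 so
# exactly one run exists at a time - the mode argued for above.
import pendulum
from airflow.decorators import dag, task

@dag(
    schedule="@continuous",  # this PR's ContinuousTimetable preset
    start_date=pendulum.datetime(2023, 3, 1, tz="UTC"),
    max_active_runs=1,       # one run at a time; the next starts when it finishes
    catchup=False,
)
def continuous_listener():
    @task
    def poll():
        ...  # e.g. poll an external queue; the DAG re-runs as soon as this ends

    poll()

continuous_listener()
```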
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]