potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125629546


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   > I thought about this but ultimately thought it would be weird to impose an artificial limitation like this. I think in some cases a user might want to have multiple runs executing at all times (for example, a job with multiple stages which uses depends_on_past for continuous pipelined execution). Additionally, a similar hazard already exists with schedule_interval="* * * * *" which could create many jobs quite quickly.
   
   I think this is a quite different case, and it is better handled with dynamic tasks (especially once we get "depth-first execution" working) rather than by spawning multiple dag runs. The difference vs. "* * * * *" is that there we know the speed of creating DagRuns will be 1/minute. Full stop. With "continuous" allowing multiple dag runs - what is the speed of creating those? Can we limit it? In the cron case every dag run has its own "per minute" logical date, and displaying it in the UI makes perfect sense. In the "continuous" case we can of course somehow enforce a monotonic logical date, but having 100s of dagruns differing by microseconds sounds strange.
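   To illustrate the alternative suggested above, a minimal sketch of dynamic task mapping (Airflow 2.3+ `.expand()` API; DAG and task names here are purely illustrative, not from this PR):

```python
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def fan_out_example():
    @task
    def list_batches():
        # The number of mapped task instances is determined at runtime
        # by the length of this list.
        return ["a", "b", "c"]

    @task
    def process(batch):
        # Each mapped instance processes one batch; parallelism is capped
        # by normal task-level limits, not by spawning extra dag runs.
        return batch.upper()

    process.expand(batch=list_batches())


fan_out_example()
```

   Here the scheduler tracks one dag run with many mapped task instances, which is exactly what the existing mapped-task UI was built to display.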
   
   I think you would struggle when you try to describe this in the docs in a way that explains to users the intricacies of an approach where max_active_runs is more than one, and what they can expect in detail. For example, you would have to explain that the ramp-up speed of such running dagruns and the maximum number of parallel runs depend not only on scheduler parameters (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#max-dagruns-per-loop-to-schedule) but also on other DAGs being scheduled at the same time (because they will be competing for the same dag run limit), the speed of the scheduler, and the number of schedulers - all connected in a non-trivial way and next to impossible for the user to control, as it will change dynamically. Contrast this with dynamic task mapping, where you have complete control not only over the maximum number of task instances but hopefully, in the near future, also over how many tasks of each mapped task group can run in parallel independently: https://github.com/apache/airflow/pull/29094.
   
   Also, all of that pales in comparison with the very bad UI experience people will have. If people start using continuous to spawn a massive number of parallel dagruns, our UI is (in contrast to dynamic task mapping) absolutely not ready for that. Set max_active_runs to 1000 and try to visualize those dagruns in the current UI. That won't work: no pagination, no way to limit the results and navigate between them, no way to really distinguish one dag run from another. This is - again - in stark contrast to all the investment @bbovenzi made into making the dynamic task UI navigable and usable when you have 100s and 1000s of dynamically mapped tasks (Task Groups soon, I guess).
   
   I'd say the only really good use case for continuous is with max_active_runs = 1, and we should limit it to only that.
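   For what it's worth, that restriction could be a one-line check at DAG-validation time. A minimal sketch in plain Python (the function name and call site are illustrative, not Airflow's actual API):

```python
def validate_continuous_schedule(schedule, max_active_runs):
    """Reject the ambiguous configuration discussed above:
    ``@continuous`` combined with more than one active run."""
    if schedule == "@continuous" and max_active_runs != 1:
        raise ValueError(
            "schedule='@continuous' requires max_active_runs=1, "
            f"got max_active_runs={max_active_runs}"
        )
```

   Non-continuous schedules are unaffected, and the failure surfaces as a parse-time error rather than surprising run-spawning behavior later.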


