potiuk commented on pull request #19361: URL: https://github.com/apache/airflow/pull/19361#issuecomment-968392532
I thought a bit more about it, also looked at #17010, and I side with @ashb on this one. I really think adding time-based task scheduling brings very little value and we should not do it this way.

I think there are two things here:

1) Being able to choose, based on a crontab expression, the "path" the DAG should follow for the "current date" (we need to define what the current date is here, BTW).
2) Making the scheduling decision in this case much faster and with less overhead than could be achieved with, say, `BranchPythonOperator`. Such decisions could be taken by the scheduler if task-based scheduling is implemented, and it should be done in a "declarative" rather than "imperative" way.

Now, I think 2) is really an optimisation, and I would not really like to change the whole Task definition and add complexity to the scheduler. I think we can do it much more simply, following the implementation we've done for DummyOperator. Why don't we make a specialized `CronBranchOperator`? This operator could make branching decisions similarly to `BranchPythonOperator`, but we could optimize it in the scheduler: rather than executing the task, the scheduler could evaluate the declarative specification of the operator during scheduling and set the state of the tasks appropriately, without the overhead of running the task. I am not 100% sure, but I think this would be a relatively simple change in the scheduler and it would not necessarily require any serious modification of Airflow's behaviour. We could even add a set of other "FastScheduled, declarative" tasks for similar cases, where the scheduler could "evaluate" such a task and perform all the subsequent scheduling decisions immediately after.
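To make the idea concrete, here is a minimal sketch of what a declarative cron-branch specification could look like. It is plain Python with no Airflow imports; `CronBranchSpec`, `cron_matches` and `choose_branch` are hypothetical names, not existing Airflow APIs, and the cron matcher is deliberately simplified (five numeric fields only, and all fields must match). The date is passed in explicitly, which leaves open the question of what the "current date" should be.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Set, Tuple


def _expand_field(field: str, low: int, high: int) -> Set[int]:
    """Expand one cron field such as "*/15", "1-5" or "0,30" into a set of ints."""
    values: Set[int] = set()
    for part in field.split(","):
        step = 1
        has_step = "/" in part
        if has_step:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = int(part)
            # "5/15" means "from 5 up to the maximum, every 15".
            end = high if has_step else start
        values.update(range(start, end + 1, step))
    return values


def cron_matches(expression: str, when: datetime) -> bool:
    """Return True if `when` matches a five-field cron expression.

    Simplification (an assumption of this sketch): every field must match,
    whereas classic cron ORs day-of-month and day-of-week when both are set.
    """
    minute, hour, day, month, weekday = expression.split()
    # Cron uses 0 (or 7) for Sunday; datetime.weekday() uses 0 for Monday.
    cron_weekday = (when.weekday() + 1) % 7
    return (
        when.minute in _expand_field(minute, 0, 59)
        and when.hour in _expand_field(hour, 0, 23)
        and when.day in _expand_field(day, 1, 31)
        and when.month in _expand_field(month, 1, 12)
        and cron_weekday in {v % 7 for v in _expand_field(weekday, 0, 6)}
    )


@dataclass(frozen=True)
class CronBranchSpec:
    """Hypothetical declarative spec: pure data, no user code to execute."""

    rules: Tuple[Tuple[str, str], ...]  # (cron expression, downstream task id)
    default_task_id: str


def choose_branch(spec: CronBranchSpec, logical_date: datetime) -> str:
    """Return the task id of the first rule whose cron expression matches."""
    for expression, task_id in spec.rules:
        if cron_matches(expression, logical_date):
            return task_id
    return spec.default_task_id


spec = CronBranchSpec(
    rules=(
        ("0 0 * * 1-5", "weekday_load"),
        ("0 0 * * 0,6", "weekend_load"),
    ),
    default_task_id="skip_load",
)
# 2021-11-15 was a Monday, so the first rule matches at midnight.
print(choose_branch(spec, datetime(2021, 11, 15)))  # weekday_load
```

Because the spec is pure data and `choose_branch` has no side effects, a scheduler could in principle evaluate it inline and mark the non-selected branches as skipped. How that would plug into the scheduler loop is not shown here.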
And from the DAG writer's point of view it is even better, because IMHO it feels much more natural to choose a "path" in the DAG based on a task which clearly specifies "I decide on the path based on time", rather than on a "task attribute", as would be the case if we introduced task-level include/exclude. It would also decouple the "time-dag-processing" from all the tasks and offers the opportunity to better show all time-based decisions in the UI. IMHO tasks should only do one thing, and we seem to be asking them to do more than one thing: decide whether to run, and run.

Correct me if I am wrong, @ashb, but I believe the way the scheduler works now is ideally suited for this kind of optimisation: it runs in "mini-batches", and such a state update for declarative evaluation of "fixed" task types is something the scheduler might cope with very well. I think this is eventually even more "extensible" than embedding task-level scheduling, because we could have more "specialized fast evaluated" tasks if we find the pattern works well.
