potiuk commented on pull request #19361:
URL: https://github.com/apache/airflow/pull/19361#issuecomment-968392532


   I thought a bit more about it and also looked at #17010, and I side with 
@ashb on this one. I really think adding time-based task scheduling brings very 
little value and we should not do it this way.
   
   I think there are two things here:
   
   1) We want to be able to choose, based on a crontab, the "path" the DAG 
should follow for the "current date" (we need to define what the current date 
is here, BTW).
   2) We want to make the scheduling decision in this case much faster and 
with less overhead than could be achieved by, say, BranchPythonOperator. Such 
decisions could be taken by the scheduler if task-based scheduling is 
implemented, and it should be done in a "declarative" rather than "imperative" 
way.
   
   Now - I think 2) is really an optimisation, and I would not really like 
to change the whole Task definition and add complexity to the scheduler. And I 
think we can do it much more simply - following the implementation we've done 
for DummyOperator.
   
   Why don't we make a specialized `CronBranchOperator`? This operator could 
make branching decisions similarly to BranchPythonOperator, but we could 
optimize it in the scheduler: rather than executing the task, the scheduler 
could evaluate the declarative specification of the operator during scheduling 
and set the state of the tasks appropriately, without the overhead of running 
the task.
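
   A minimal sketch of what such a declarative specification could look like, 
under stated assumptions. `CronBranchOperator` does not exist in Airflow; the 
`CronBranchSpec` class, its field names and the `cron_matches` helper below are 
all hypothetical. The cron matcher is deliberately simplified (five numeric 
fields, all fields ANDed together, no names like `MON`, no `7` for Sunday), so 
it only illustrates the idea that the branch can be chosen from data alone, 
without running user code in a worker.

```python
from dataclasses import dataclass
from datetime import datetime


def _field_matches(spec: str, value: int, lo: int, hi: int) -> bool:
    """Match one cron field. Supports '*', 'a', 'a-b', comma lists and '/step'."""
    for part in spec.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            # 'a/step' means "from a to the maximum, every step"
            end = hi if step else start
        if start <= value <= end and (value - start) % step_n == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Simplified check: every one of the five fields must match `when`."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = when.isoweekday() % 7  # cron convention: 0 = Sunday ... 6 = Saturday
    return (
        _field_matches(minute, when.minute, 0, 59)
        and _field_matches(hour, when.hour, 0, 23)
        and _field_matches(dom, when.day, 1, 31)
        and _field_matches(month, when.month, 1, 12)
        and _field_matches(dow, cron_dow, 0, 6)
    )


@dataclass(frozen=True)
class CronBranchSpec:
    """Hypothetical declarative description of a time-based branch."""

    cron: str              # e.g. "0 0 * * 1-5"
    follow_if_match: str   # task_id to follow when the date matches
    follow_otherwise: str  # task_id to follow when it does not

    def choose(self, logical_date: datetime) -> str:
        """Pure function of the spec and the date: cheap enough for a scheduler loop."""
        if cron_matches(self.cron, logical_date):
            return self.follow_if_match
        return self.follow_otherwise


spec = CronBranchSpec("0 0 * * 1-5", "weekday_path", "weekend_path")
print(spec.choose(datetime(2021, 11, 15)))  # a Monday at midnight
```

   Because `choose` depends only on the spec and the date passed in, what 
"current date" means stays an explicit input rather than something hidden 
inside the task.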
   
   I am not 100% sure - but I think this would be a relatively simple change in 
the scheduler and it would not necessarily require any serious modification of 
Airflow's behaviour. We could even add a set of other "fast-scheduled, 
declarative" tasks for similar cases, where the scheduler could "evaluate" such 
a task and perform all the subsequent scheduling decisions immediately after. 
And from the DAG writer's point of view it is even better, because it feels 
much more natural, IMHO, to choose a "path" in the DAG based on a task which 
clearly specifies "I decide on the path based on time", rather than on a "task 
attribute", as would be the case if we introduced task-level include/exclude. 
It would also decouple the "time-dag-processing" from all the tasks, and it 
opens the opportunity to better show all time-based decisions in the UI. IMHO 
tasks should only do one thing, and we seem to be asking them to do more than 
one thing - decide whether to run, and run.
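
   The "evaluate during scheduling" idea can be illustrated with a toy model. 
This is not Airflow's scheduler code: the `Task` class, the `decide` attribute, 
the `schedule_batch` function and the plain-string states are all assumptions 
made for the sketch, and it only skips direct downstream tasks rather than 
propagating skips through the whole DAG.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """Toy task model (hypothetical, not Airflow's BaseOperator)."""

    task_id: str
    downstream: List[str] = field(default_factory=list)
    # When set, the branch can be decided from the date alone,
    # so the task never needs to be sent to an executor.
    decide: Optional[Callable[[datetime], str]] = None


def schedule_batch(
    tasks: Dict[str, Task], ready: List[str], logical_date: datetime
) -> Dict[str, str]:
    """Process one mini-batch of ready tasks and return the states that were set."""
    states: Dict[str, str] = {}
    for task_id in ready:
        task = tasks[task_id]
        if task.decide is None:
            # Ordinary task: hand it over to the executor as usual.
            states[task_id] = "queued"
            continue
        # Declarative task: resolve it inline, without running anything.
        chosen = task.decide(logical_date)
        states[task_id] = "success"
        for downstream_id in task.downstream:
            if downstream_id != chosen:
                states[downstream_id] = "skipped"
    return states


tasks = {
    "pick_path": Task(
        "pick_path",
        downstream=["weekday_report", "weekend_report"],
        decide=lambda d: "weekday_report" if d.isoweekday() <= 5 else "weekend_report",
    ),
    "weekday_report": Task("weekday_report"),
    "weekend_report": Task("weekend_report"),
    "extract": Task("extract"),
}
print(schedule_batch(tasks, ["pick_path", "extract"], datetime(2021, 11, 15)))
```

   In this model the branching task and the task doing the work stay separate: 
`pick_path` only decides, and the report tasks only run.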
   
   Correct me if I am wrong, @ashb, but I believe the way the scheduler works 
now is ideally suited for this kind of optimisation - it runs in "mini-batches", 
and such a state update for declarative evaluation of "fixed" task types is 
something the scheduler might cope with very well. I think this is eventually 
even more "extensible" than embedding task-level scheduling, because we could 
add more "specialized, fast-evaluated" tasks if we find the pattern works well.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


