potiuk commented on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-924093846


   > I don't see the need for a dedicated backfill process to run, the 
scheduler could take care of that I believe, if tasks and dags are idempotent 
they don't even need to care about execution order, if order matters I guess 
`depends_on_past` should be set on the tasks(?) and the scheduler should handle 
it(?)
   
   Not currently, as far as I understand how the Scheduler works. The 
Scheduler is DAG-based, not individual-task-based. It looks at the DAGs and 
task dependencies for the "future" runs, then schedules and executes them. 
There is no way (as I understand how the scheduler runs) to get it to start, 
monitor, send for execution, and oversee to completion selected tasks from a 
selected DAG run in the past. In the current architecture the scheduler only 
looks ahead (possibly starting from the past if the DAG has never been run): 
it determines which tasks should run next for each DAG and sends them to the 
executors, but there is no scheduling of selected tasks in the past. The 
database queries, the scheduler loop, and the selection of which tasks to run 
next and when are heavily optimized for that use case, and you would not be 
able to use them for re-running tasks without a pretty much complete overhaul.
   
   But maybe I am wrong and do not understand well enough how the scheduler 
works. I am happy to be corrected if I am wrong here, and I would love to hear 
from others who understand better how the scheduler works.
   
   From what I know this will likely change in the future, when the Scheduler 
becomes more "task-based" (this is planned and will likely be implemented in 
2.3 or 2.4). Once this is done, the behaviour you describe will be possible, 
but it is quite a big effort that changes the behaviour of the scheduler as 
well as allowing DAG versioning. This is yet another reason why implementing 
backfill now is basically a lost effort, as it will have to be re-implemented. 
So either we implement it quickly now as a "tactical" solution, with the 
backfill process managed separately from the scheduler, with limited effort 
and reusing code that others developed (see @kimyen), or we wait until the 
task-based scheduler becomes a reality and reconsider it then, IMHO.
   
   I understand it's important for you to run backfill, but the 
"afterthought" for me is that this is something you have to trigger and 
oversee manually anyway, and it is usually managed and run by a very small 
number of users who have special permissions and access, not something needed 
by all the people who write and observe the DAGs. The audience here is far 
smaller, and this is yet another justification that the CLI is "good enough" 
for now, I think.
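   
   For illustration, a CLI-driven backfill of the kind discussed here could be 
wrapped roughly as follows. This is only a sketch under stated assumptions: 
`build_backfill_command` and `example_dag` are hypothetical names, and the 
flags used are the ones I understand the Airflow 2 `airflow dags backfill` 
command to accept (`--start-date`, `--end-date`, `--task-regex`, `--dry-run`). 
It only builds the command; an operator with the right access would still 
have to run and watch it manually.

```python
import shlex
from datetime import date
from typing import List, Optional


def build_backfill_command(
    dag_id: str,
    start: date,
    end: date,
    task_regex: Optional[str] = None,
    dry_run: bool = False,
) -> List[str]:
    """Build (but do not run) an `airflow dags backfill` invocation.

    Hypothetical helper for illustration; it is not part of Airflow.
    """
    if start > end:
        raise ValueError("start date must not be after end date")

    cmd = [
        "airflow", "dags", "backfill",
        "--start-date", start.isoformat(),
        "--end-date", end.isoformat(),
    ]
    if task_regex:
        # Restrict the backfill to tasks whose task_id matches the regex.
        cmd += ["--task-regex", task_regex]
    if dry_run:
        # Ask the CLI for a dry run instead of a real backfill.
        cmd.append("--dry-run")
    cmd.append(dag_id)  # The DAG id is the positional argument.
    return cmd


if __name__ == "__main__":
    # "example_dag" is a placeholder DAG id.
    command = build_backfill_command(
        "example_dag", date(2021, 9, 1), date(2021, 9, 7), dry_run=True
    )
    # Print the command so an operator can review it before running it,
    # e.g. by passing the list to subprocess.run(command, check=True).
    print(shlex.join(command))
```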
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

