potiuk commented on issue #18317: URL: https://github.com/apache/airflow/issues/18317#issuecomment-924093846
> I don't see the need for a dedicated backfill process to run, the scheduler could take care of that I believe, if tasks and dags are idempotent they don't even need to care about execution order, if order matters I guess `depends_on_past` should be set on the tasks(?) and the scheduler should handle it(?)

Not currently, as far as I understand how the Scheduler works. The Scheduler is currently DAG-based, not individual-task-based. It looks at the DAGs and task dependencies for the "future" runs, then schedules and executes them. As I understand it, there is no way to get it to start, monitor, send for execution and oversee to completion selected tasks from a selected DAG in the past. The current architecture is that the scheduler only looks ahead (possibly starting from the past if the DAG has never been run), determines which tasks should run next for each DAG, and sends them to executors to execute. There is no past scheduling for selected tasks. The database queries, the scheduler loop, and the selection of which tasks to run next and when are heavily optimized for that use case, and you would not be able to use them for re-running tasks without pretty much a complete overhaul.

But maybe I am wrong and do not understand well enough how the scheduler works. I am happy to be corrected if I am wrong here, and would love to hear from others who understand the scheduler better.

From what I know this will likely change in the future, when the Scheduler becomes more "task based" (this is planned and will likely be implemented in 2.3 or 2.4). Once this is done, the behaviour you describe will be possible, but it is quite a big effort that involves changing the behaviour of the scheduler as well as allowing DAG versioning. This is yet another reason why implementing backfill now is basically a lost effort, as it will have to be re-implemented.
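
To make the distinction concrete, here is a toy model of the two approaches. It is only a sketch under simplifying assumptions and is not Airflow's actual code: `ToyDag`, `next_run`, `scheduler_tick` and `backfill` are hypothetical names invented for illustration. The loop only ever advances each DAG to its next due run and queues all of that run's tasks, while the separate `backfill` helper walks an arbitrary past date range for selected tasks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterator, List, Optional, Tuple

# (dag_id, logical_date, task_id): a unit of work handed to an executor.
QueuedTask = Tuple[str, datetime, str]


@dataclass
class ToyDag:
    """Hypothetical stand-in for a DAG; not Airflow's DAG class."""
    dag_id: str
    start_date: datetime
    interval: timedelta
    task_ids: List[str]
    last_run: Optional[datetime] = None  # logical date of the latest run


def next_run(dag: ToyDag, now: datetime) -> Optional[datetime]:
    """Forward-looking step: return the next logical date that is due, if any."""
    if dag.last_run is None:
        candidate = dag.start_date  # never run: start from the beginning
    else:
        candidate = dag.last_run + dag.interval
    # In this toy model a run is due once its whole interval has elapsed.
    return candidate if candidate + dag.interval <= now else None


def scheduler_tick(dags: List[ToyDag], now: datetime) -> List[QueuedTask]:
    """One loop iteration: advance each DAG by at most one run, queue all its tasks."""
    queued: List[QueuedTask] = []
    for dag in dags:
        logical_date = next_run(dag, now)
        if logical_date is not None:
            dag.last_run = logical_date
            queued.extend((dag.dag_id, logical_date, t) for t in dag.task_ids)
    return queued


def backfill(dag: ToyDag, start: datetime, end: datetime,
             task_ids: List[str]) -> Iterator[QueuedTask]:
    """Separate process: walk a past date range for selected tasks only."""
    logical_date = start
    while logical_date <= end:
        for task_id in task_ids:
            yield (dag.dag_id, logical_date, task_id)
        logical_date += dag.interval
```

Nothing in `scheduler_tick` accepts a past date range or a task selection, so in this model re-running selected past tasks has to happen outside the loop.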

So, IMHO, either we implement it quickly now as a "tactical" solution, with the backfill process managed separately from the scheduler, with limited effort and reusing code that others developed (see @kimyen), or we wait until the task-based scheduler becomes a reality and reconsider it then.

I understand it's important for you to run backfill, but the "afterthought" for me is that this is something you have to trigger and oversee manually anyway. It is usually managed and run by a very small number of users who have special permissions and access, and it is not something needed by all the people who write and observe the DAGs. The audience here is far smaller, and this is yet another justification that the CLI is "good enough" for now, I think.
