Bolke de Bruin updated AIRFLOW-20:
    Summary: Improving the scheduler by make dag runs more coherent  (was: 
Align start_date with the schedule_interval)

> Improving the scheduler by make dag runs more coherent
> ------------------------------------------------------
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Bolke de Bruin
>              Labels: backfill, database, scheduler
> The need to align the start_date with the interval is counterintuitive
> and leads to a lot of questions and issue creation, even though it is
> documented. If we can fix this with little or no consequence for current
> setups, that should be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the
> core issue.
> If you consider a DAG a description of cohesion between work items (in OOP
> Java terms, a class), then a DagRun is the instantiation of a DAG in time
> (in OOP Java terms, an instance). Tasks are then the description of a work
> item, and a TaskInstance is the instantiation of a Task in time.
> In my opinion, issues pop up due to the current paradigm of considering the
> TaskInstance the smallest unit of work and asking it to maintain its own
> state in relation to other TaskInstances in a DagRun and in a previous
> DagRun, of which it has no (real) perception. Tasks are instantiated by a
> Cartesian product with the dates of the DagRuns instead of the DagRuns
> themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved
> while maintaining the flexibility to run tasks without a DagRun. This would
> help with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or as a point in time
>    from which to start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depend_on_past could be removed, as a task will now
>    know if it is really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun, allowing it to connect to its
>    predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the
>    DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date is
>    on the interval, in which case start_date is the first run date
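
The three changes listed for PR-1431 could be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the class shapes, the `first_run_date` and `has_predecessor_run` helpers, and the `anchor` parameter are all hypothetical. It assumes the interval is a fixed `timedelta` and that "on the interval" means start_date falls on a grid counted from an assumed anchor (midnight, 1970-01-01). The unaligned case follows the literal wording (start_date + interval); a real scheduler might instead snap to the next grid point.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed reference point for deciding whether a date is "on the interval".
EPOCH = datetime(1970, 1, 1)


def first_run_date(start_date: datetime, interval: timedelta,
                   anchor: datetime = EPOCH) -> datetime:
    """Hypothetical helper: pick the first run date from start_date."""
    aligned = (start_date - anchor) % interval == timedelta(0)
    # Aligned: start_date itself is the first run date.
    # Not aligned: literal reading of the issue, start_date + interval.
    return start_date if aligned else start_date + interval


@dataclass
class DagRun:
    dag_id: str
    execution_date: datetime
    # Link to the predecessor run; None means this is the first run.
    previous: Optional["DagRun"] = None


@dataclass
class TaskInstance:
    task_id: str
    # Stands in for the dag_run_id column; None keeps the flexibility
    # of running a task without a DagRun.
    dag_run: Optional[DagRun] = None

    def has_predecessor_run(self) -> bool:
        """True only if the owning DagRun is known and is not the first."""
        return self.dag_run is not None and self.dag_run.previous is not None


daily = timedelta(days=1)
first = DagRun("example", first_run_date(datetime(2016, 1, 1), daily))
second = DagRun("example", first.execution_date + daily, previous=first)
ti_first = TaskInstance("load", dag_run=first)
ti_second = TaskInstance("load", dag_run=second)
```

With the predecessor link in place, a task that depends on past runs can ask its DagRun whether a previous run exists, rather than relying on a separate flag for the first run.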
