[ 
https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264547#comment-15264547
 ] 

Bolke de Bruin commented on AIRFLOW-20:
---------------------------------------

Max, I understand your concerns. I'm working with Jeremiah to see if I can 
remove the need for the dag run id in taskinstances. I would be happy to, as it 
would simplify the change. Rigorous testing needs to take place before making 
it part of a release; the current unit tests do not cover enough. 

Documentation is forthcoming but I will only assemble it when all assumptions 
have been worked out and discussions finished. 

Please also note that we (Jeremiah and I) are seeing this as the first step 
towards his bigger PR, sort of paving the way. Please regard it in that 
context and not just as "making it easier for schedules that move". 

In that regard, the more fundamental questions I have also asked on the list 
have not been answered yet: what do we consider the unit of work? Is it a 
DagRun or is it a taskinstance? We talk about dags, but we can run tasks 
without a DagRun. 

I consider taskinstances part of a DagRun. They should not be able to look 
beyond the borders of the containing DagRun. If they need information from 
outside it, they should query the DagRun for it. If you do this, backfills and 
so on become much easier to handle. 
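
To make the containment idea concrete, here is a minimal sketch. It is not 
Airflow's actual model code: the class layout, the `add` helper, and the 
`previous_state` / `previous_state_of` methods are hypothetical names chosen 
for illustration. It only shows a taskinstance asking its DagRun about the 
previous run instead of looking it up by date itself. 

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class TaskInstance:
    task_id: str
    state: str = "none"
    # Back-reference to the containing run (hypothetical; set by DagRun.add).
    dag_run: Optional["DagRun"] = field(default=None, repr=False, compare=False)

    def previous_state(self) -> Optional[str]:
        # The task instance never inspects other runs itself;
        # it asks the DagRun that contains it.
        if self.dag_run is None:
            return None  # running without a DagRun is still allowed
        return self.dag_run.previous_state_of(self.task_id)


@dataclass
class DagRun:
    execution_date: datetime
    previous: Optional["DagRun"] = None  # link to the predecessor run
    task_instances: Dict[str, TaskInstance] = field(default_factory=dict)

    def add(self, ti: TaskInstance) -> TaskInstance:
        # Attach a task instance to this run.
        ti.dag_run = self
        self.task_instances[ti.task_id] = ti
        return ti

    def previous_state_of(self, task_id: str) -> Optional[str]:
        if self.previous is None:
            return None  # this really is the first run
        prev_ti = self.previous.task_instances.get(task_id)
        return prev_ti.state if prev_ti else None
```

With this shape a depends_on_past style check becomes a question to the 
DagRun, and a taskinstance in the first run can tell it is the first because 
its run has no predecessor. 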

> Align start_date with the schedule_interval
> -------------------------------------------
>
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Bolke de Bruin
>              Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counterintuitive and
> leads to a lot of questions and issue creation, although it is in the
> documentation. If we are able to fix this with little or no consequence for
> current setups, that should be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the
> core issue.
> If you consider a DAG a description of cohesion between work items (in OOP
> Java terms, a class), then a DagRun is the instantiation of a DAG in time (in
> OOP Java terms, an instance).
> Tasks are then the description of a work item and a TaskInstance the
> instantiation of the Task in time.
> In my opinion issues pop up due to the current paradigm of considering the
> TaskInstance the smallest unit of work and asking it to maintain its own
> state in relation to other TaskInstances in a DagRun and in a previous DagRun
> of which it has no (real) perception. Tasks are instantiated by a cartesian
> product with the dates of DagRuns instead of the DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved
> while maintaining flexibility to run tasks without a DagRun. This would help
> with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or a point in time when to
>    start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depend_on_past could be removed as a task will now know
>    if it is really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun allowing it to connect to its
>    predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the
>    DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date is on
>    the interval, in which case start_date is the first run date
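
The first-run rule in item 3 of the second list can be sketched as follows. 
This is an illustration only, not the code from PR-1431. It assumes a fixed 
`timedelta` interval and assumes that "on the interval" means the start_date 
is a whole number of intervals away from an anchor; the `EPOCH` anchor and 
both function names are hypothetical. 

```python
from datetime import datetime, timedelta

# Assumed alignment anchor; the issue does not say what "on the interval"
# is measured against.
EPOCH = datetime(1970, 1, 1)


def is_on_interval(start_date: datetime, interval: timedelta) -> bool:
    # True when start_date is a whole number of intervals past the anchor.
    return (start_date - EPOCH) % interval == timedelta(0)


def first_run_date(start_date: datetime, interval: timedelta) -> datetime:
    # Rule from the issue: start_date itself if it is on the interval,
    # otherwise start_date + interval.
    if is_on_interval(start_date, interval):
        return start_date
    return start_date + interval
```

For a daily interval, a start_date at midnight is returned unchanged, while 
a start_date of 06:30 yields a first run one day later at 06:30. A cron-based 
schedule would need a different alignment test. 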


