[ 
https://issues.apache.org/jira/browse/AIRFLOW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660420#comment-16660420
 ] 

Daniel Lamblin commented on AIRFLOW-3244:
-----------------------------------------

For context from a Slack conversation: Alberto has a weekly DAG and wants its 
DAG runs dated at the start of the week, Monday. However, the data for a Monday 
to Monday (midnight) week needs 2 days of settling time, so he must schedule 
the DAG for Wednesday (midnight) at the earliest. He then uses the date range 
{{execution_date}} -2d to {{execution_date}} +5d when processing the data. As a 
result his UI shows runs dated on Wednesdays, although the data each run 
processes actually covers Monday to Monday.
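
As a rough illustration of that date arithmetic (plain Python, not Airflow 
code; {{report_window}} is a name made up for this sketch and the date is just 
an example Wednesday):

```python
from datetime import datetime, timedelta


def report_window(execution_date):
    """Monday-to-Monday window for a run whose execution_date is a Wednesday."""
    start = execution_date - timedelta(days=2)  # back to the Monday before
    end = execution_date + timedelta(days=5)    # forward to the next Monday
    return start, end


start, end = report_window(datetime(2018, 10, 17))  # a Wednesday
print(start.date(), end.date())
```

The run is labelled with the Wednesday, while the window it processes starts 
on the Monday two days earlier.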

To work around this, he tried a schedule set to Monday, and used a 
{{TimeDeltaSensor}} of 2 days at the start of his DAG. This ties up a worker 
process for the whole two days, and isn't efficient, particularly if he has 
100s of such reports to define.

He proposes instead, as a concept for a feature, that a DAG specification could 
take an {{execution_date_timedelta}} (or something better named). The DAG would 
still be scheduled exactly as its {{schedule_interval}} and {{start_date}} 
dictate, but the {{dagrun}} instance's {{execution_date}} that is passed into 
the DAG's tasks and recorded as the run date in the UI would have a 
{{timedelta}} applied to it.

This way he could specify a weekly schedule of Wednesday at midnight, and with 
a -2d time delta his DAG could query from {{execution_date}} to 
{{next_execution_date}} or {{ds}} to {{next_ds}} and have that mean Monday to 
Monday. He could also click on a tree view run that is 3 Mondays ago instead of 
3 Wednesdays ago.
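
A minimal sketch of the idea, assuming the offset is simply added to the 
scheduled date. This is plain Python and not an existing Airflow API: 
{{shifted_run_dates}} and its parameters are hypothetical names used only for 
illustration.

```python
from datetime import datetime, timedelta


def shifted_run_dates(scheduled_date, schedule_interval, execution_date_timedelta):
    """Return the (execution_date, next_execution_date) a run would report
    if the proposed offset were applied to its scheduled date."""
    execution_date = scheduled_date + execution_date_timedelta
    next_execution_date = execution_date + schedule_interval
    return execution_date, next_execution_date


# Weekly schedule on Wednesday at midnight, shifted back by two days.
scheduled = datetime(2018, 10, 17)  # a Wednesday
execution_date, next_execution_date = shifted_run_dates(
    scheduled, timedelta(weeks=1), timedelta(days=-2)
)

# String forms analogous to the ds and next_ds template values.
ds = execution_date.strftime("%Y-%m-%d")
next_ds = next_execution_date.strftime("%Y-%m-%d")
print(ds, next_ds)
```

With the offset applied, both ends of the query window land on Mondays, and 
the scheduling cadence itself stays on Wednesdays.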

Personally I think separating the scheduled start time from the execution date 
more clearly would help keep the two concepts distinct and promote more 
idempotent use of DAGs. I think this means the scheduler needs to do some extra 
math when checking whether the prior runs for a schedule exist, and when 
creating a DAG run.

> Introduce offset on the execution date for data assessment
> ----------------------------------------------------------
>
>                 Key: AIRFLOW-3244
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3244
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: DAG
>    Affects Versions: 1.10.0
>            Reporter: Alberto Anceschi
>            Priority: Minor
>              Labels: features, request
>
> Hi everyone,
>  
> I'm trying to port my current cronjobs into Airflow. Let's consider a real 
> case scenario: I've to send every week a report and through the pipeline data 
> from Google Analytics needs to be collected, so I need 2 days before running 
> the DAG (data assessment). Week starts on Monday and ends on Sunday, so I 
> need the DAG to run on Wednesday at Midnight UTC.
> In order to see on the Airflow dashboard start_date/execution_date that make 
> sense to me, for now I've used a TimeDeltaSensor that adds that 2 day offset 
> I need, but this is not its purpose. I also use Celery executor, so its 
> workers keep polling during those 2 days, making them unavailable for other 
> DAGs.
> I think that the assumption that at the end of the period scheduled data are 
> ready is not correct and at the same time it's much more intuitive seeing on 
> the dashboard Monday execution dates instead of Tuesday ones.
>  
> What do you think about this request? Any suggestion? Thank you,
>  
> Alberto
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
