[
https://issues.apache.org/jira/browse/AIRFLOW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660420#comment-16660420
]
Daniel Lamblin commented on AIRFLOW-3244:
-----------------------------------------
For context from a Slack conversation: Alberto has a weekly DAG and wants its
DAG runs dated at the start of the week, Monday. However, the Monday-to-Monday
(midnight) data needs 2 days of settling time, so he must schedule the DAG for
Wednesday (midnight) at the earliest. He then uses the date range
{{execution_date}} -2d to {{execution_date}} +5d when processing the data. As a
result his UI shows runs dated on Wednesdays, although the data each run
processes is actually for Monday to Monday.
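As a rough illustration of that date arithmetic (plain Python, not Airflow
code; the function name {{data_window}} and the sample date are made up for
this sketch):

```python
from datetime import datetime, timedelta


def data_window(execution_date):
    """Map a Wednesday execution_date to the Monday-to-Monday range it covers."""
    start = execution_date - timedelta(days=2)  # the Monday before that Wednesday
    end = execution_date + timedelta(days=5)    # the following Monday
    return start, end


# Hypothetical run whose execution_date falls on Wednesday 2018-10-17.
start, end = data_window(datetime(2018, 10, 17))
print(start.date(), end.date())
```

The UI would still label this run with the Wednesday, which is the mismatch
described above.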
To work around this, he tried a schedule set to Monday, and used a
{{TimeDeltaSensor}} of 2 days at the start of his DAG. This ties up a worker
process for the whole two days, which is inefficient, particularly if he has
hundreds of such reports to define.
He proposes instead, as a concept for a feature, that a DAG specification could
take an {{execution_date_timedelta}} (or something better named). The DAG would
still be scheduled exactly as its {{schedule_interval}} and {{start_date}}
dictate, but the {{execution_date}} of each {{dagrun}} instance, as passed into
the DAG's tasks and recorded as the run date in the UI, would have the
{{timedelta}} applied to it.
This way he could specify a weekly schedule of Wednesday at midnight, and with
a -2d time delta his DAG could query from {{execution_date}} to
{{next_execution_date}} or {{ds}} to {{next_ds}} and have that mean Monday to
Monday. He could also click on a tree view run that is 3 Mondays ago instead of
3 Wednesdays ago.
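A minimal sketch of what the tasks might see under this proposal, in plain
Python. Nothing here exists in Airflow: {{EXECUTION_DATE_TIMEDELTA}} stands in
for the proposed DAG argument, and {{offset_context}} is a hypothetical helper
that only mimics the {{execution_date}}, {{next_execution_date}}, {{ds}} and
{{next_ds}} values.

```python
from datetime import datetime, timedelta

SCHEDULE_STEP = timedelta(weeks=1)             # weekly, Wednesday at midnight
EXECUTION_DATE_TIMEDELTA = timedelta(days=-2)  # hypothetical proposed setting


def offset_context(scheduled_date):
    """Build the date values a task would see once the offset is applied."""
    execution_date = scheduled_date + EXECUTION_DATE_TIMEDELTA
    next_execution_date = execution_date + SCHEDULE_STEP
    return {
        "execution_date": execution_date,
        "next_execution_date": next_execution_date,
        "ds": execution_date.strftime("%Y-%m-%d"),
        "next_ds": next_execution_date.strftime("%Y-%m-%d"),
    }


# The schedule slot is a Wednesday; the run is labelled with the Monday.
ctx = offset_context(datetime(2018, 10, 17))
print(ctx["ds"], ctx["next_ds"])
```

With these assumed values a query from {{ds}} to {{next_ds}} covers Monday to
Monday without any date arithmetic inside the task.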
Personally I think that separating the scheduled start time from the execution
date more explicitly would help keep the two concepts distinct and promote
more idempotent use of DAGs. I think this means the scheduler needs to do
some extra math when checking whether the prior runs for a schedule exist, and
when creating a DAG run.
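To give a feel for that extra math, here is a deliberately simplified sketch.
It is not Airflow's scheduler code; {{next_run_to_create}} and its parameters
are hypothetical, and it assumes a fixed-length schedule step and that a run is
created once its schedule period has ended.

```python
from datetime import datetime, timedelta


def next_run_to_create(existing_execution_dates, start_date, step, offset, now):
    """Return the offset execution date of the next missing run, or None."""
    existing = set(existing_execution_dates)
    slot = start_date
    # A slot is due only after its whole schedule period has passed.
    while slot + step <= now:
        labelled = slot + offset  # date stored for the run and shown in the UI
        if labelled not in existing:
            return labelled
        slot += step
    return None


# Weekly Wednesday slots with a -2 day offset; one run already recorded.
missing = next_run_to_create(
    existing_execution_dates=[datetime(2018, 10, 1)],
    start_date=datetime(2018, 10, 3),
    step=timedelta(weeks=1),
    offset=timedelta(days=-2),
    now=datetime(2018, 10, 18),
)
print(missing)
```

The point is only that the lookup of prior runs and the creation of a new run
both have to translate between the schedule slot and the offset date.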
> Introduce offset on the execution date for data assessment
> ----------------------------------------------------------
>
> Key: AIRFLOW-3244
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3244
> Project: Apache Airflow
> Issue Type: Improvement
> Components: DAG
> Affects Versions: 1.10.0
> Reporter: Alberto Anceschi
> Priority: Minor
> Labels: features, request
>
> Hi everyone,
>
> I'm trying to port my current cronjobs into Airflow. Let's consider a real
> case scenario: I've to send every week a report and through the pipeline data
> from Google Analytics needs to be collected, so I need 2 days before running
> the DAG (data assessment). Week starts on Monday and ends on Sunday, so I
> need the DAG to run on Wednesday at Midnight UTC.
> In order to see on the Airflow dashboard start_date/execution_date that make
> sense to me, for now I've used a TimeDeltaSensor that adds that 2 day offset
> I need, but this is not its purpose. I also use Celery executor, so its
> workers keep polling during those 2 days, making them unavailable for other
> DAGs.
> I think that the assumption that at the end of the period scheduled data are
> ready is not correct and at the same time it's much more intuitive seeing on
> the dashboard Monday execution dates instead of Tuesday ones.
>
> What do you think about this request? Any suggestion? Thank you,
>
> Alberto
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)