[ https://issues.apache.org/jira/browse/AIRFLOW-5538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kasim updated AIRFLOW-5538:
---------------------------
    Description: 
From [https://airflow.apache.org/scheduler.html] :

> Note that if you run a DAG on a schedule_interval of one day, the run
> stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In
> other words, the job instance is started once the period it covers has
> ended.

This behavior is very painful.

For example, I have an ETL job that runs every day with a schedule_interval of
`0 1 * * *`, so the run stamped 2019-09-22 01:00:00 is actually triggered at
2019-09-23 01:00:00. My ETL processes all data up to the run's start time,
i.e. the data range is (history, 2019-09-23 00:00:00), and I can't use
`datetime.now()` because runs would then not be reproducible. This forces me to
add one day to execution_date:
```python
etl_end_time = "{{ (execution_date + macros.timedelta(days=1)).strftime('%Y-%m-%d 00:00:00') }}"
```
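To make the offset concrete, here is a minimal stand-alone sketch of what that template evaluates to, using plain `datetime` in place of Airflow's macros:

```python
from datetime import datetime, timedelta

# The daily run stamped 2019-09-22 is actually triggered on 2019-09-23,
# so the desired upper bound of the data range is execution_date + 1 day,
# truncated to midnight by the strftime format string.
execution_date = datetime(2019, 9, 22, 1, 0, 0)
etl_end_time = (execution_date + timedelta(days=1)).strftime('%Y-%m-%d 00:00:00')
print(etl_end_time)  # 2019-09-23 00:00:00
```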

However, when I need to run a job with schedule_interval `45 2,3,4,5,6 * * *`,
the `2019-09-22 06:45:00` run is triggered at `2019-09-23 02:45:00`, the day
after its execution_date. Here, instead of adding a day, I had to change the
schedule_interval to `45 2,3,4,5,6,7 * * *` and put a dummy operator on the
last run. In this situation you must not add one day to execution_date, which
means you have to define two different `etl_end_time` expressions to represent
the same date in jobs with different schedule_intervals.
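The size of the mismatch is easy to check; a small sketch (plain `datetime` arithmetic, with the cron tick written out by hand) of the gap between execution_date and the actual trigger time for the cron above:

```python
from datetime import datetime, timedelta

# For schedule_interval '45 2,3,4,5,6 * * *', the tick after 06:45 is
# 02:45 on the next day, so the run stamped 2019-09-22 06:45 only
# starts 20 hours later -- while the runs stamped 02:45..05:45 each
# start after only 1 hour.
execution_date = datetime(2019, 9, 22, 6, 45)
trigger_time = datetime(2019, 9, 23, 2, 45)  # next cron tick, by hand
print(trigger_time - execution_date)  # 20:00:00
```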

All of this is very awkward for me; adding a config option or built-in method
to make execution_date equal to start_date would be very nice.
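One possible workaround (an untested sketch; it assumes the `next_execution_date` template variable exposed by recent Airflow 1.10.x releases) is to template the end bound on the start of the following interval, which is roughly the moment the run is triggered, for any schedule_interval:

```python
# Sketch: next_execution_date is the start of the next schedule interval,
# i.e. the end of the period this run covers, regardless of the cron.
etl_end_time = "{{ next_execution_date.strftime('%Y-%m-%d %H:%M:%S') }}"
```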


> Add a flag to make scheduling trigger on start_date instead of execution_date 
> (make execution_date equal to start_date)
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5538
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5538
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: DagRun
>    Affects Versions: 1.10.5
>            Reporter: kasim
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
