mpeteuil opened a new issue #14969:
URL: https://github.com/apache/airflow/issues/14969


   **Apache Airflow version**: 1.8 - 2.0.1 (tested against 1.10.4, 1.10.15, 
2.0.1)
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl 
version`): N/A
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: 
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**: Python 2.7.16, 3.7.6 (I don't think this is a factor)
   
   **What happened**:
   
   There is an issue with the scheduling of DAGs that use a `timedelta` object 
as the DAG `schedule_interval` argument while also having `catchup` set to 
`False`. When a DAG meets both criteria, turning it on causes the scheduler to 
ignore the time component of the `start_date` and run the DAG immediately, 
anchored to the current time.
   
   This was previously reported in 
[[AIRFLOW-1156]](https://issues.apache.org/jira/browse/AIRFLOW-1156), which was 
closed by https://github.com/apache/airflow/pull/8776; that PR fixed the 
duplicate dag-run problem also mentioned in that issue.
   
   **What you expected to happen**:
   
   I expect it to behave the same as a DAG using a cron expression for the 
`schedule_interval` under otherwise same conditions (i.e. `catchup` still set 
to `False`).
   
   I believe this is a result of how [`Dag#following_schedule` and 
`Dag#previous_schedule` are 
implemented](https://github.com/apache/airflow/blob/1.10.15/airflow/models/dag.py#L409-L463).
 I traced through the `SchedulerJob#create_dag_run` method and believe the 
problem stems from how those `Dag` methods are used there.
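   The asymmetry can be sketched with plain `datetime` arithmetic (a minimal 
illustration of the behavior, not Airflow's actual code; the "enabled at" 
timestamp is hypothetical):

   ```python
   import datetime as dt

   now = dt.datetime(2021, 3, 23, 17, 54)   # hypothetical moment the DAG is enabled
   delta = dt.timedelta(days=1)

   # timedelta path: following_schedule reduces to "dttm + delta", so with
   # catchup=False the schedule anchors to "now" and the start_date's 11:10 is lost.
   timedelta_next = now + delta             # time-of-day stays 17:54

   # cron path ("10 11 * * *"): the next run snaps to the expression's fields,
   # so the 11:10 time-of-day survives regardless of when the DAG is enabled.
   cron_next = now.replace(hour=11, minute=10, second=0, microsecond=0)
   if cron_next <= now:                     # already past 11:10 today
       cron_next += dt.timedelta(days=1)
   ```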
   
   **How to reproduce it**:
   
   Create two DAGs with `catchup` set to `False` that are identical except 
that one uses a `timedelta` object as the `schedule_interval` argument and the 
other uses a cron expression. Set a `start_date` sometime in the past, then 
turn them both on. You should see that the one with a `timedelta` 
`schedule_interval` has disregarded the time part of the `start_date` and used 
the current time at execution as the time part of the `execution_date`, while 
the cron version has used the time from the cron expression.
   
   Example DAG:
   
   ```py
   import datetime as dt
   
   from airflow import DAG
   from airflow.operators.dummy_operator import DummyOperator
   
   dag_params = {
       'dag_id': 'schedule_interval_timedelta_bug_example',
        'default_args': {
           'owner': 'Administrator',
           'depends_on_past': False,
           'retries': 0,
           'email': ['[email protected]']
       },
       'schedule_interval': dt.timedelta(days=1),
       'start_date': dt.datetime(year=2021, month=1, day=1, hour=11, minute=10),
       'catchup': False
   }
   
   with DAG(**dag_params) as dag:
       DummyOperator(task_id='start') >> DummyOperator(task_id='end')
   ```
   
   For the cron version just change the `schedule_interval` to `10 11 * * *`.
   
   Here's a screenshot of this happening on 2.0.1 (although the bug exists in 
much older versions as well):
   
   ![Screen Shot 2021-03-23 at 5 54 53 
PM](https://user-images.githubusercontent.com/459756/112223902-eb452d00-8c00-11eb-9380-60ccc8577ccb.png)
   
   
   **Anything else we need to know**:
   
   I've only tested this on DAGs with a 1-day schedule interval, but testing 
with other intervals could reveal whether this also affects finer-grained 
schedules or is isolated to daily runs. Based on what I saw in 
`Dag#following_schedule` and `Dag#previous_schedule`, I suspect this would be a 
problem with shorter intervals as well.
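   To illustrate that suspicion, here's a hedged stdlib sketch (hypothetical 
timestamps, not Airflow code) showing the same drift with an hourly `timedelta`:

   ```python
   import datetime as dt

   start = dt.datetime(2021, 1, 1, 11, 10)      # start_date with a :10 minute offset
   enabled_at = dt.datetime(2021, 3, 23, 17, 54)
   delta = dt.timedelta(hours=1)

   # "now + delta" anchoring loses the :10 minute offset for hourly runs too,
   # mirroring what the daily example shows for the hour component.
   drifted_next = enabled_at + delta            # minute is 54, not 10

   # A start_date-aligned schedule would instead step forward from start_date:
   aligned = start
   while aligned + delta <= enabled_at:
       aligned += delta
   # "aligned" is now the last on-schedule tick (:10 past the hour) at or
   # before enabled_at.
   ```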
   
   Tested with the `SequentialExecutor` and `StandardTaskRunner`, which I don't 
_think_ are a factor, but it's certainly possible.
   
   Happy to provide other details or help in any way.

