gor-obr opened a new issue, #7999:
URL: https://github.com/apache/airflow/issues/7999

   **Apache Airflow version**: 1.10.9, 2.0.0dev
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**:
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**:
   **What happened**:
   
   If DAG's cron contains "non trivial" hours section, Scheduler will not 
schedule DAG correctly immediately after DST switch. In this case, 
"non-trivial" means either having an interval (`0 7-8 * * *`) or multiple 
values (`0 7,9 * * *`).
   
   If DST occurs in morning (2-3 AM of March 8th), following following 
executions will occur:
   
   ```
   0 7-8 * * *
   "2020-03-07T07:00:00-08:00"
   "2020-03-07T08:00:00-08:00" # DST switch after this run
   "2020-03-08T08:00:00-07:00" # 8 AM instead of 7 AM
   
   0 7,9 * * *
   
   "2020-03-07T07:00:00-08:00"
   "2020-03-07T09:00:00-08:00"
   "2020-03-08T07:00:00-07:00" # DST switch after this run
   "2020-03-08T08:00:00-07:00" # 8 AM instead of 7 AM
   ```
   
   Cause for this is the method `is_fixed_time_schedule` in `dag.py`, which 
tests whether cron is "fixed" (i.e. "execute exactly at this time") or 
"relative" (i.e. "execute on each n hours"). Method relies on a quite crude 
test, it calculates two subsequent times and checks whether both hours and 
minutes of them are the same:
   
   ```python
           now = datetime.now()
           cron = croniter(self._schedule_interval, now)
   
           start = cron.get_next(datetime)
           cron_next = cron.get_next(datetime)
   
           if cron_next.minute == start.minute and cron_next.hour == start.hour:
               return True
   
           return False
   ```
   
   This is not satisfied in case of above examples (it is executed at two 
different hours during each day, so hours in subsequent executions are never 
the same).
   
   Based on this, method `following_schedule` in `dag.py` thinks that this DAG 
should be executed "on each n hours", and explicitly works around DST. It 
calculates the amount of time which needs to pass until next run, and adds that 
amount of time to the previous run, thus ignoring DST.
   
   ```python
               # We assume that DST transitions happen on the minute/hour
               if not self.is_fixed_time_schedule():
                   # relative offset (eg. every 5 minutes)
                   delta = cron.get_next(datetime) - naive
                   following = 
dttm.in_timezone(self.timezone).add_timedelta(delta)
               else:
                   # absolute (e.g. 3 AM)
                   naive = cron.get_next(datetime)
                   tz = pendulum.timezone(self.timezone.name)
                   following = timezone.make_aware(naive, tz)
   ```
   
   Note that only the first execution (or first few executions) will be 
affected. After them, the calculation will stabilize and will work correctly 
going forward.
   
   This is describes the same issue as 
https://issues.apache.org/jira/browse/AIRFLOW-7039
   
   
   **What you expected to happen**:
   
   <!-- What do you think went wrong? -->
   
   **How to reproduce it**:
   
   A failing unit test which demonstrates the issue is pushed here:
   
   
https://github.com/gor-obr/airflow/commit/0d021a1304223711ba3ece980bcc1b18df4adcd0
   
   **Anything else we need to know**:
   
   The immediate issue can be fixed in several ways:
   
   * Alter the logic of `is_fixed_time_schedule` so that it parses the cron 
string and checks whether it contains `/` in minute or hour sections.
   
   * Extend the `Dag` class so that its constructor accepts an optional `bool` 
argument `is_fixed_time_schedule`. If this argument is provided, it will be 
used instead of the logic in `is_fixed_time_schedule` method. If argument is 
not provided, the behavior will continue to be as is now. This way user can 
work around the issue, while existing DAGs still working as they used to.
   
   However, both of these fixes are a bit hacky in their nature and I'd say 
that they don't address the underlying issue.
   
   Problem is in the fact that cron actually doesn't recognize "relative" 
schedules. In cron, all schedules are absolute, and say `0 */6 * * *` is just a 
syntactic  sugar for `0 0,6,12,18 * * *`. In cron these two schedules are 
equivalent, and in Airflow this syntactic sugar is overloaded with semantic 
meaning (first is assumed to be a "relative" schedule and second is assumed to 
be "fixed" schedule).
   
   This is highlighted by the fact that Airflow uses croniter to calculate 
schedule, but then (in higher level of abstraction) works around this library 
and fiddles with calculated results (there is even a vague comment: ` # we 
don't want to rely on the transitions created by croniter as they are not 
always correct`).
   
   Maybe the cleanest solution would be to accept that all cron schedules are 
in their nature absolute, and to extract the feature of relative scheduling 
from the cron. Leave cron to work with absolute schedules as it was designed to 
work, and introduce different syntax to allow for relative scheduling feature.
   
   The obvious downside of this approach is that it would change the behavior 
of already existing DAGs. Some good news is that in most cases absolute and 
relative scheduling actually have the same result (they differ mostly in 
handling the DST).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to