dstandish commented on PR #35887:
URL: https://github.com/apache/airflow/pull/35887#issuecomment-1843549691
Well, let's assess and articulate what the actual behavior change is. We'll
need that anyway, however we categorize it.
I took a look at your code and refactored a bit to better understand what
the actual change is.
Here's my version
```python
from __future__ import annotations
from datetime import datetime
import pendulum
from airflow.timetables.interval import CronDataIntervalTimetable
# Zurich (Switzerland) is chosen since it is +1/+2 DST, making it a bit
easier
# to get my head around the mental timezone conversion.
# In 2023, DST entered on 26th Mar, 2am local clocks (1am UTC) were turned
# forward to 3am. DST exited on 29th Oct, 3am local clocks (1am UTC) were
# turned backward to 2am (making the 2:XX hour fold).
CET = "Europe/Zurich"
def to_local_iso(val: str):
return pendulum.parse(val).in_timezone(CET).isoformat()
def run_check(starting_run, timetable, exp_utc_iso):
print(f"when last run is\t\t\t{starting_run.in_timezone(CET)} /
{starting_run}...")
exp_local = to_local_iso(exp_utc_iso)
print(f"then next run should be\t\t{exp_local} / {exp_utc_iso}")
next_run = timetable._get_next(starting_run)
if (dt := next_run.isoformat()) and dt == exp_utc_iso:
print("and it is!")
else:
print(f"but instead it is\t\t\t{to_local_iso(dt)} / {dt}")
delta = (pendulum.parse(dt) -
pendulum.parse(exp_utc_iso)).in_minutes()
print(f"this is {abs(delta)} minutes {'ahead of' if delta < 0 else
'behind'} schedule")
print()
timetable_irregular = CronDataIntervalTimetable("0 0,3 * * *",
timezone=pendulum.timezone(CET))
starting_run1 = pendulum.datetime(2023, 3, 26, 0, tz=CET).in_timezone("UTC")
run_check(
starting_run=starting_run1,
timetable=timetable_irregular,
exp_utc_iso="2023-03-26T01:00:00+00:00",
)
starting_run2 = pendulum.datetime(2023, 10, 29, 0, tz=CET).in_timezone("UTC")
expected_next_run = (
pendulum.instance(datetime(2023, 10, 29, 3, fold=1,
tzinfo=pendulum.timezone(CET)))
.in_timezone("UTC")
.isoformat()
)
assert expected_next_run == "2023-10-29T02:00:00+00:00"
run_check(
starting_run=starting_run2,
timetable=timetable_irregular,
exp_utc_iso=expected_next_run, # runs on the second occurrence of 3am
)
```
and i think this is what the change boils down to...
before the change this is the output:
```
when last run is 2023-03-26T00:00:00+01:00 /
2023-03-25T23:00:00+00:00...
then next run should be 2023-03-26T03:00:00+02:00 /
2023-03-26T01:00:00+00:00
but instead it is 2023-03-26T04:00:00+02:00 /
2023-03-26T02:00:00+00:00
this is 60 minutes behind schedule
I
when last run is 2023-10-29T00:00:00+02:00 /
2023-10-28T22:00:00+00:00...
then next run should be 2023-10-29T03:00:00+01:00 /
2023-10-29T02:00:00+00:00
but instead it is 2023-10-29T02:00:00+01:00 /
2023-10-29T01:00:00+00:00
this is 60 minutes ahead of schedule
```
Caveat, this is all very confusing and hard to wrap one's head around but
that said, let's discusse the two cases.
In case 1, spring forward, the next run is supposed to at 3am (2 hours after
the last run, since an hour is skipped) but instead the next run is calculated
as 4am, which is simply the wrong time by any assessment. So one behavior
change is, instead of running at 4am, it will run at 3am, the correct clock
time. This to me seems like a bugfix.
Side note: we might also want to verify the behavior for a `0,2` schedule,
since i think in spring ahead, there is never a 2am.
In case 2, fall back, the next run is supposed to be the 3am. In actuality,
this is an unambiguous time reference. At fall back, the clock only strikes
3am one time. I.e. after 2:59 am, it goes to 2:00 again -- not first to 3am
and then back to 2am. Which is to say, the next run is at the wrong clock time
again -- 2am instead of 3am. So again, to me this looks like a bugfix.
The other thing to note is that we also observed this year that with fall
back, the dag would actually get stuck (if catchup=True, then forever, or until
next run is past transition otherwise). So, presuming this fixes that, it's a
bugfix on another layer.
Note, I'm not arguing to "win", just trying to further our own
understandings of what this is and how best to represent this to users. Do I
understand this correctly? Do you think I'm describing the details correctly?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]