Not in favour of a special marker because that’s essentially what start_date is for. Say somebody has a schedule_interval=timedelta(days=1) and wants their DAG to run at 00:00 without having to think of a specific start date, then they’d have to set start_date="random date and time 00:00" and catchup=False.
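A minimal sketch of the timedelta case above, in plain Python rather than Airflow's actual scheduler logic. It assumes runs fire at `anchor + n * interval`, so the time of day comes entirely from the anchor. The name `next_run_after` and the sample dates are made up for illustration.

```python
from datetime import datetime, timedelta


def next_run_after(anchor: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first `anchor + n * interval` (n >= 0) strictly after `now`.

    Simplified model for illustration, not Airflow's implementation.
    """
    if now < anchor:
        return anchor
    # timedelta // timedelta gives the number of whole intervals elapsed.
    elapsed = (now - anchor) // interval
    return anchor + (elapsed + 1) * interval


daily = timedelta(days=1)
now = datetime(2022, 5, 14, 9, 30)

# Anchor at an arbitrary past midnight: runs land on 00:00.
midnight_anchor = datetime(2022, 1, 1, 0, 0)
print(next_run_after(midnight_anchor, daily, now))  # 2022-05-15 00:00:00

# Anchor at a hypothetical "first added" time: runs land on 12:34:56.
first_added = datetime(2022, 5, 13, 12, 34, 56)
print(next_run_after(first_added, daily, now))  # 2022-05-14 12:34:56
```

Either anchor gives a reproducible schedule under this model; only the time of day differs.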
I think we have the following options when no start_date is given:

- schedule_interval is an alias, e.g. “@daily” -> is a cron expression internally (0 0 * * *), so run at 00:00
- schedule_interval is cron, e.g. “0 0 * * *” -> cron expression determines when to run, 00:00:00 here
- schedule_interval is a timedelta, e.g. “timedelta(days=1)” -> only here we have no clear start_date and need something as a cutoff time, would use first added date as start_date, e.g. 12:34:56

So that would still result in deterministic DAG runs.

Bas

> On 13 May 2022, at 20:43, Ping Zhang <pin...@umich.edu> wrote:
>
> "starts whenever you first deploy it", this makes dags nondeterministic. It is true that currently it is very hard to achieve this. Maybe we could use a special start_date marker to indicate this behavior so that users can be very aware of what they are doing.
>
> There is also another case where start_date is required, if the schedule_interval is a timedelta object.
>
> Thanks,
>
> Ping
>
> On Fri, May 13, 2022 at 5:32 PM Collin McNulty <col...@astronomer.io.invalid> wrote:
> I disagree, start_date is None and catchup=True still describes a useful behavior that’s currently difficult to achieve in Airflow: a DAG that starts whenever you first deploy it and then catches up missed runs if you pause and unpause it or have downtime.
>
> On Thu, May 12, 2022 at 5:49 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> Yeah. Maybe simply start_date should only be required when catchup=True then? Sounds like it might correctly reflect the intention of catchup=True, while bringing a very solid semantic for explicit start_date.
>
> J.
>
> On Tue, May 10, 2022 at 11:14 PM Ping Zhang <pin...@umich.edu> wrote:
> I agree that for the crontab interval with `catchup=False`, the start_date does not make sense.
> However, the start_date is still very useful with catchup=True, whose default value is `True` (https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989). If the start_date defaults to None, this makes the dag non-portable, since the start_date could be different in different airflow envs.
>
> If we want to default the start_date to None, we need some rules to let users know that in some cases start_date cannot be None.
>
> Thanks,
>
> Ping
>
> On Mon, May 9, 2022 at 10:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> Coincidentally, this discussion in GitHub Discussions, started just now, has a clear use case where omitting start_date makes perfect sense: https://github.com/apache/airflow/discussions/23594
>
> On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <b...@astronomer.io.invalid> wrote:
> I never understood the requirement for start_date: 99% of the use cases simply want to start from the time the DAG is first added and do not explicitly need to start on a certain date. There is certainly a use case for start_date, but defaulting to None would make more sense IMO, and we could internally register the “first added date” as a start date instead.
>
> Bas
>
>> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> I think the only real need for start_date is the "catchup=True".
>> I think start_date is really part of the metadata of the DAG - that is really useful in order to determine range of backfill for example. So it's more an intention of the DAG author to describe when we actually want the DAG lifecycle started.
>> As such it is nice to keep in the "records" - if we do not have it, we simply do not know when the DAG should "start".
>> I mean - we could see it by historical DagRuns, but the problem is that if DagRuns are removed, that information is lost.
>>
>> But it does not have to be specified in the DAG() object in Python IMHO.
>>
>> I do not think we should actually remove the "start_date" from Dag model, but also I think it should be perfectly fine to simply set start_date in Dag model to "NOW()" if it is not passed. The NOW() should not really be NOW(), I think - because of the intricacies of "execution_date", "start_interval", "end_interval" it should be automatically adjusted. And here I am not sure exactly - either so that when you create a DAG without start_date, it starts immediately for the current interval, or starts for the future interval (not 100% sure how well it will play with custom timetables but I think it can be worked out rather easily).
>>
>> J.
>>
>> On Thu, May 5, 2022 at 2:30 PM Malthe <mbo...@gmail.com> wrote:
>> There's been some prior discussion on removing the requirement for a DAG without a schedule:
>>
>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>> - https://github.com/apache/airflow/pull/5423
>>
>> But why actually have the requirement at all?
>>
>> The documentation isn't particularly clear on why we need "start_date" and the whole idea seems somewhat confusing:
>>
>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>>
>> Consider:
>>
>> croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>>
>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression evaluates to:
>>
>> 2022-05-05T12:25:00
>>
>> That is, it's nicely aligned as you would expect.
>> I would assume from reading the code that this carries over to `CronDataIntervalTimetable` since it uses croniter in exactly this way.
>>
>> Must we require a "start_date"?
>
> --
>
> Collin McNulty
> Lead Airflow Engineer
>
> Email: col...@astronomer.io
> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>
> <https://www.astronomer.io/>
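
The alignment Malthe describes for `*/5 * * * *` can be mimicked with the standard library alone, for readers without croniter installed. This is only a sketch of that single expression, not a cron parser and not a claim about what `CronDataIntervalTimetable` does internally. `next_five_minute_boundary` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def next_five_minute_boundary(now: datetime) -> datetime:
    """Return the first wall-clock minute divisible by 5 strictly after `now`.

    Hypothetical helper that imitates "*/5 * * * *" only.
    """
    # Floor to the current 5-minute slot, then step one slot forward.
    floored = now.replace(minute=now.minute - now.minute % 5, second=0, microsecond=0)
    return floored + timedelta(minutes=5)


# The timestamp quoted in the thread.
now = datetime(2022, 5, 5, 12, 22, 16, 914769)
print(next_five_minute_boundary(now).isoformat())  # 2022-05-05T12:25:00
```

The result depends only on the wall clock and the expression, which is why no start date is needed to get an aligned run time in the cron case.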