Not in favour of a special marker because that’s essentially what start_date is 
for. Say somebody has a schedule_interval=timedelta(days=1) and wants their DAG 
to run at 00:00 without having to think of a specific start date, then they’d 
have to set start_date="random date and time 00:00" and catchup=False.

I think we have the following options when no start_date is given:
- schedule_interval is an alias, e.g. “@daily” -> this is a cron expression 
  internally (0 0 * * *), so run at 00:00
- schedule_interval is a cron expression, e.g. “0 0 * * *” -> the cron 
  expression determines when to run, 00:00:00 here
- schedule_interval is a timedelta, e.g. timedelta(days=1) -> only here we have 
  no clear start_date and need something as a cutoff time; we would use the 
  first added date as start_date, e.g. 12:34:56
So that would still result in deterministic DAG runs.
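
A minimal sketch of the three cases above, assuming a hypothetical helper 
effective_start() and a hypothetical alias table CRON_ALIASES (neither is an 
Airflow API). It only understands the daily-at-midnight cron expression used 
in the examples and is meant to illustrate the rule, not to implement it:

```python
from datetime import datetime, timedelta
from typing import Union

# Hypothetical alias table; only the alias mentioned in this thread.
CRON_ALIASES = {"@daily": "0 0 * * *"}


def effective_start(schedule: Union[str, timedelta], first_added: datetime) -> datetime:
    """Return the timestamp the schedule is anchored to when no start_date is given."""
    if isinstance(schedule, timedelta):
        # A timedelta has no wall-clock alignment, so the first-added
        # time itself becomes the anchor (e.g. 12:34:56).
        return first_added

    # Aliases are cron expressions internally.
    cron = CRON_ALIASES.get(schedule, schedule)
    if cron != "0 0 * * *":
        raise NotImplementedError("sketch only handles daily-at-midnight cron")

    # The cron expression decides the time of day: first midnight at or
    # after the moment the DAG was first added.
    midnight = first_added.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight if midnight == first_added else midnight + timedelta(days=1)
```

With a DAG first added at 2022-05-13 12:34:56, both “@daily” and “0 0 * * *” 
anchor to 2022-05-14 00:00:00, while timedelta(days=1) anchors to 12:34:56 on 
the day it was added.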

Bas

> On 13 May 2022, at 20:43, Ping Zhang <pin...@umich.edu> wrote:
> 
> "starts whenever you first deploy it", this makes dags nondeterministic. It 
> is true that currently it is very hard to achieve this. Maybe we could use a 
> special start_date marker to indicate this behavior so that users can be very 
> aware of what they are doing.
> 
> There is also another case where start_date is required, if the 
> schedule_interval is a timedelta object.
> 
> 
> Thanks,
> 
> Ping
> 
> 
> On Fri, May 13, 2022 at 5:32 PM Collin McNulty <col...@astronomer.io.invalid> 
> wrote:
> I disagree, start_date is None and catchup=True still describes a useful 
> behavior that’s currently difficult to achieve in Airflow: a DAG that starts 
> whenever you first deploy it and then catches up missed runs if you pause and 
> unpause it or have downtime. 
> 
> On Thu, May 12, 2022 at 5:49 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> Yeah. Maybe simply start_date should only be required when catchup=True then? 
>  Sounds like it might correctly reflect the intention of catchup=True, while 
> bringing a very solid semantic for explicit start_date. 
> 
> J.
> 
> 
> On Tue, May 10, 2022 at 11:14 PM Ping Zhang <pin...@umich.edu> wrote:
> I agree that for the crontab interval with `catchup=False`, the start_date 
> does not make sense. However, the start_date is still very useful with 
> catchup=True, which is the default value: 
> https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989
> If the start_date defaults to None, this makes the DAG non-portable, since 
> the start_date could be different in different Airflow envs. 
> 
> If we want to default the start_date to None, we need some rules to let users 
> know that in some cases start_date cannot be None.
> 
> 
> Thanks,
> 
> Ping
> 
> 
> On Mon, May 9, 2022 at 10:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> Coincidentally - this discussion in GitHub Discussions, started just now, has 
> a clear use case where omitting start_date makes perfect sense: 
> https://github.com/apache/airflow/discussions/23594
> On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <b...@astronomer.io.invalid> 
> wrote:
> I never understood the requirement for start_date — 99% of the use cases 
> simply want to start from the time the DAG is first added and do not 
> explicitly need to start on a certain date. There is certainly a use case for 
> start_date, but defaulting to None would make more sense IMO, and we could 
> internally register the “first added date” as a start date instead.
> 
> Bas
> 
>> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>> 
>> I think the only real need for start_date is the "catchup=True". 
>> I think start_date is really part of the metadata of the DAG - that is 
>> really useful in order to determine range of backfill for example. So it's 
>> more an intention of the DAG author to describe when we actually want the 
>> DAG lifecycle started.
>> As such it is nice to keep in the "records" - if we do not have it, we 
>> simply do not know when the DAG should "start". I mean - we could see it by 
>> historical DagRuns, but the problem is that if DagRuns are removed, that 
>> information is lost.
>> 
>> But it does not have to be specified in the DAG() object in Python IMHO
>> 
>> I do not think we should actually remove the "start_date" from Dag model, but 
>> also I think it should be perfectly fine to simply set start_date in Dag 
>> model to "NOW()" if it is not passed. The NOW() should not be NOW() really, I 
>> think - because of the intricacies of "execution_date", "start_interval", 
>> "end_interval" it should be automatically adjusted. And here I am not sure 
>> exactly - either so that when you create a DAG without start_date, it starts 
>> immediately for the current interval, or starts for the future interval (not 
>> 100% sure how well it will play with custom timetables, but I think it can be 
>> worked out rather easily).
>> 
>> J.
>> 
>> 
>> 
>> On Thu, May 5, 2022 at 2:30 PM Malthe <mbo...@gmail.com> wrote:
>> There's been some prior discussion on removing the requirement for a
>> DAG without a schedule:
>> 
>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>> - https://github.com/apache/airflow/pull/5423
>> 
>> But why actually have the requirement at all?
>> 
>> The documentation isn't particularly clear on why we need "start_date"
>> and the whole idea seems somewhat confusing:
>> 
>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>> 
>> Consider:
>> 
>>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>> 
>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
>> evaluates to:
>> 
>>      2022-05-05T12:25:00
>> 
>> That is, it's nicely aligned as you would expect. I would assume from
>> reading the code that this carries over to `CronDataIntervalTimetable`
>> since it uses croniter in exactly this way.
>> 
>> Must we require a "start_date"?
> 
> -- 
> 
> Collin McNulty
> Lead Airflow Engineer
> 
> Email: col...@astronomer.io <mailto:john....@astronomer.io>
> Time zone: US Central (CST UTC-6 / CDT UTC-5)
> 
> 
