More technical details: https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#how-to-communicate
On Tue, Mar 22, 2022 at 5:03 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> When there are differing opinions but there seems to be a favourable
> option, someone (actually anyone) might call for a vote:
> https://www.apache.org/foundation/voting.html#votes-on-code-modification
>
> For such votes, committers have binding votes. -1 is a veto (usually needs
> to be justified) and kills the proposal unless the person who vetoed
> changes their mind.
>
> J.
>
> On Tue, Mar 22, 2022 at 4:53 PM Philippe Lanoe <pla...@cloudera.com.invalid>
> wrote:
>
>> I agree with Jarek in the sense that the DAG developer **should** know
>> when the DAG should start; however, in practice, for time-based
>> scheduling it can be cumbersome to maintain, especially:
>> - every time the job evolves and gets updated / a new version is released
>> - when developers have to maintain hundreds or thousands of independent
>> jobs, keeping track of start_date for each of them can be difficult
>>
>> Not to mention that many companies do not have state-of-the-art CI/CD
>> processes which could allow them to dynamically change the start date.
>> In many cases, when a change is made to a job, the developers simply
>> want to update the job and have the next run take it into account.
>>
>> I also agree with Collin and Constance that "run last interval" is a
>> valid use case, and therefore this parameter could accept three values
>> to handle all of these cases. I would suggest:
>>
>> catchup=True: run all past intervals
>> catchup=False: do not run any past interval
>> catchup="last_interval" (or any better name :)): run only the most
>> recent interval
>>
>> I know that DAG authors who relied on catchup=False to run the last
>> interval would need to adjust their DAGs, but if a third option not to
>> trigger any past run is added, then DAG authors who relied on
>> catchup=False plus a set start date would also need to update their
>> DAGs to have the proper value.
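The semantics of the three proposed values could be sketched roughly as below. This is purely an illustration of the proposal, not an existing Airflow API: the `Catchup` enum, its value names, and the helper function are all hypothetical.

```python
from enum import Enum
from typing import List


class Catchup(Enum):
    """Hypothetical three-valued replacement for the boolean catchup flag."""
    ALL = "all"                      # today's catchup=True
    LAST_INTERVAL = "last_interval"  # today's catchup=False
    NONE = "none"                    # proposed: wait for the next interval


def intervals_to_schedule(mode: Catchup, past_intervals: List[str]) -> List[str]:
    """Return which already-elapsed intervals get a DAG run when the DAG
    is first enabled, under each proposed mode."""
    if mode is Catchup.ALL:
        return past_intervals
    if mode is Catchup.LAST_INTERVAL and past_intervals:
        return past_intervals[-1:]
    return []
```

Under this sketch, only `Catchup.NONE` would leave a freshly enabled DAG idle until its next full interval completes.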
>> And in my opinion, when I read catchup=False, the natural way of
>> reading it is "no catchup", therefore it would be better to fix it in
>> the right direction.
>>
>> What is the next step here? Who can decide / approve such a new feature
>> request?
>>
>> Thanks,
>> Philippe
>>
>> On Mon, Mar 21, 2022 at 4:05 PM Constance Martineau
>> <consta...@astronomer.io.invalid> wrote:
>>
>>> I've had a variation of this debate a few times, and the behaviour you
>>> find intuitive in my opinion comes down to your background (software
>>> engineer vs data engineer vs BI developer vs data analyst), industry
>>> standards, and the scope of responsibility DAG authors have at your
>>> organization. My vote is to extend the catchup setting to either run
>>> all intervals (catchup=True today), run the most recent interval
>>> (catchup=False today), or schedule only the next interval. I have seen
>>> organizations where each would be beneficial depending on the data
>>> pipeline in question.
>>>
>>> All of Alex's points are why I think we at least need the option.
>>>
>>> I came from an institutional investor, and we had plenty of DAGs that
>>> ran daily, weekly, monthly, quarterly and yearly.
>>>
>>> Many financial analysts - who were not DAG authors themselves - would
>>> have access to the Airflow webserver in order to rerun tasks. They did
>>> not have the ability to adjust the start_date. During audit season, it
>>> was common to see yearly DAGs being run for earlier years. Supporting
>>> this meant we needed to set a start date in an earlier year. I saw DAG
>>> authors deal with this in two ways: set the start_date to the first
>>> day of the prior year to get the DAG out and let it run, then modify
>>> the start_date to something earlier; or set the start_date to
>>> something earlier, watch the DAG, and quickly update the state of the
>>> DAG runs to success (or failed).
>>> One is better than the other (no fun explaining to an executive why
>>> reports were accidentally sent externally), but neither is great.
>>> Option 3 - setting the start_date inside the desired data interval and
>>> leaving it - always caused confusion with other stakeholders.
>>>
>>> A global default and a DAG-level option would have been amazing.
>>>
>>> On Sun, Mar 20, 2022 at 5:15 PM Collin McNulty
>>> <col...@astronomer.io.invalid> wrote:
>>>
>>>> While that's true, I think there are often stakeholders that expect a
>>>> DAG to run only on the day for which it is scheduled. It's pretty
>>>> straightforward for me to explain to non-technical stakeholders that
>>>> "aw shucks, we deployed just a little too late for this week's run,
>>>> we'll run it manually to fix it". On the contrary, explaining why a
>>>> DAG that I said would run on Tuesdays sent out an alert on a Friday
>>>> to a VP of Finance is ... rough. I understand that Airflow does not
>>>> make guarantees about when tasks will execute, but I try to scale
>>>> such that when a task can start and when it does start are close
>>>> enough that I don't have to explain the difference to other
>>>> stakeholders.
>>>>
>>>> Editing start_date can also be tough in some conditions. If I'm
>>>> baking a DAG into an image, using build-once-deploy-to-many CI/CD,
>>>> and testing in a lower environment for longer than the interval
>>>> between runs, I'm toast on setting the start_date to avoid what I
>>>> consider a spurious run. That's a lot of "ands", but I think it's a
>>>> fairly common set of circumstances we should support.
>>>>
>>>> Collin McNulty
>>>>
>>>> On Sun, Mar 20, 2022 at 3:12 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Good. Love some mental stretching :).
>>>>>
>>>>> I believe you should **not** base the time of your run on the time
>>>>> it is released. Should the DAG author not know when there is a
>>>>> "start date" planned for the DAG?
>>>>> Should the decision on when the DAG interval starts be made on a
>>>>> combination of both the start date in the DAG **and** the time of
>>>>> not only when it's merged, but actually when Airflow first
>>>>> **parses** the DAG? Not even mentioning the time zone issues.
>>>>>
>>>>> Imagine the case where a DAG is merged 5 minutes before the midnight
>>>>> between Mon/Tue and you have many DAGs. So many that parsing all the
>>>>> DAGs can take 20 minutes. Then whether your DAG runs this interval
>>>>> or that one depends not only on the decision of when it is merged,
>>>>> but also on how long it takes Airflow to get to parse your DAG for
>>>>> the first time.
>>>>>
>>>>> Sounds pretty crazy :).
>>>>>
>>>>> J.
>>>>>
>>>>> On Sun, Mar 20, 2022 at 9:02 PM Collin McNulty
>>>>> <col...@astronomer.io.invalid> wrote:
>>>>>
>>>>>> Jarek,
>>>>>>
>>>>>> I tend to agree with you on this, but let me play devil's advocate.
>>>>>> If I have a DAG that runs a report every Tuesday, I might want it
>>>>>> to run every Tuesday starting whenever I am able to release the
>>>>>> DAG. But if I release on a Friday, I don't want it to try to run
>>>>>> "for" last Tuesday. In this case, the correct start_date for the
>>>>>> DAG is the day I release the DAG, but I don't know this date ahead
>>>>>> of time and it differs per environment. Doing this properly seems
>>>>>> doable with a CD process that edits the DAG to insert the
>>>>>> start_date, but that's fairly sophisticated tooling for a scenario
>>>>>> that I imagine is quite common.
>>>>>>
>>>>>> Collin McNulty
>>>>>>
>>>>>> On Sun, Mar 20, 2022 at 1:55 PM Jarek Potiuk <ja...@potiuk.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Once again - why is it bad to set a start_date in the future,
>>>>>>> when - well - you **actually** want to run the first interval in
>>>>>>> the future? What prevents you from setting the start_date to be a
>>>>>>> fixed time in the future, where the start date is within the
>>>>>>> interval you want to start first?
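The "CD process that edits the DAG to insert the start_date" that Collin mentions above can be approximated without rewriting the DAG file: the pipeline injects the deploy date through an environment variable that the DAG file reads at parse time. A minimal sketch; the variable name, fallback date, and helper are all hypothetical, not an Airflow convention.

```python
import os
from datetime import datetime, timezone


def resolve_start_date(env_var: str = "DAG_START_DATE",
                       fallback: str = "2022-01-01") -> datetime:
    """Return the start_date injected by the CD pipeline (ISO date in an
    environment variable), falling back to a pinned date for local runs."""
    raw = os.environ.get(env_var, fallback)
    return datetime.fromisoformat(raw).replace(tzinfo=timezone.utc)
```

The DAG file would then pass `start_date=resolve_start_date()` to its `DAG(...)` constructor, so each environment gets the date its deployment set.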
>>>>>>> Is it just "I do not want to specify a date, so whatever past
>>>>>>> date is easy to type will do"? If this is the only reason, then
>>>>>>> it has a big drawback - because "start_date" is **actually**
>>>>>>> supposed to be the piece of metadata for the DAG that tells you
>>>>>>> what the intention of the DAG writer was on when to start it. And
>>>>>>> precisely the one that allows you to start things in the future.
>>>>>>>
>>>>>>> Am I missing something?
>>>>>>>
>>>>>>> On Sun, Mar 20, 2022 at 7:42 PM Larry Komenda
>>>>>>> <avoicelikerunningwa...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Alex, that's a good point regarding the need to run a DAG for
>>>>>>> > the most recent schedule interval right away. I hadn't thought
>>>>>>> > of that scenario, as I haven't needed to build a DAG with that
>>>>>>> > large of a scheduling gap. In that case I agree with you - it
>>>>>>> > seems like it would make more sense to make this configurable.
>>>>>>> >
>>>>>>> > Perhaps there could be an additional DAG-level parameter that
>>>>>>> > could be set alongside "catchup" to control this behavior. Or
>>>>>>> > there could be a new parameter that could eventually replace
>>>>>>> > "catchup" that supports 3 options - "catchup", "run most recent
>>>>>>> > interval only", and "run next interval only".
>>>>>>> >
>>>>>>> > On Sat, Mar 19, 2022 at 1:02 PM Alex Begg <alex.b...@gmail.com>
>>>>>>> > wrote:
>>>>>>> >>
>>>>>>> >> I would not consider it a bug to have the latest data interval
>>>>>>> >> run when you enable a DAG that is set to catchup=False.
>>>>>>> >>
>>>>>>> >> I have a legitimate use for that feature: my production
>>>>>>> >> environment has catchup_by_default=True but my lower
>>>>>>> >> environments use catchup_by_default=False, meaning if I want
>>>>>>> >> to test the DAG behavior as scheduled in a lower environment I
>>>>>>> >> can just enable the DAG.
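For reference, the per-environment default Alex describes is Airflow's `catchup_by_default` option, set under `[scheduler]` in airflow.cfg. DAGs that do not pass `catchup=` explicitly inherit it, so each environment can choose its own default:

```ini
[scheduler]
# Production: DAGs without an explicit catchup= argument catch up.
# In staging/dev, set this to False instead.
catchup_by_default = True
```

The same setting can be supplied per deployment via the standard `AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT` environment variable, which avoids maintaining separate config files.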
>>>>>>> >> For example, in a staging environment, if I need to test the
>>>>>>> >> functionality of a DAG that is scheduled @monthly and there is
>>>>>>> >> no way to run the most recent data interval, then to test a
>>>>>>> >> true data interval of the DAG it could be many days, even
>>>>>>> >> weeks, until the next one occurs.
>>>>>>> >>
>>>>>>> >> Triggering a DAG won't run the latest data interval, it will
>>>>>>> >> use the current time as the logical_date, right? So that won't
>>>>>>> >> let me test a single as-scheduled data interval. So in that
>>>>>>> >> @monthly scenario it will be impossible for me to test the
>>>>>>> >> functionality of a single data interval unless I wait multiple
>>>>>>> >> weeks.
>>>>>>> >>
>>>>>>> >> I see there could be a desire to not run the latest data
>>>>>>> >> interval and just start with whatever full interval follows
>>>>>>> >> the DAG being turned on. However, I think that should be
>>>>>>> >> configurable, not fixed permanently.
>>>>>>> >>
>>>>>>> >> Alternatively, it could be ideal to have a way to trigger a
>>>>>>> >> specific run for a catchup=False DAG that just got enabled, by
>>>>>>> >> adding a 3rd option to the trigger button dropdown to trigger
>>>>>>> >> a past scheduled run. Then in that dialog the form can default
>>>>>>> >> to the most recent full data interval but also let you specify
>>>>>>> >> a specific past interval based on the DAG's schedule. I have
>>>>>>> >> often had to debug a DAG in production where I wanted to
>>>>>>> >> trigger a specific past data interval, not just the most
>>>>>>> >> recent.
>>>>>>> >>
>>>>>>> >> Alex Begg
>>>>>>> >>
>>>>>>> >> On Thu, Mar 17, 2022 at 4:58 PM Larry Komenda
>>>>>>> >> <avoicelikerunningwa...@gmail.com> wrote:
>>>>>>> >>>
>>>>>>> >>> I agree with this. I'd much rather have to trigger a single
>>>>>>> >>> manual run the first time I enable a DAG than either wait to
>>>>>>> >>> enable it until after I want it to run or edit the start_date
>>>>>>> >>> of the DAG itself.
>>>>>>> >>> I'd be in favor of adjusting this behavior either permanently
>>>>>>> >>> or via a configuration option.
>>>>>>> >>>
>>>>>>> >>> On Fri, Mar 4, 2022 at 3:00 PM Philippe Lanoe
>>>>>>> >>> <pla...@cloudera.com.invalid> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Hello Daniel,
>>>>>>> >>>>
>>>>>>> >>>> Thank you for your answer. In your example, as I experienced
>>>>>>> >>>> it, the first run would not be 2010-01-01 but 2022-03-03
>>>>>>> >>>> 00:00:00 (it is currently March 4, 21:00 here), which is the
>>>>>>> >>>> execution date corresponding to the start of the previous
>>>>>>> >>>> data interval, but the result is the same: an undesired DAG
>>>>>>> >>>> run. (For instance, in the case of the cron schedule
>>>>>>> >>>> '00 22 * * *', one DAG run would be started immediately with
>>>>>>> >>>> an execution date of 2022-03-02 22:00:00.)
>>>>>>> >>>>
>>>>>>> >>>> I also agree with you that it could be categorized as a bug,
>>>>>>> >>>> and I would also vote for a fix.
>>>>>>> >>>>
>>>>>>> >>>> It would be great to have the feedback of others on this.
>>>>>>> >>>>
>>>>>>> >>>> On Fri, Mar 4, 2022 at 6:17 PM Daniel Standish
>>>>>>> >>>> <daniel.stand...@astronomer.io.invalid> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> You are saying that when you turn on, for the first time, a
>>>>>>> >>>>> DAG with e.g. an @daily schedule and catchup=False, if the
>>>>>>> >>>>> start date is 2010-01-01, then it would first run the
>>>>>>> >>>>> 2010-01-01 run, then the current run (whatever yesterday
>>>>>>> >>>>> is)? That sounds familiar.
>>>>>>> >>>>>
>>>>>>> >>>>> Yeah, I don't like that behavior. I agree that, as you say,
>>>>>>> >>>>> it's not the intuitive behavior. It seems it could
>>>>>>> >>>>> reasonably be categorized as a bug. I'd prefer we just
>>>>>>> >>>>> "fix" it rather than making it configurable, but some might
>>>>>>> >>>>> have concerns re backcompat.
>>>>>>> >>>>>
>>>>>>> >>>>> What do others think?
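Philippe's cron example can be checked with plain datetime arithmetic. The sketch below (the helper name is ours; it hard-codes a daily '00 22 * * *' schedule rather than parsing cron) computes the logical/execution date of the run that starts immediately: the start of the most recently *completed* data interval.

```python
from datetime import datetime, timedelta


def last_completed_interval_start(now: datetime, hour: int = 22) -> datetime:
    """For a daily '0 <hour> * * *' schedule, return the logical_date of the
    run Airflow would create immediately when the DAG is enabled at `now`:
    the start of the most recently completed data interval."""
    # Most recent schedule tick at or before `now`.
    tick = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if tick > now:
        tick -= timedelta(days=1)
    # The interval that *ends* at that tick began one period earlier;
    # its start is the run's logical_date (a.k.a. execution_date).
    return tick - timedelta(days=1)
```

With Philippe's numbers - DAG enabled at 2022-03-04 21:00 - the tick at or before "now" is 2022-03-03 22:00, so the immediate run carries execution date 2022-03-02 22:00:00, matching his observation.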
>>>>>> --
>>>>>> Collin McNulty
>>>>>> Lead Airflow Engineer
>>>>>>
>>>>>> Email: col...@astronomer.io
>>>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>>>>
>>>>>> <https://www.astronomer.io/>
>>>
>>> --
>>> Constance Martineau
>>> Product Manager
>>>
>>> Email: consta...@astronomer.io
>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>
>>> <https://www.astronomer.io/>