Summary of changes so far on the AIP:
The proposed new name for DagRun.execution_date is now
DagRun.schedule_date (previously I had proposed run_date -- thanks
dstandish!)
Timetable classes are renamed to Schedule classes (CronSchedule etc),
similarly the DAG argument is now schedule (reminder: schedule_interval
will not be removed or deprecated, and will still be the way to use
"simple" expressions)
-ash
On Wed, 10 Mar, 2021 at 14:15, Ash Berlin-Taylor <a...@apache.org> wrote:
Could change Timetable to Schedule -- that would mean the DAG arg
becomes `schedule=CronSchedule(...)` -- a bit close to the current
`schedule_interval`, but I think the difference is clear enough.
I do like the name, but my one worry with "schedule" is that Scheduler
and Schedule are very similar, and might be confused with each
other by non-native English speakers? (I defer to others' judgment
here, as this is not something I can experience myself.)
@Kevin Yang <mailto:yrql...@gmail.com> @Daniel Standish
<mailto:dpstand...@gmail.com> any final input on this AIP?
On Tue, 9 Mar, 2021 at 16:59, Kaxil Naik <kaxiln...@gmail.com> wrote:
Hi Ash and all,
What do people think of this? Worth it? Too complex to reason about
what context variables might exist as a result?
I think I wouldn't worry about it right now, or at least not as part
of this AIP. Currently, in one of the GitHub issues, a user mentioned
that it is not straightforward to know what is inside the context
dictionary <https://github.com/apache/airflow/issues/14396>. So
maybe we can tackle this issue separately once the AbstractTimetable
is built.
Should "interval-less DAGs" (ones using "CronTimetable" in my
proposal vs "DataTimetable") have data_interval_start and end
available in the context?
Hmm.. I would say no, but then that contradicts my suggestion to
remove the context dict changes from this AIP. If we are going to use
it in the scheduler, then yes, with data_interval_start =
data_interval_end coming from CronTimetable.
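To make that concrete, a point-in-time timetable could return a
zero-width interval (a sketch only -- the DagRunInfo field names are
my guess at the proposal, not settled API):

    # Hypothetical CronTimetable.next_dagrun_info: both interval bounds
    # are the fire time, so data_interval_start == data_interval_end.
    def next_dagrun_info(self, date_last_automated_dagrun, session):
        next_time = self._next_fire_time(date_last_automated_dagrun)  # illustrative helper
        if next_time is None:
            return None
        return DagRunInfo(
            schedule_date=next_time,
            data_interval_start=next_time,
            data_interval_end=next_time,
        )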
Does anyone have any better names than TimeDeltaTimetable,
DataTimetable, and CronTimetable? (We can probably change these
names right up until release, so not important to get this correct
/now/.)
No strong opinion here. Just an alternative suggestion:
TimeDeltaSchedule, DataSchedule and CronSchedule.
Should I try to roll AIP-30
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>
in to this, or should we make that a future addition? (My vote is
for future addition)
I would vote for Future addition too.
Regards,
Kaxil
On Sat, Mar 6, 2021 at 11:05 AM Ash Berlin-Taylor <a...@apache.org
<mailto:a...@apache.org>> wrote:
I think, yes, AIP-35 or something like it would happily co-exist
with this proposal.
@Daniel <mailto:dpstand...@gmail.com> and I have been discussing
this a bit on Slack, and one of the questions he asked was whether
the concept of data_interval should be moved from the DagRun (as
James and I suggested) down to the individual task:
suppose I have a new DAG hitting 5 API endpoints and pulling data
to S3, and suppose that yesterday 4 of them succeeded but one
failed. Today, 4 of them should pull from yesterday, but the one
that failed should pull from 2 days back. So even though these
normally have the same interval, today they should not.
My view on this is twofold: first, this should primarily be handled
by retries on the task; and second, having different TaskInstances
in the same DagRun with different data intervals would be much
harder to reason about and design the UI around. So for those
reasons I still think the interval should be a DagRun-level concept.
(He has a stalled AIP-30 where he proposed something to address
this kind of "watermark" case, which we might pick up next after
this AIP is complete)
One thing we might want to do is extend the interface of
AbstractTimetable to be able to add/update parameters in the
context dict, so the interface could become this:
    from abc import ABC, abstractmethod
    from typing import Any, Dict, Optional

    import pendulum
    from sqlalchemy.orm import Session

    # DagRun is the existing model; DagRunInfo is the new class
    # proposed in this AIP.
    from airflow.models import DagRun

    class AbstractTimetable(ABC):
        @abstractmethod
        def next_dagrun_info(
            self,
            date_last_automated_dagrun: Optional[pendulum.DateTime],
            session: Session,
        ) -> Optional["DagRunInfo"]:
            """
            Get information about the next DagRun of this dag after
            ``date_last_automated_dagrun`` -- the execution date, and
            the earliest it could be scheduled.

            :param date_last_automated_dagrun: The max(execution_date)
                of existing "automated" DagRuns for this dag (scheduled
                or backfill, but not manual)
            """

        @abstractmethod
        def set_context_variables(
            self, dagrun: DagRun, context: Dict[str, Any]
        ) -> None:
            """
            Update or set new context variables to become available in
            task templates and operators.
            """
What do people think of this? Worth it? Too complex to reason about
what context variables might exist as a result?
*Outstanding questions*:
- Should "interval-less DAGs" (ones using "CronTimetable" in my
proposal vs "DataTimetable") have data_interval_start and end
available in the context?
- Does anyone have any better names than TimeDeltaTimetable,
DataTimetable, and CronTimetable? (We can probably change these
names right up until release, so not important to get this correct
/now/.)
- Should I try to roll AIP-30
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>
in to this, or should we make that a future addition? (My vote is
for future addition)
I'd like to start voting on this AIP next week (probably on
Tuesday), as I think this will be a powerful feature that eases
confusion for new users.
On Tue, 2 Mar, 2021 at 23:05, Alex Inhert <alexinh...@yandex.com
<mailto:alexinh...@yandex.com>> wrote:
Is this AIP going to co-exist with AIP-35 "Add Signal Based
Scheduling To Airflow"?
I think streaming was also discussed there (though it wasn't
really the use case).
02.03.2021, 22:10, "Ash Berlin-Taylor" <a...@apache.org
<mailto:a...@apache.org>>:
Hi Kevin,
Interesting idea. My original idea was actually that
"interval-less DAGs" (i.e. ones where it's just "run at this
time") would not have data_interval_start or end, but (while
drafting the AIP) we decided that it was probably "easier" if
those values were always datetimes.
That said, I think making those values nullable in the DB model
would future-proof it without needing another migration later. Do
you think this is worth doing now?
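(For concreteness, a sketch of that nullability on the model --
column names assumed from the proposal:)

    from sqlalchemy import Column
    from airflow.utils.sqlalchemy import UtcDateTime

    # Hypothetical DagRun columns: nullable, so interval-less runs
    # can store NULL instead of a synthetic datetime.
    data_interval_start = Column(UtcDateTime, nullable=True)
    data_interval_end = Column(UtcDateTime, nullable=True)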
I haven't (yet! It's on my list) spent any significant time
thinking about how to make Airflow play nicely with streaming
jobs. If anyone else has ideas here, please share them.
-ash
On Sat, 27 Feb, 2021 at 16:09, Kevin Yang <yrql...@gmail.com
<mailto:yrql...@gmail.com>> wrote:
Hi Ash and James,
This is an exciting move. What do you think about using this
opportunity to extend Airflow's support to streaming-like use
cases, i.e. DAGs/tasks that want to run forever like a service?
For such use cases a schedule interval might not be meaningful,
so do we want to make the data interval params optional on
DagRun and task instances? That sounds like a pretty major
change to the underlying model of Airflow, but this AIP is so
far the best opportunity I've seen to level up Airflow's
support for streaming/service use cases.
Cheers,
Kevin Y
On Fri, Feb 26, 2021 at 8:56 AM Daniel Standish
<dpstand...@gmail.com <mailto:dpstand...@gmail.com>> wrote:
Very excited to see this proposal come through and love the
direction this has gone.
Couple comments...
*Tree view / Data completeness view*
When you design your tasks with the canonical idempotence
pattern, the tree view shows you both data completeness and
task execution history (success / failure etc).
When you don't use that pattern (which is my general
preference), tree view is only task execution history.
This change has the potential to unlock a data completeness
view for canonical tasks. It's possible that the "data
completeness view" can simply be the tree view. I.e. somehow
it can use these new classes to know what data was successfully
filled and what data wasn't.
To the extent we like the idea of either extending / plugging /
modifying the tree view, or adding a distinct data completeness
view, we might want to anticipate its needs in this change. Maybe
no alteration to the proposal is needed, but I wanted to throw the
idea out there.
*Watermark workflow / incremental processing*
A common pattern in data warehousing is pulling data
incrementally from a source.
A standard way to achieve this: at the start of the task, select
max `updated_at` from the source table and hold on to that value
for a minute. This is your tentative new high watermark.
Now it's time to pull your data: if your task has run before, grab
the last high watermark; if not, use an initial load value.
If the pull succeeds, update the high watermark.
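In plain Python the whole loop might look like this (every name
here is illustrative -- source, sink and state_store are stand-ins
for whatever storage you use):

    INITIAL_LOAD_VALUE = "1970-01-01 00:00:00"

    def pull_incrementally(source, sink, state_store, task_id):
        # Tentative new high watermark: max updated_at now in the source.
        candidate = source.scalar("SELECT max(updated_at) FROM src_table")

        # Last committed watermark if the task ran before, else the
        # initial load value.
        low = state_store.get(task_id, INITIAL_LOAD_VALUE)

        rows = source.fetch(
            "SELECT * FROM src_table WHERE updated_at > %s AND updated_at <= %s",
            (low, candidate),
        )
        sink.write(rows)

        # Only advance the watermark once the load has succeeded.
        state_store.set(task_id, candidate)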
On my team we implemented this with a stateful tasks / stateful
processes concept (there's a dormant draft AIP here
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>)
and a WatermarkOperator that handled the boilerplate*.
Again here, I don't have a specific suggestion at this moment,
but I wanted to articulate this workflow because it is common,
and it wasn't immediately obvious to me when reading AIP-39 how I
would use it to implement this pattern.
AIP-39 makes Airflow more data-aware, so if it can support
this kind of workflow, that's great. @Ash Berlin-Taylor
<mailto:a...@astronomer.io> do you have thoughts on how it might
be compatible with this kind of thing as-is?
---
* The base operator is designed so that subclasses only need to
implement two methods:
- `get_high_watermark`: produce the tentative new high watermark
- `watermark_execute`: analogous to implementing poke in a
sensor, this is where your work is done. `execute` is left to
the base class, and it orchestrates (1) getting the last high
watermark or initial load value and (2) updating the new high
watermark if the job succeeds.
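A minimal skeleton of how I read that design (not the actual code;
get_state/set_state stand in for AIP-30-style persistence):

    from airflow.models import BaseOperator

    class WatermarkOperator(BaseOperator):
        """Base class; subclasses implement only the two methods below."""

        initial_load_value = None  # e.g. "1970-01-01"

        def get_high_watermark(self, context):
            """Produce the tentative new high watermark."""
            raise NotImplementedError

        def watermark_execute(self, context, low, high):
            """Analogous to a sensor's poke: do the work for (low, high]."""
            raise NotImplementedError

        def execute(self, context):
            high = self.get_high_watermark(context)
            low = self.get_state() or self.initial_load_value  # hypothetical state helper
            self.watermark_execute(context, low, high)
            self.set_state(high)  # hypothetical; only advances on success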