Re: [DISCUSS][AIP-39] Richer (and pluggable) schedule_interval on DAGs

Ash Berlin-Taylor Sat, 06 Mar 2021 03:05:02 -0800

I think, yes, AIP-35 or something like it would happily co-exist with this proposal.

@Daniel <mailto:dpstand...@gmail.com> and I have been discussing this a bit on Slack, and one of the questions he asked was whether the concept of data_interval should be moved from DagRun, as James and I suggested, down on to the individual task:

suppose i have a new dag hitting 5 api endpoints and pulling data to s3. suppose that yesterday 4 of them succeeded but one failed. today, 4 of them should pull from yesterday. but the one that failed should pull from 2 days back. so even though these normally have the same interval, today they should not.

My view on this is twofold: first, this should primarily be handled by retries on the task; second, having different TaskInstances in the same DagRun have different data intervals would be much harder to reason about and to design the UI around. For those reasons I still think interval should be a DagRun-level concept.

(He has a stalled AIP-30 where he proposed something to address this kind of "watermark" case, which we might pick up next after this AIP is complete)
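
To make the DagRun-level argument concrete, here is a minimal sketch in plain Python with no Airflow imports. `RunSketch`, `pull`, `execute` and the endpoint names are all hypothetical, and the retry is assumed to succeed inside yesterday's run. The only point is that a failed task retried within its own run keeps that run's interval, so today's five tasks can all share today's interval.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Set, Tuple

ENDPOINTS = ["a", "b", "c", "d", "e"]  # hypothetical API endpoints


@dataclass
class RunSketch:
    """Hypothetical stand-in for a DagRun: one interval shared by all tasks."""
    start: date
    end: date
    attempts: Dict[str, int] = field(default_factory=dict)  # tries per endpoint


def pull(run: RunSketch, endpoint: str, flaky: Set[str]) -> bool:
    """Pretend to pull [run.start, run.end); flaky endpoints fail once."""
    run.attempts[endpoint] = run.attempts.get(endpoint, 0) + 1
    return not (endpoint in flaky and run.attempts[endpoint] == 1)


def execute(
    run: RunSketch, flaky: Set[str], max_tries: int = 2
) -> List[Tuple[str, date, date]]:
    """Run every endpoint task, retrying a failed task inside the same run."""
    pulled = []
    for endpoint in ENDPOINTS:
        for _ in range(max_tries):
            if pull(run, endpoint, flaky):
                pulled.append((endpoint, run.start, run.end))
                break
    return pulled


day = date(2021, 3, 4)
yesterday = RunSketch(day, day + timedelta(days=1))
today = RunSketch(yesterday.end, yesterday.end + timedelta(days=1))

# Endpoint "e" fails once in yesterday's run and is retried there.
pulled = execute(yesterday, flaky={"e"}) + execute(today, flaky=set())
```

If the retries were exhausted instead, the same reasoning would apply to clearing the failed task in yesterday's run later on: it would still pull yesterday's interval.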

One thing we might want to do is extend the interface of AbstractTimetable to be able to add/update parameters in the context dict, so the interface could become this:


class AbstractTimetable(ABC):
    @abstractmethod
    def next_dagrun_info(
        self,
        date_last_automated_dagrun: Optional[pendulum.DateTime],
        session: Session,
    ) -> Optional[DagRunInfo]:
        """
        Get information about the next DagRun of this dag after
        ``date_last_automated_dagrun`` -- the execution date, and the
        earliest it could be scheduled

        :param date_last_automated_dagrun: The max(execution_date) of
            existing "automated" DagRuns for this dag (scheduled or
            backfill, but not manual)
        """

    @abstractmethod
    def set_context_variables(self, dagrun: DagRun, context: Dict[str, Any]) -> None:
        """
        Update or set new context variables to become available in task
        templates and operators.
        """

What do people think of this? Worth it? Too complex to reason about what context variables might exist as a result?
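
For anyone weighing that question, here is a rough, self-contained sketch of what a concrete subclass might look like. Everything beyond the two method names is an assumption made for illustration: `DagRunInfo` and `DagRun` are stand-in dataclasses with guessed fields, `FixedDeltaTimetable` and the `data_interval_seconds` variable are hypothetical, the `session` argument is dropped, and stdlib `datetime` replaces pendulum so the snippet runs without Airflow installed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


@dataclass
class DagRunInfo:
    """Stand-in only; these fields are assumptions, not the AIP's final shape."""
    data_interval_start: datetime
    data_interval_end: datetime
    schedule_at: datetime  # earliest moment the run could be scheduled


@dataclass
class DagRun:
    """Stand-in for the real DagRun model, reduced to the interval."""
    data_interval_start: datetime
    data_interval_end: datetime


class AbstractTimetable(ABC):
    @abstractmethod
    def next_dagrun_info(
        self, date_last_automated_dagrun: Optional[datetime]
    ) -> Optional[DagRunInfo]:
        """Describe the next run after the given execution date."""

    @abstractmethod
    def set_context_variables(self, dagrun: DagRun, context: Dict[str, Any]) -> None:
        """Add or update template context variables for this run."""


class FixedDeltaTimetable(AbstractTimetable):
    """Hypothetical timetable with back-to-back intervals of a fixed length."""

    def __init__(self, start: datetime, delta: timedelta) -> None:
        self.start = start
        self.delta = delta

    def next_dagrun_info(self, date_last_automated_dagrun):
        if date_last_automated_dagrun is None:
            interval_start = self.start  # no automated run yet
        else:
            # Assumes execution_date marks the start of the previous interval.
            interval_start = date_last_automated_dagrun + self.delta
        interval_end = interval_start + self.delta
        return DagRunInfo(interval_start, interval_end, schedule_at=interval_end)

    def set_context_variables(self, dagrun, context):
        context["data_interval_start"] = dagrun.data_interval_start
        context["data_interval_end"] = dagrun.data_interval_end
        # A timetable-specific extra variable, which is the part under discussion.
        context["data_interval_seconds"] = self.delta.total_seconds()


tt = FixedDeltaTimetable(datetime(2021, 3, 1, tzinfo=timezone.utc), timedelta(days=1))
first = tt.next_dagrun_info(None)
second = tt.next_dagrun_info(first.data_interval_start)
context: Dict[str, Any] = {}
tt.set_context_variables(
    DagRun(second.data_interval_start, second.data_interval_end), context
)
```

The extra `data_interval_seconds` key is the kind of timetable-specific variable that makes it harder to know, from a template alone, which context variables exist.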


*Outstanding questions*:

- Should "interval-less DAGs" (ones using "CronTimetable" in my proposal vs "DataTimetable") have data_interval_start and end available in the context?
- Does anyone have any better names than TimeDeltaTimetable, DataTimetable, and CronTimetable? (We can probably change these names right up until release, so not important to get this correct /now/.)
- Should I try to roll AIP-30 <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence> into this, or should we make that a future addition? (My vote is for future addition.)

I'd like to start voting on this AIP next week (probably on Tuesday), as I think this will be a powerful feature that eases confusion for new users.


-Ash


On Tue, 2 Mar, 2021 at 23:05, Alex Inhert <alexinh...@yandex.com> wrote:

Is this AIP going to co-exist with AIP-35 "Add Signal Based Scheduling To Airflow"? I think streaming was also discussed there (though it wasn't really the use case).
02.03.2021, 22:10, "Ash Berlin-Taylor" <a...@apache.org>:
Hi Kevin,
Interesting idea. My original idea was actually that "interval-less DAGs" (i.e. ones where it's just "run at this time") would not have data_interval_start or end, but (while drafting the AIP) we decided that it was probably "easier" if those values were always datetimes.
That said, I think having the DB model have those values be nullable would future-proof it without needing another migration to change it. Do you think this is worth doing now?
I haven't (yet! It's on my list) spent any significant time thinking about how to make Airflow play nicely with streaming jobs. If anyone else has ideas here please share them.
-ash
On Sat, 27 Feb, 2021 at 16:09, Kevin Yang <yrql...@gmail.com> wrote:
Hi Ash and James,
This is an exciting move. What do you think about using this opportunity to extend Airflow's support to streaming-like use cases? I.e. DAGs/tasks that want to run forever like a service. For such use cases, schedule interval might not be meaningful, so do we want to make the date interval param optional on DagRun and task instances? That sounds like a pretty major change to the underlying model of Airflow, but this AIP is so far the best opportunity I have seen to level up Airflow's support for streaming/service use cases.
Cheers,
Kevin Y
On Fri, Feb 26, 2021 at 8:56 AM Daniel Standish <dpstand...@gmail.com> wrote:
Very excited to see this proposal come through and love the direction this has gone.
Couple comments...

*Tree view / Data completeness view*
When you design your tasks with the canonical idempotence pattern, the tree view shows you both data completeness and task execution history (success / failure etc).
When you don't use that pattern (which is my general preference), tree view is only task execution history.
This change has the potential to unlock a data completeness view for canonical tasks. It's possible that the "data completeness view" can simply be the tree view. I.e. somehow it can use these new classes to know what data was successfully filled and what data wasn't.
To the extent we like the idea of either extending / plugging / modifying tree view, or adding a distinct data completeness view, we might want to anticipate the needs of that in this change. And maybe no alteration to the proposal would be needed but just want to throw the idea out there.
*Watermark workflow / incremental processing*
A common pattern in data warehousing is pulling data incrementallyfrom a source.
A standard way to achieve this is at the start of the task, select max `updated_at` in source table and hold on to that value for a minute. This is your tentative new high watermark. Now it's time to pull your data. If your task ran before, grab last high watermark. If not, use initial load value.
If successful, update high watermark.
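
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `incremental_pull`, the in-memory `state` dict, and the integer `updated_at` values are all hypothetical stand-ins for a real source table and persisted task state.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str]  # (updated_at, payload); ints stand in for timestamps


def incremental_pull(
    source: List[Row],
    state: Dict[str, int],
    key: str,
    initial_load_value: int,
    load: Callable[[List[Row]], None],
) -> int:
    """Pull rows newer than the stored watermark; return how many were loaded."""
    # Last high watermark if the task ran before, else the initial load value.
    last_high = state.get(key, initial_load_value)
    # Tentative new high watermark, taken before pulling any data.
    new_high = max((updated_at for updated_at, _ in source), default=last_high)
    batch = [row for row in source if last_high < row[0] <= new_high]
    load(batch)  # if this raises, the stored watermark is left untouched
    state[key] = new_high  # success: commit the new high watermark
    return len(batch)


source: List[Row] = [(1, "a"), (2, "b")]
state: Dict[str, int] = {}
sink: List[Row] = []

first = incremental_pull(source, state, "orders", 0, sink.extend)
source.append((5, "c"))
second = incremental_pull(source, state, "orders", 0, sink.extend)
```

Note that the range pulled depends on what the previous successful run recorded, not on a fixed schedule interval.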
On my team we implemented this with a stateful tasks / stateful processes concept (there's a dormant draft AIP here <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>) and a WatermarkOperator that handled the boilerplate*.
Again here, I don't have a specific suggestion at this moment. But I wanted to articulate this workflow because it is common and it wasn't immediately obvious to me in reading AIP-39 how I would use it to implement it.
AIP-39 makes airflow more data-aware. So if it can support this kind of workflow that's great. @Ash Berlin-Taylor <mailto:a...@astronomer.io> do you have thoughts on how it might be compatible with this kind of thing as is?
---
* The base operator is designed so that subclasses only need to implement two methods:
- `get_high_watermark`: produce the tentative new high watermark
- `watermark_execute`: analogous to implementing poke in a sensor, this is where your work is done.
`execute` is left to the base class, and it orchestrates (1) getting last high watermark or initial load value and (2) updating new high watermark if job successful.
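
A rough sketch of that class layout, without Airflow, might look like the following. It is not the WatermarkOperator described above: `WatermarkOperatorSketch`, `CopyNewIds`, the `(low, high)` arguments to `watermark_execute`, and the in-memory `state` dict are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class WatermarkOperatorSketch(ABC):
    """Hypothetical base class; not the WatermarkOperator from the email."""

    def __init__(self, state: Dict[str, Any], key: str, initial_load_value: Any) -> None:
        self.state = state  # in-memory stand-in for persisted task state
        self.key = key
        self.initial_load_value = initial_load_value

    @abstractmethod
    def get_high_watermark(self) -> Any:
        """Produce the tentative new high watermark."""

    @abstractmethod
    def watermark_execute(self, low: Any, high: Any) -> None:
        """Do the work for the range (low, high]."""

    def execute(self) -> None:
        high = self.get_high_watermark()
        # (1) last high watermark, or the initial load value on first run
        low = self.state.get(self.key, self.initial_load_value)
        self.watermark_execute(low, high)  # an exception skips the update below
        # (2) record the new high watermark only after success
        self.state[self.key] = high


class CopyNewIds(WatermarkOperatorSketch):
    """Example subclass: copies integer ids above the watermark."""

    def __init__(self, source: List[int], sink: List[int], **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.source, self.sink = source, sink

    def get_high_watermark(self) -> int:
        return max(self.source)

    def watermark_execute(self, low: int, high: int) -> None:
        self.sink.extend(i for i in self.source if low < i <= high)


state: Dict[str, Any] = {}
source, sink = [1, 2, 3], []
op = CopyNewIds(source, sink, state=state, key="ids", initial_load_value=0)
op.execute()
source.extend([4, 5])
op.execute()
```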
