Re: [DISCUSS][AIP-39] Richer (and pluggable) schedule_interval on DAGs

Ash Berlin-Taylor Tue, 02 Mar 2021 12:10:02 -0800

Hi Kevin,

Interesting idea. My original idea was actually that "interval-less DAGs" (i.e. ones where it's just "run at this time") would not have data_interval_start or end, but (while drafting the AIP) we decided that it was probably "easier" if those values were always datetimes.

That said, I think having the DB model have those values be nullable would future-proof it without needing another migration to change it. Do you think this is worth doing now?

I haven't (yet! It's on my list) spent any significant time thinking about how to make Airflow play nicely with streaming jobs. If anyone else has ideas here please share them.


-ash

On Sat, 27 Feb, 2021 at 16:09, Kevin Yang <yrql...@gmail.com> wrote:

Hi Ash and James,
This is an exciting move. What do you think about using this opportunity to extend Airflow's support to streaming-like use cases? I.e. DAGs/tasks that want to run forever like a service. For such use cases, the schedule interval might not be meaningful, so do we want to make the date interval param optional for DagRun and task instances? That sounds like a pretty major change to the underlying model of Airflow, but this AIP is so far the best opportunity I have seen to level up Airflow's support for streaming/service use cases.
Cheers,
Kevin Y
On Fri, Feb 26, 2021 at 8:56 AM Daniel Standish <dpstand...@gmail.com> wrote:
Very excited to see this proposal come through and love the direction this has gone.
Couple comments...

*Tree view / Data completeness view*
When you design your tasks with the canonical idempotence pattern, the tree view shows you both data completeness and task execution history (success / failure etc).
When you don't use that pattern (which is my general preference), tree view is only task execution history.
This change has the potential to unlock a data completeness view for canonical tasks. It's possible that the "data completeness view" can simply be the tree view. I.e. somehow it can use these new classes to know what data was successfully filled and what data wasn't.
To the extent we like the idea of either extending / plugging / modifying tree view, or adding a distinct data completeness view, we might want to anticipate the needs of that in this change. And maybe no alteration to the proposal would be needed but just want to throw the idea out there.
*Watermark workflow / incremental processing*
A common pattern in data warehousing is pulling data incrementally from a source.
A standard way to achieve this is at the start of the task, select max `updated_at` in source table and hold on to that value for a minute. This is your tentative new high watermark. Now it's time to pull your data. If your task ran before, grab last high watermark. If not, use initial load value.
If successful, update high watermark.
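The steps above can be sketched roughly as follows. This is only an illustration using the standard-library `sqlite3` module: the `source` table layout, the `INITIAL_LOAD_VALUE` constant, the `incremental_pull` function and the plain dict used to hold state are all assumptions made for the example, not anything defined in AIP-39 or Airflow.

```python
import sqlite3

# Hypothetical initial load value, used when the task has never run before.
INITIAL_LOAD_VALUE = "1970-01-01 00:00:00"


def incremental_pull(conn, state):
    """Pull rows changed since the last high watermark.

    `state` is a plain dict standing in for whatever persists task state.
    """
    # 1. Capture the tentative new high watermark before pulling anything.
    (new_high,) = conn.execute("SELECT MAX(updated_at) FROM source").fetchone()
    if new_high is None:
        return []  # empty source table, nothing to pull

    # 2. Last high watermark if the task ran before, else the initial load value.
    low = state.get("high_watermark", INITIAL_LOAD_VALUE)

    # 3. Pull only the window (low, new_high].
    rows = conn.execute(
        "SELECT id, updated_at FROM source "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY id",
        (low, new_high),
    ).fetchall()

    # 4. Reached only if the pull did not raise: commit the new high watermark.
    state["high_watermark"] = new_high
    return rows
```

Capturing the tentative watermark first means rows that arrive while the pull is running fall into the next window rather than being skipped.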
On my team we implemented this with a stateful tasks / stateful processes concept (there's a dormant draft AIP here <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>) and a WatermarkOperator that handled the boilerplate*.
Again here, I don't have a specific suggestion at this moment. But I wanted to articulate this workflow because it is common and it wasn't immediately obvious to me in reading AIP-39 how I would use it to implement it.
AIP-39 makes airflow more data-aware. So if it can support this kind of workflow that's great. @Ash Berlin-Taylor do you have thoughts on how it might be compatible with this kind of thing as is?
---
* The base operator is designed so that subclasses only need to implement two methods:
    - `get_high_watermark`: produce the tentative new high watermark
    - `watermark_execute`: analogous to implementing poke in a sensor, this is where your work is done. `execute` is left to the base class, and it orchestrates (1) getting last high watermark or initial load value and (2) updating new high watermark if job successful.
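A minimal sketch of that division of labour, in plain Python rather than as a real Airflow operator. The class name `WatermarkBase`, the dict used as a state store, the `initial_load_value` argument and the `(low, high)` arguments passed to `watermark_execute` are assumptions for illustration; the actual WatermarkOperator is not shown in this thread.

```python
from abc import ABC, abstractmethod


class WatermarkBase(ABC):
    """Hypothetical stand-in for the base operator; not an Airflow class."""

    def __init__(self, state, initial_load_value):
        self.state = state  # dict standing in for persisted task state
        self.initial_load_value = initial_load_value

    @abstractmethod
    def get_high_watermark(self):
        """Produce the tentative new high watermark."""

    @abstractmethod
    def watermark_execute(self, low, high):
        """Do the actual work for the window (low, high]."""

    def execute(self):
        high = self.get_high_watermark()
        # (1) last high watermark, or the initial load value on first run
        low = self.state.get("high_watermark", self.initial_load_value)
        result = self.watermark_execute(low, high)  # an exception skips the update
        # (2) update the high watermark only after the work succeeded
        self.state["high_watermark"] = high
        return result


class ListSource(WatermarkBase):
    """Toy subclass: the 'source table' is a list of integers."""

    def __init__(self, rows, **kwargs):
        super().__init__(**kwargs)
        self.rows = rows

    def get_high_watermark(self):
        return max(self.rows)

    def watermark_execute(self, low, high):
        return [r for r in self.rows if low < r <= high]
```

With `ListSource([1, 2, 3], state={}, initial_load_value=0)`, the first `execute()` returns every row and later calls return only rows added since the previous successful run.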
