Added my comments.

In short. I think there are some cool ideas on how to implement watermarks
in the docs. The recent proposal in the doc by Hussein looks good and while
not yet complete solution it shows how we could map a "generic state"
approach into watermarks, and I have a feeling that we are at the point
that we need state for different things so we could design a generic state
API and simply use it in "Watermark specific ways" - making watermarks the
first user of sich generic State storage.

Why? We will likely need more state sharing - not only for Triggers, but
for other components of ours - it does not seem like such an API would be
complex to implement or design. Some common properties of such api I
imagine:

* type of state stored + unique id
* CRUD operations (state creation/retrieval/update/deletion)
* extensibility in the future (ability to use external storage like in Xcom)
* async awareness
* ability to be accessed via Task SDK
* ownership (team_id) in the future
* some retention mechanisms possibly built in database cleanup if we use db
(so need to keep some timestamps)

Once we have such a mechanisms, we could **easily** map it into Watermarks
and make Watermarks the first user of such an API

J.


On Thu, Jul 31, 2025 at 9:46 PM Vincent Beck <[email protected]> wrote:

> I commented on the AIP. Overall, I agree on the issue and that this is
> something we should solve. This is something I brought up in AIP-82 when I
> wrote it as "infinite scheduling issue" (without offering any solution :)).
> I also think this is a blocker to implement many different triggers based
> on non queue based assets.
>
> If some are interested in that topic, it would be great to have more
> feedbacks :)
>
> On 2025/07/28 14:16:10 Jake Roach wrote:
> > Guangyang and I have created a draft AIP, which you can find here:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/%5BDRAFT%5D+AIP-93.
> > Curious to get folks' thoughts and opinions!
> >
> >
> > On Fri, Jul 25, 2025 at 9:12 PM Guangyang Li <[email protected]> wrote:
> >
> > > Jake and I have put down the details in this AIP draft doc
> > > <
> > >
> https://docs.google.com/document/d/1gnGpTDhTpxpC48-kvr3jxL4GWyagMKzItWV-30GZu2U/edit?tab=t.0
> > > >
> > > .
> > >
> > > Instead of calling this feature state persisting, we decided to call it
> > > asset watermark,
> > > as it's mainly an enhancement of asset-oriented processing. You can
> find
> > > the details
> > > and examples in the doc. Please comment. We plan to convert it to an
> > > official AIP
> > > doc later once it's stable.
> > >
> > >
> > > Guangyang
> > >
> > > On Thu, Jun 12, 2025 at 4:48 PM Karen Braganza <
> [email protected]>
> > > wrote:
> > >
> > > > I think the process_state model would be useful in the
> HttpEventTrigger
> > > > that I am working on. The HttpEventTrigger sends requests to an API
> and
> > > > triggers an event based on a user-defined response_check function.
> If the
> > > > response_check function needs to evaluate multiple API responses
> > > > cumulatively, it would be useful to store and retrieve past API
> > > responses.
> > > > It would also be useful for task instances to retrieve the
> process_state
> > > > data. For the HttpEventTrigger, this would mean enabling task
> instances
> > > to
> > > > retrieve and act on API response data received within the trigger.
> > > >
> > > > I'm sure there would be similar use cases in other EventTriggers as
> well.
> > > >
> > > > On Thu, Jun 12, 2025 at 1:26 PM Daniel Standish
> > > > <[email protected]> wrote:
> > > >
> > > > > Alright since I was summoned...
> > > > >
> > > > > When I was an airflow user, I did a lot of incremental processes.
> > > Pretty
> > > > > much everything was incremental.  Data warehousing / analytics
> shop /
> > > > > e-commerce reporting / integrations this kind of thing.
> > > > >
> > > > > One common use case is implementing something like a fivetran,
> which I
> > > > did
> > > > > a few times.
> > > > >
> > > > > For me, execution date was almost entirely useless.  Execution
> date is
> > > > > there for partition-driven workloads.
> > > > >
> > > > > For incremental, you need to track your state somehow.
> > > > >
> > > > > That's why I experimented with various state storage interfaces,
> and
> > > > > developed a watermark operator, which we used a lot.  And I demoed
> a
> > > > > version of them here <https://github.com/apache/airflow/pull/19051
> >,
> > > and
> > > > > authored AIP-30
> > > > > <
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence
> > > > > >
> > > > > .
> > > > >
> > > > > I wrote AIP-30 when I was still contributing to Airflow for
> funsies,
> > > and
> > > > > didn't get a ton of engagement on it so it sort of languished, then
> > > when
> > > > I
> > > > > became full time airflow dev, there were other priorities.
> > > > >
> > > > > But to me the use case is still pretty obvious.  Nothing we have
> added
> > > > > since then really explicitly supports incremental workflows.
> > > > >
> > > > > To me the question is (as it was then, and I think I mentioned
> this in
> > > > the
> > > > > AIP), do you provide a generic interface where user controls
> namespace
> > > > and
> > > > > name of the state you are trying to persist?  Or instead do you
> provide
> > > > > mechanisms to store state on existing objects.  So e.g. on
> trigger, on
> > > > > task, on whatever, you can do `self.save_state(key...)` etc.  In my
> > > > > proposal I think I leaned towards generic, and it seems Jake leans
> the
> > > > same
> > > > > way.  There are pros and cons.
> > > > >
> > > > > In terms of the underlying storage mechanism, it seems pretty
> > > reasonable
> > > > to
> > > > > allow this to be pluggable like everything else.  I used different
> > > > > "backends" at different times -- s3, or database.  Typically you
> don't
> > > > need
> > > > > mega low latency with the type of tasks Airflow is used for.
> > > > >
> > > >
> > >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Reply via email to