I would love for the VOTE to get started on this one. I think most of the
commenters, and those who replied to this email, are happy with the proposed
poll-based approach.

Regarding the push-based approach, I am not convinced that the proposed
implementation has any gains over what's already available with the Dataset
Event Create API; the one user-to-one function mapping is an odd user
experience. I'm curious to hear what others think.
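
For reference, here is a minimal sketch of what the existing route could look
like from an external system: map a native S3 notification to a dataset URI and
build a call to the Dataset Event Create API. It assumes the stable REST API
path /api/v1/datasets/events with a "dataset_uri" and "extra" body, and that the
dataset is already known to Airflow. The function names, base URL, and
Authorization value are hypothetical placeholders, not part of the AIP.

```python
import json
import urllib.request
from urllib.parse import unquote_plus


def s3_records_to_dataset_uris(event: dict) -> list[str]:
    """Map an S3 notification payload to s3://bucket/key URIs.

    Object keys arrive URL-encoded in S3 notifications, so decode them.
    """
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        uris.append(f"s3://{bucket}/{key}")
    return uris


def build_dataset_event_request(
    base_url: str, dataset_uri: str, extra: dict, auth_header: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to the dataset event endpoint.

    Assumption: the endpoint path and body fields below match the deployed
    Airflow version; the auth scheme depends on the deployment.
    """
    body = json.dumps({"dataset_uri": dataset_uri, "extra": extra}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/datasets/events",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )
```

An external function (for example a lambda) would call
urllib.request.urlopen() on each request it builds; the only Airflow-side setup
is a user that is allowed to create dataset events.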

On Thu, 1 Aug 2024 at 17:39, Kaxil Naik <kaxiln...@gmail.com> wrote:

> I agree with both of you that it is indeed a good idea and that it can be
> added in Future work -- doesn't need to be part of this AIP.
>
> On Thu, 1 Aug 2024 at 16:11, Vincent Beck <vincb...@apache.org> wrote:
>
>> Hey Pavan,
>>
>> Thanks for the interest. I was not aware of such a feature and this looks
>> really cool! I definitely think that can be useful for Airflow, especially
>> for testing when you can easily replay events received in the past.
>> However, I do not think it should be part of the AIP and, as you mentioned,
>> it should be future work or a follow-up item of the AIP. Please let me
>> know if you (or anyone) disagree with this and we can talk about it.
>> Otherwise I'll update the future work section of the AIP and mention this
>> archive and replay feature.
>>
>> On 2024/08/01 01:21:58 Pavankumar Gopidesu wrote:
>> > Thanks Vincent, I took a look and this is really good. I don't have
>> > access to the Confluence page to comment :) so I'm adding my comment here.
>> >
>> > The flow is: events arrive --> do some work --> end.
>> >
>> > So I'm uncertain whether my comment pertains to the current poll/push
>> > model or whether it belongs in future work (I saw event batching there).
>> >
>> > Have you given any thought to an event archival mechanism and event
>> > replay? This could significantly aid testing and recovery of workflows,
>> > and testing new functionality, by simply replaying the events. The
>> > archival mechanism I am referring to is similar to what AWS offers today
>> > with EventBridge Archive and Replay.
>> >
>> > Regards,
>> > Pavan
>> >
>> > On Thu, Aug 1, 2024 at 1:29 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>> >
>> > > I actually did manage to take a look, thanks for the work. I am +1 on
>> > > the poll-based approach. I left a comment on the push-based one: I am
>> > > not sure why we need a function, since the create asset event API
>> > > endpoint should have all the info needed about what the Asset was.
>> > >
>> > > On Thu, 1 Aug 2024 at 01:14, Kaxil Naik <kaxiln...@gmail.com> wrote:
>> > >
>> > > > Thanks Vincent, I will take a look again tomorrow.
>> > > >
>> > > > On Tue, 30 Jul 2024 at 18:47, Vincent Beck <vincb...@apache.org> wrote:
>> > > >
>> > > >> Hi everyone,
>> > > >>
>> > > >> I updated AIP-82 based on the different comments and concerns I
>> > > >> received. I also tried to reply to all comments individually. I would
>> > > >> really appreciate it if you could do a second pass and let me know
>> > > >> what you think. Overall, this is what I changed in the AIP:
>> > > >>
>> > > >> - Push-based event-driven scheduling. I updated this section entirely
>> > > >> because I received many concerns about the previous proposal. The
>> > > >> overall idea now is to leverage the create asset event API endpoint to
>> > > >> send notifications from external systems (e.g. a cloud provider) to
>> > > >> the Airflow environment.
>> > > >>
>> > > >> - I updated the poll-based event-driven scheduling DAG author
>> > > >> experience to use a message queue scenario. I understood that this is
>> > > >> probably the main use case we are trying to cover with this AIP, so I
>> > > >> used it as the example and mentioned it multiple times across the AIP.
>> > > >>
>> > > >> Thanks again for your time :)
>> > > >>
>> > > >>
>> > > >>
>> > > >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
>> > > >>
>> > > >> Vincent
>> > > >>
>> > > >> On 2024/07/29 15:58:23 Vincent Beck wrote:
>> > > >> > Thanks a lot all for the comments, this is very much appreciated! I
>> > > >> > received many comments in this thread and in Confluence, thanks
>> > > >> > again. I'll try to address them all in the AIP and will send an email
>> > > >> > in this thread once done. I will most likely revisit the push-based
>> > > >> > approach given the number of concerns I received. Thanks Jarek for
>> > > >> > proposing another solution, I'll probably go down that path.
>> > > >> >
>> > > >> > One follow-up question, Vikram.
>> > > >> >
>> > > >> > > The bespoke triggerer approach completely makes sense for the long
>> > > >> > > tail here, but can we do better for the 20% of scenarios which
>> > > >> > > cover well over 80% of usage here is the question in my mind. Or,
>> > > >> > > are you thinking of those as being covered in the "push" model?
>> > > >> >
>> > > >> > Could you share more details about what these "20% of scenarios
>> > > >> > which cover well over 80% of usage" are, please?
>> > > >> >
>> > > >> > Vincent
>> > > >> >
>> > > >> > On 2024/07/29 15:37:50 Kaxil Naik wrote:
>> > > >> > > Thanks Vincent for driving these, I have added my comments to the
>> > > >> > > AIP too.
>> > > >> > >
>> > > >> > > Regards,
>> > > >> > > Kaxil
>> > > >> > >
>> > > >> > > On Fri, 26 Jul 2024 at 20:16, Scheffler Jens (XC-AS/EAE-ADA-T)
>> > > >> > > <jens.scheff...@de.bosch.com.invalid> wrote:
>> > > >> > >
>> > > >> > > > +1 on the comments of Vikram and Jarek; I added my main points on
>> > > >> > > > Confluence.
>> > > >> > > >
>> > > >> > > > ________________________________
>> > > >> > > > From: Vikram Koka <vik...@astronomer.io.INVALID>
>> > > >> > > > Sent: Friday, July 26, 2024 8:46:55 PM
>> > > >> > > > To: dev@airflow.apache.org <dev@airflow.apache.org>
>> > > >> > > > Subject: Re: [DISCUSS] External event driven scheduling in Airflow
>> > > >> > > >
>> > > >> > > > Vincent,
>> > > >> > > >
>> > > >> > > > Thanks for writing this up. The overview looks really good!
>> > > >> > > >
>> > > >> > > > I will leave my comments in the AIP as well, but at a high level
>> > > >> > > > they are both relatively focused on the "how", rather than the
>> > > >> > > > "what".
>> > > >> > > > With respect to the pull / polling approach, I completely agree
>> > > >> > > > that some incarnation of this is needed.
>> > > >> > > > I am less certain about the "how" on this part. The bespoke
>> > > >> > > > triggerer approach completely makes sense for the long tail here,
>> > > >> > > > but can we do better for the 20% of scenarios which cover well
>> > > >> > > > over 80% of usage here is the question in my mind. Or, are you
>> > > >> > > > thinking of those as being covered in the "push" model?
>> > > >> > > >
>> > > >> > > > Which leads to the "push" model approach.
>> > > >> > > > I am struggling with the same question that Jarek raised here
>> > > >> > > > about whether we need a new Airflow entity over and beyond the
>> > > >> > > > existing REST API for the same purpose.
>> > > >> > > > I am concerned about this becoming a vector of attack on Airflow.
>> > > >> > > > I see that this is a hot topic of discussion in the Confluence doc
>> > > >> > > > as well, but I wanted to summarize it here too, so it doesn't get
>> > > >> > > > lost in the threads of comments.
>> > > >> > > >
>> > > >> > > > Best regards,
>> > > >> > > > Vikram
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > On Fri, Jul 26, 2024 at 5:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> > > >> > > >
>> > > >> > > > > Thanks Vincent. I took a look and I have a general comment. I
>> > > >> > > > > strongly think externally driven scheduling is really needed -
>> > > >> > > > > especially, it should be much easier for a user to "plug in"
>> > > >> > > > > such an external event to Airflow. And there are two parts to
>> > > >> > > > > it - as correctly stated there - pull and push.
>> > > >> > > > >
>> > > >> > > > > For the pull - I think it would be great to have a kind of
>> > > >> > > > > specialized Triggers that will be started when a DAG is parsed -
>> > > >> > > > > and those Triggers could generate the events for DAGs. I think
>> > > >> > > > > basically that's all that is needed; for example, I imagine a
>> > > >> > > > > pubsub trigger that will subscribe to messages coming on the
>> > > >> > > > > pubsub queue and fire an "Asset" event when a message is
>> > > >> > > > > received. Not much controversy there. I am not sure about the
>> > > >> > > > > polling thing, because I've always believed that when an
>> > > >> > > > > "asyncio-native" Trigger is run in the asyncio event loop, we do
>> > > >> > > > > not "poll" every second or so (but maybe this is just coming
>> > > >> > > > > from some specific triggers that actually do such a regular
>> > > >> > > > > poll). But yes - there are polls, like running a select on the
>> > > >> > > > > DB, that cannot be easily "async-ed", so having a configurable
>> > > >> > > > > polling time would be good there (but I am not sure, maybe it's
>> > > >> > > > > even possible today). I think it would be really great if we
>> > > >> > > > > have that option, because it makes it much easier to set up the
>> > > >> > > > > authorization for Airflow users - rather than setting up
>> > > >> > > > > authorization and REST calls coming from an external system, we
>> > > >> > > > > can utilize Airflow Connections to authorize such a Trigger to
>> > > >> > > > > subscribe to events.
>> > > >> > > > >
>> > > >> > > > > For the push proposal - as I read it, the main point behind it
>> > > >> > > > > is that, rather than users having to write the "Airflow" way of
>> > > >> > > > > triggering events and configuring authentication (using the
>> > > >> > > > > REST API) to generate asset events, Airflow would natively
>> > > >> > > > > understand external ways of pushing - effectively authorizing
>> > > >> > > > > and mapping such incoming unauthorized requests into an event
>> > > >> > > > > that could be generated by a REST API call.
>> > > >> > > > > Honestly, I am not really sure if this is something that we
>> > > >> > > > > want "running" in Airflow as an endpoint. I'd say such an
>> > > >> > > > > unauthorised endpoint is probably not a good idea - for a
>> > > >> > > > > variety of reasons, mostly security. And as I understand it,
>> > > >> > > > > the goal is that users can easily point a "3rd-party"
>> > > >> > > > > notification at Airflow and get the event generated.
>> > > >> > > > >
>> > > >> > > > > My feeling is that while this is needed, it should be
>> > > >> > > > > externalised from the Airflow webserver. The authorization has
>> > > >> > > > > to be set up additionally anyway - unlike in the "poll" case, we
>> > > >> > > > > cannot use Connections for authorizing (because it's not
>> > > >> > > > > Airflow that authorizes in an external system - it's the other
>> > > >> > > > > way round). So we have to set up "something extra" in Airflow
>> > > >> > > > > anyhow to authorize the external system. That could be what we
>> > > >> > > > > have now - a user that allows us to trigger the event. This
>> > > >> > > > > means that our REST API could potentially be used the same way
>> > > >> > > > > it is now, but we will need "something" (a library, lambda
>> > > >> > > > > function etc.) that users could easily set up in the external
>> > > >> > > > > system to map whatever trigger they generate natively (say, S3
>> > > >> > > > > file created) to the Airflow REST API.
>> > > >> > > > >
>> > > >> > > > > As I see it, this is quite often used (and very practical): you
>> > > >> > > > > deploy a cloud function or lambda that subscribes to the event
>> > > >> > > > > received when an S3/GCS file is created. So it would be on the
>> > > >> > > > > user to deploy such a lambda - but we **could** provide a
>> > > >> > > > > library of those: say an S3 lambda or a GCP cloud function in
>> > > >> > > > > the respective providers - with documentation on how to set
>> > > >> > > > > them up and how to configure authorization, and we would be
>> > > >> > > > > generally "done". I am just not sure if we need a new entity in
>> > > >> > > > > Airflow for that (Event receiver). It feels like it asks
>> > > >> > > > > Airflow to take on more responsibility, when we are all
>> > > >> > > > > thinking about what to "remove" from Airflow rather than "add"
>> > > >> > > > > to it - especially when it comes to external integrations. It
>> > > >> > > > > feels to me that Airflow should make it easy to be triggered by
>> > > >> > > > > such an external system and make it easy to "map" to the way we
>> > > >> > > > > expect to get events triggered, but this should be done outside
>> > > >> > > > > of Airflow. If users can easily find in our docs, when they
>> > > >> > > > > search "what do I do to externally trigger Airflow on S3
>> > > >> > > > > change", either a) configure polling in Airflow using an S3
>> > > >> > > > > Connection, or b) "create a user + deploy this lambda with
>> > > >> > > > > those parameters" - that is "easy enough" and very practical as
>> > > >> > > > > well.
>> > > >> > > > >
>> > > >> > > > > But maybe I am not seeing the whole picture and the real problem
>> > > >> > > > > it's solving - so take it as a "first review pass" and a "gut
>> > > >> > > > > feeling".
>> > > >> > > > >
>> > > >> > > > > J.
>> > > >> > > > >
>> > > >> > > > >
>> > > >> > > > >
>> > > >> > > > >
>> > > >> > > > > On Thu, Jul 25, 2024 at 10:55 PM Beck, Vincent <vincb...@amazon.com.invalid> wrote:
>> > > >> > > > >
>> > > >> > > > > > Hello everyone,
>> > > >> > > > > >
>> > > >> > > > > > I created a draft AIP regarding "External event driven
>> > > >> > > > > > scheduling in Airflow". This proposal is about adding the
>> > > >> > > > > > capability in Airflow to schedule DAGs based on external
>> > > >> > > > > > events. Here are some examples of such external events:
>> > > >> > > > > > - A user signs up to one of the user pools defined in my
>> > > >> > > > > > cloud provider
>> > > >> > > > > > - One of the databases used in my company has been updated
>> > > >> > > > > > - A job in my cloud provider has been executed successfully
>> > > >> > > > > >
>> > > >> > > > > > The intent of this AIP is to leverage datasets (which will
>> > > >> > > > > > soon be assets) and update them based on external events. I
>> > > >> > > > > > would like to propose this AIP for discussion and, more
>> > > >> > > > > > importantly, hear some feedback from you :)
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
>> > > >> > > > > >
>> > > >> > > > > > Vincent
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >> >
>> ---------------------------------------------------------------------
>> > > >> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
>> > > >> > For additional commands, e-mail: dev-h...@airflow.apache.org
>> > > >> >
>> > > >> >
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > >
>> >
>>
>>
>>
