Thanks Vincent for driving these. I have added my comments to the AIP too.

Regards,
Kaxil

On Fri, 26 Jul 2024 at 20:16, Scheffler Jens (XC-AS/EAE-ADA-T)
<jens.scheff...@de.bosch.com.invalid> wrote:

> +1 on the comments of Vikram and Jarek; I added my main points on Confluence.
>
> ________________________________
> From: Vikram Koka <vik...@astronomer.io.INVALID>
> Sent: Friday, July 26, 2024 8:46:55 PM
> To: dev@airflow.apache.org <dev@airflow.apache.org>
> Subject: Re: [DISCUSS] External event driven scheduling in Airflow
>
> Vincent,
>
> Thanks for writing this up. The overview looks really good!
>
> I will leave my comments in the AIP as well, but at a high level they are
> both relatively focused on the "how", rather than the "what".
> With respect to the pull / polling approach, I completely agree that some
> incarnation of this is needed.
> I am less certain about the "how" of this part. The bespoke triggerer
> approach completely makes sense for the long tail here, but the question in
> my mind is whether we can do better for the 20% of scenarios that cover well
> over 80% of usage. Or are you thinking of those as being covered by the
> "push" model?
>
> Which leads to the "push" model approach.
> I am struggling with the same question that Jarek raised here about whether
> we need a new Airflow entity over and above the existing REST API for the
> same.
> I am concerned about this becoming an attack vector on Airflow.
> I see that this is a hot topic of discussion in the Confluence doc as well,
> but I wanted to summarize it here too, so it doesn't get lost in the threads
> of comments.
>
> Best regards,
> Vikram
>
>
> On Fri, Jul 26, 2024 at 5:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > Thanks Vincent. I took a look and I have a general comment. I
> > strongly think externally driven scheduling is really needed - especially,
> > it should be much easier for a user to "plug in" such an external event
> > into Airflow. And there are two parts to it - as correctly stated there -
> > pull and push.
> >
> > For the pull - I think it would be great to have a kind of specialized
> > Trigger that will be started when a DAG is parsed - and those Triggers
> > could generate the events for DAGs. I think basically that's all that is
> > needed; for example, I imagine a pubsub trigger that will subscribe to
> > messages coming on the pubsub queue and fire an "Asset" event when a
> > message is received. Not much controversy there - I am not sure about the
> > polling thing, because I've always believed that when an "asyncio-native"
> > Trigger is run in the asyncio event loop, we do not "poll" every second or
> > so (but maybe this is just coming from some specific triggers that
> > actually do such a regular poll). But yes - there are polls, like running
> > a select on the DB, that cannot be easily "async-ed", so having a
> > configurable polling time would be good there (but I am not sure, maybe
> > it's even possible today). I think it would be really great if we had that
> > option, because it makes it much easier to set up the authorization for
> > Airflow users - rather than setting up authorization and REST calls coming
> > from an external system, we can utilize Airflow Connections to authorize
> > such a Trigger to subscribe to events.
> >
> > For the push proposal - as I read it, the main point behind it is that,
> > rather than users having to write the "Airflow" way of triggering events
> > and configuring authentication (using the REST API) to generate asset
> > events, Airflow should natively understand external ways of pushing - and
> > effectively authorize and map such incoming unauthorized requests into
> > events that could be generated by a REST API call.
> > I am honestly not really sure if this is something that we want "running"
> > in Airflow as an endpoint. I'd say such an unauthorised endpoint is
> > probably not a good idea - for a variety of reasons, mostly security.
> > And as I understand it, the goal is that users can easily point a
> > "3rd-party" notification at Airflow and get the event generated.
> >
> > My feeling is that while this is needed - it should be externalised from
> > the Airflow webserver. The authorization has to be set up additionally
> > anyway - unlike in the "poll" case - we cannot use Connections for
> > authorizing (because it's not Airflow that authorizes in an external
> > system - it's the other way round). So we have to set up "something extra"
> > in Airflow anyhow to authorize the external system. Which could be what we
> > have now - a user that allows us to trigger the event. Which means that
> > our REST API could potentially be used the same way it is now, but we will
> > need "something" (library, lambda function etc.) that users could easily
> > set up in the external system to map whatever trigger they generate
> > natively (say S3 file created) to the Airflow REST API.
> >
> > As I see it - this is quite often used (and very practical): you deploy a
> > cloud function or lambda that subscribes to the event received when an
> > S3/GCS file is created. So it would be on the user to deploy such a
> > lambda - but we **could** provide a library of those: say an S3 lambda or
> > a GCP cloud function in the respective providers - with documentation on
> > how to set them up and how to configure authorization, and we would be
> > generally "done". I am just not sure if we need a new entity in Airflow
> > for that (Event receiver). It feels like it asks Airflow to take on more
> > responsibility, when we are all thinking about what to "remove" from
> > Airflow rather than "add" to it - especially when it comes to external
> > integrations. It feels to me that Airflow should make it easy to be
> > triggered by such an external system and make it easy to "map" to the way
> > we expect to get events triggered, but this should be done outside of
> > Airflow. If users can easily find in our docs, when they search "what do I
> > do to externally trigger Airflow on S3 change": either a) configure
> > polling in Airflow using an S3 Connection, or b) "create a user + deploy
> > this lambda with those parameters" - that is "easy enough" and very
> > practical as well.
> >
> > But maybe I am not seeing the whole picture and the real problem it's
> > solving - so take it as a "first review pass" and a "gut feeling".
> >
> > J.
> >
> >
> >
> >
> > On Thu, Jul 25, 2024 at 10:55 PM Beck, Vincent
> > <vincb...@amazon.com.invalid> wrote:
> >
> > > Hello everyone,
> > >
> > > I created a draft AIP regarding "External event driven scheduling in
> > > Airflow". This proposal is about adding the capability in Airflow to
> > > schedule DAGs based on external events. Here are some examples of such
> > > external events:
> > > - A user signs up to one of the user pools defined in my cloud provider
> > > - One of the databases used in my company has been updated
> > > - A job in my cloud provider has been executed successfully
> > >
> > > The intent of this AIP is to leverage datasets (which will soon be
> > > assets) and update them based on external events. I would like to propose
> > > this AIP for discussion and, more importantly, hear some feedback from
> > > you :)
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
> > >
> > > Vincent
> > >
> >
>
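
For readers of the archive, the "pull" idea discussed in the thread (a trigger
that subscribes to a queue and fires an "Asset" event per message, without
polling every second) could look roughly like the sketch below. It is plain
asyncio and deliberately does not use the Airflow Trigger API: `AssetEvent`,
`queue_trigger`, the `None` sentinel and the `asyncio.Queue` standing in for a
pubsub subscription are all hypothetical names chosen for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class AssetEvent:
    """Hypothetical event record for this sketch; not an Airflow class."""
    asset_uri: str
    extra: dict = field(default_factory=dict)


async def queue_trigger(queue: asyncio.Queue, asset_uri: str) -> AsyncIterator[AssetEvent]:
    """Yield one AssetEvent per message, awaiting the queue instead of polling."""
    while True:
        message = await queue.get()  # suspends until a message arrives
        if message is None:  # sentinel used only to end this demo
            return
        yield AssetEvent(asset_uri=asset_uri, extra={"message": message})


async def demo() -> list:
    # An in-memory queue stands in for a real pubsub subscription (assumption).
    queue: asyncio.Queue = asyncio.Queue()
    for item in ("order-1", "order-2", None):
        queue.put_nowait(item)
    return [event async for event in queue_trigger(queue, "pubsub://orders")]


events = asyncio.run(demo())
```

The point of the sketch is only that the coroutine sleeps inside `await
queue.get()` until something arrives, so no polling interval is involved; a
source that cannot be awaited (such as a periodic DB select) would still need
a configurable sleep between checks.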

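
The "push" alternative suggested in the thread (a small lambda deployed by the
user that maps an "S3 file created" notification to the existing Airflow REST
API) might be sketched as follows. The function only builds the requests and
sends nothing. The function name, the example URL and the credentials
placeholder are hypothetical, and the `/api/v1/datasets/events` endpoint with
a `dataset_uri` field is an assumption that should be checked against the REST
API reference of the Airflow version in use.

```python
import json
from urllib.parse import unquote_plus
from urllib.request import Request


def s3_event_to_requests(event: dict, airflow_url: str, auth_header: str) -> list:
    """Build one Airflow REST request per created S3 object (nothing is sent)."""
    built_requests = []
    for record in event.get("Records", []):
        # Only react to object-creation notifications.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # S3 URL-encodes keys
        body = {"dataset_uri": f"s3://{bucket}/{key}", "extra": {"source": "s3"}}
        built_requests.append(
            Request(
                url=f"{airflow_url}/api/v1/datasets/events",  # assumed endpoint
                data=json.dumps(body).encode(),
                headers={
                    "Content-Type": "application/json",
                    "Authorization": auth_header,  # the Airflow user set up for this
                },
                method="POST",
            )
        )
    return built_requests


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "data/2024-07-26.csv"},
            },
        }
    ]
}
built = s3_event_to_requests(
    sample_event, "https://airflow.example.com", "Basic <credentials>"
)
```

In a real deployment each request would be sent with `urllib.request.urlopen`
(or an HTTP client of choice) from inside the lambda handler, using the
credentials of the Airflow user created for this purpose.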