Hi everyone,

I updated AIP-82 to address the comments and concerns I received, and I tried to reply to each comment individually. I would really appreciate it if you could do a second pass and let me know what you think. Overall, this is what I changed in the AIP:
- Push based event-driven scheduling. I updated this section entirely because I received many concerns about the previous proposal. The overall idea now is to leverage the create asset event API endpoint to send notifications from external systems (e.g. a cloud provider) to the Airflow environment.
- I updated the poll based event-driven scheduling DAG author experience to use a message queue scenario. I understand that this is probably the main use case we are trying to cover with this AIP, so I used it as an example and mention it multiple times across the AIP.

Thanks again for your time :)

https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow

Vincent

On 2024/07/29 15:58:23 Vincent Beck wrote:
> Thanks a lot all for the comments, this is very much appreciated! I received
> many comments from this thread and in confluence, thanks again. I'll try to
> address them all in the AIP and will send an email in this thread once done.
> I will most likely revisit the push-based approach given the number of
> concerns I received, thanks Jarek for proposing another solution, I'll
> probably go down that path.
>
> One follow-up question Vikram.
>
> > The bespoke triggerer approach completely makes sense for the long tail
> > here, but can we do better for the 20% of scenarios which cover well over
> > 80% of usage here is the question in my mind. Or, are you thinking of those
> > as being covered in the "push" model?
>
> Could you share more details about what is this "20% of scenarios which cover
> well over 80% of usage" please?
>
> Vincent
>
> On 2024/07/29 15:37:50 Kaxil Naik wrote:
> > Thanks Vincent for driving these, I have added my comments to the AIP too.
> >
> > Regards,
> > Kaxil
> >
> > On Fri, 26 Jul 2024 at 20:16, Scheffler Jens (XC-AS/EAE-ADA-T)
> > <jens.scheff...@de.bosch.com.invalid> wrote:
> >
> > > +1 on the comments of Vikram and Jarek, added main points on confluence
> > >
> > > Sent from Outlook for iOS<https://aka.ms/o0ukef>
> > > ________________________________
> > > From: Vikram Koka <vik...@astronomer.io.INVALID>
> > > Sent: Friday, July 26, 2024 8:46:55 PM
> > > To: dev@airflow.apache.org <dev@airflow.apache.org>
> > > Subject: Re: [DISCUSS] External event driven scheduling in Airflow
> > >
> > > Vincent,
> > >
> > > Thanks for writing this up. The overview looks really good!
> > >
> > > I will leave my comments in the AIP as well, but at a high level they are
> > > both relatively focused on the "how", rather than the "what".
> > > With respect to the pull / polling approach, I completely agree that some
> > > incarnation of this is needed.
> > > I am less certain as to how on this part. The bespoke triggerer approach
> > > completely makes sense for the long tail here, but can we do better for
> > > the 20% of scenarios which cover well over 80% of usage here is the
> > > question in my mind. Or, are you thinking of those as being covered in
> > > the "push" model?
> > >
> > > Which leads to the "push" model approach.
> > > I am struggling with the same question that Jarek raised here about
> > > whether we need a new Airflow entity over and beyond the existing REST
> > > API for the same.
> > > I am concerned about this becoming a vector of attack on Airflow.
> > > I see that this is a hot topic of discussion in the Confluence doc as
> > > well, but wanted to summarize here as well, so it didn't get lost in the
> > > threads of comments.
> > >
> > > Best regards,
> > > Vikram
> > >
> > >
> > > On Fri, Jul 26, 2024 at 5:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > Thanks Vincent. I took a look and I have a general comment.
> > > > I strongly think external driven scheduling is really needed -
> > > > especially, it should be much easier for a user to "plug-in" such an
> > > > external event to Airflow. And there are two parts of it - as correctly
> > > > stated there - pull and push.
> > > >
> > > > For the pull - I think it would be great to have a kind of specialized
> > > > Triggers that will be started when DAG is parsed - and those Triggers
> > > > could generate the events for DAGs. I think basically that's all that
> > > > is needed, for example I imagine a pubsub trigger that will subscribe
> > > > to messages coming on the pubsub queue and fire "Asset" event when a
> > > > message is received. Not much controversy there - I am not sure about
> > > > the polling thing, because I've always believed that when
> > > > "asyncio-native" Trigger is run in the asyncio event loop, we do not
> > > > "poll" every second or so (but maybe this is just coming from some
> > > > specific triggers that actually do such regular poll). But yes - there
> > > > are polls like running select on the DB that cannot be easily
> > > > "async-ed" so having a configurable polling time would be good there
> > > > (but I am not sure maybe it's even possible today). I think this would
> > > > be really great if we have that option, because it makes it much easier
> > > > to set up the authorization for Airflow users - rather than setting up
> > > > authorization and REST calls coming from an external system, we can
> > > > utilize Connections of Airflow to authorize such a Trigger to subscribe
> > > > to events.
> > > >
> > > > For the push proposal - as I read the proposal, the main point behind
> > > > it is rather than users having to write "Airflow" way of triggering
> > > > events and configuring authentication (using REST API) to generate
> > > > asset events, is to make Airflow natively understand external ways of
> > > > pushing - and effectively authorizing and mapping such incoming
> > > > unauthorized requests into event that could be generated by an API REST
> > > > call.
> > > > I am not really sure honestly if this is something that we want as
> > > > "running" in Airflow as an endpoint. I'd say such an unauthorised
> > > > endpoint is probably not a good idea - for a variety of reasons, mostly
> > > > security. And as I understand the goal is that users can easily point
> > > > at "3rd-party" notification to Airflow and get the event generated.
> > > >
> > > > My feeling is that while this is needed - it should be externalised
> > > > from Airflow webserver. The authorization has to be set up anyway
> > > > additionally - unlike in "poll" case - we cannot use Connections for
> > > > authorizing (because it's not Airflow that authorizes in an external
> > > > system - it's the other way round). So we have to anyhow setup
> > > > "something extra" in Airflow to authorize the external system. Which
> > > > could be what we have now - user that allows us to trigger the event.
> > > > Which means that our REST API could potentially be used the same way it
> > > > is now, but we will need "something" (library, lambda function etc.)
> > > > that users could easily setup in the external system to map whatever
> > > > trigger they generate natively (say S3 file created) to Airflow REST
> > > > API.
> > > >
> > > > As I see it - this is quite often used (and very practical): you deploy
> > > > a cloud function or lambda that subscribes on the event received when
> > > > an S3/GCS file is created.
> > > > So it would be on the user to deploy such a lambda - but we **could**
> > > > provide a library of those: say s3 lambda, gcp cloud function in
> > > > respective providers - with documentation how to set them up, and how
> > > > to configure authorization and we would be generally "done". I am just
> > > > not sure if we need a new entity in Airflow for that (Event receiver).
> > > > It feels like it asks Airflow to take more responsibility, when we all
> > > > think on what to "remove" from Airflow rather than "add" to it -
> > > > especially when it comes to external integrations. It feels to me that
> > > > Airflow should make it easy to be triggered by such an external system
> > > > and make it easy to "map" to the way we expect to get events triggered,
> > > > but this should be done outside of Airflow. If the users can easily
> > > > find in our docs when they search "what do I do to externally trigger
> > > > Airflow on S3 change": either a) configure polling in airflow using s3
> > > > Connection, or b) "create a user + deploy this lambda with those
> > > > parameters" - that is "easy enough" and very practical as well.
> > > >
> > > > But maybe I am not seeing the whole picture and the real problem it's
> > > > solving - so take it as a "first review pass" and "gut feeling".
> > > >
> > > > J.
> > > >
> > > >
> > > > On Thu, Jul 25, 2024 at 10:55 PM Beck, Vincent
> > > > <vincb...@amazon.com.invalid> wrote:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > I created a draft AIP regarding "External event driven scheduling in
> > > > > Airflow". This proposal is about adding capability in Airflow to
> > > > > schedule DAGs based on external events.
> > > > > Here are some examples of such external events:
> > > > > - A user signs up to one of the user pools defined in my cloud
> > > > >   provider
> > > > > - One of the databases used in my company has been updated
> > > > > - A job in my cloud provider has been executed successfully
> > > > >
> > > > > The intent of this AIP is to leverage datasets (which will soon be
> > > > > assets) and update them based on external events. I would like to
> > > > > propose this AIP for discussion and more importantly, hear some
> > > > > feedback from you :)
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
> > > > >
> > > > > Vincent
> > > > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org