I would love for a VOTE to get started on this one. I think most of the commenters and those who replied to this email are happy with the proposed poll-based approach.
Regarding the push-based approach, I am not convinced that the proposed implementation has any gains over what's already available with the Dataset Event Create API; the one user-to-one function mapping is an odd user experience. I'm curious to hear what others think.

On Thu, 1 Aug 2024 at 17:39, Kaxil Naik <kaxiln...@gmail.com> wrote:

> I agree with both of you that it is indeed a good idea and that it can be added in Future work -- doesn't need to be part of this AIP.
>
> On Thu, 1 Aug 2024 at 16:11, Vincent Beck <vincb...@apache.org> wrote:
>
>> Hey Pavan,
>>
>> Thanks for the interest. I was not aware of such a feature and this looks really cool! I definitely think that can be useful for Airflow, especially for testing when you can easily replay events received in the past. However, I do not think it should be part of the AIP and, as you mentioned, it should be future work or a follow-up item of the AIP. Please let me know if you (or anyone) disagree with this and we can talk about it. Otherwise I'll update the future work section of the AIP and mention this archive and replay feature.
>>
>> On 2024/08/01 01:21:58 Pavankumar Gopidesu wrote:
>> > Thanks Vincent, I took a look, this is really good. Don't have access to the confluence page to comment :) so adding it here.
>> >
>> > As events arrive-->do some work-->end.
>> >
>> > So I'm uncertain if my comment pertains to the current poll/push model or if it fits as part of future work (seen event batching).
>> >
>> > Have you given any thought to an event archival mechanism and event replay? This could significantly aid in testing and recovery of workflows, and in testing new functionality by just replaying the events. The archival mechanism I am referring to is similar to what we have today in AWS with EventBridge Archive and Replay.
>> >
>> > Regards,
>> > Pavan
>> >
>> > On Thu, Aug 1, 2024 at 1:29 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>> >
>> > > I actually did manage to take a look, thanks for the work. I am +1 on the poll-based approach -- left a comment on the push-based: I am not sure why we need a function since the create asset event API endpoint should have all the info needed for what the Asset was.
>> > >
>> > > On Thu, 1 Aug 2024 at 01:14, Kaxil Naik <kaxiln...@gmail.com> wrote:
>> > >
>> > > > Thanks Vincent, I will take a look again tomorrow.
>> > > >
>> > > > On Tue, 30 Jul 2024 at 18:47, Vincent Beck <vincb...@apache.org> wrote:
>> > > >
>> > > >> Hi everyone,
>> > > >>
>> > > >> I updated the AIP-82 given the different comments and concerns I received. I also tried to reply to all comments individually. I would really appreciate if you can do a second pass and let me know what you think. Overall, this is what I changed in the AIP:
>> > > >>
>> > > >> - Push based event-driven scheduling. I updated this section entirely because I received many concerns about the previous proposal. The overall idea now is to leverage the create asset event API endpoint to send notifications from external systems (e.g. a cloud provider) to the Airflow environment.
>> > > >>
>> > > >> - I updated the poll based event-driven scheduling DAG author experience to use a message queue scenario.
>> > > >>   I understood that this is probably the main use case we are trying to cover with this AIP, thus I used it as an example and mentioned it multiple times across the AIP.
>> > > >>
>> > > >> Thanks again for your time :)
>> > > >>
>> > > >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
>> > > >>
>> > > >> Vincent
>> > > >>
>> > > >> On 2024/07/29 15:58:23 Vincent Beck wrote:
>> > > >> > Thanks a lot all for the comments, this is very much appreciated! I received many comments from this thread and in confluence, thanks again. I'll try to address them all in the AIP and will send an email in this thread once done. I will most likely revisit the push-based approach given the number of concerns I received, thanks Jarek for proposing another solution, I'll probably go down that path.
>> > > >> >
>> > > >> > One follow-up question Vikram.
>> > > >> >
>> > > >> > > The bespoke triggerer approach completely makes sense for the long tail here, but can we do better for the 20% of scenarios which cover well over 80% of usage here is the question in my mind. Or, are you thinking of those as being covered in the "push" model?
>> > > >> >
>> > > >> > Could you share more details about what is this "20% of scenarios which cover well over 80% of usage" please?
>> > > >> >
>> > > >> > Vincent
>> > > >> >
>> > > >> > On 2024/07/29 15:37:50 Kaxil Naik wrote:
>> > > >> > > Thanks Vincent for driving these, I have added my comments to the AIP too.
>> > > >> > >
>> > > >> > > Regards,
>> > > >> > > Kaxil
>> > > >> > >
>> > > >> > > On Fri, 26 Jul 2024 at 20:16, Scheffler Jens (XC-AS/EAE-ADA-T) <jens.scheff...@de.bosch.com.invalid> wrote:
>> > > >> > >
>> > > >> > > > +1 on the comments of Vikram and Jarek, added main points on confluence
>> > > >> > > >
>> > > >> > > > ________________________________
>> > > >> > > > From: Vikram Koka <vik...@astronomer.io.INVALID>
>> > > >> > > > Sent: Friday, July 26, 2024 8:46:55 PM
>> > > >> > > > To: dev@airflow.apache.org <dev@airflow.apache.org>
>> > > >> > > > Subject: Re: [DISCUSS] External event driven scheduling in Airflow
>> > > >> > > >
>> > > >> > > > Vincent,
>> > > >> > > >
>> > > >> > > > Thanks for writing this up. The overview looks really good!
>> > > >> > > >
>> > > >> > > > I will leave my comments in the AIP as well, but at a high level they are both relatively focused on the "how", rather than the "what".
>> > > >> > > > With respect to the pull / polling approach, I completely agree that some incarnation of this is needed.
>> > > >> > > > I am less certain as to how on this part. The bespoke triggerer approach completely makes sense for the long tail here, but can we do better for the 20% of scenarios which cover well over 80% of usage here is the question in my mind. Or, are you thinking of those as being covered in the "push" model?
>> > > >> > > >
>> > > >> > > > Which leads to the "push" model approach.
>> > > >> > > > I am struggling with the same question that Jarek raised here about whether we need a new Airflow entity over and beyond the existing REST API for the same.
>> > > >> > > > I am concerned about this becoming a vector of attack on Airflow.
>> > > >> > > > I see that this is a hot topic of discussion in the Confluence doc as well, but wanted to summarize here as well, so it didn't get lost in the threads of comments.
>> > > >> > > >
>> > > >> > > > Best regards,
>> > > >> > > > Vikram
>> > > >> > > >
>> > > >> > > > On Fri, Jul 26, 2024 at 5:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> > > >> > > >
>> > > >> > > > > Thanks Vincent. I took a look and I have a general comment. I strongly think external driven scheduling is really needed - especially, it should be much easier for a user to "plug-in" such an external event to Airflow. And there are two parts of it - as correctly stated there - pull and push.
>> > > >> > > > >
>> > > >> > > > > For the pull - I think it would be great to have a kind of specialized Triggers that will be started when DAG is parsed - and those Triggers could generate the events for DAGs. I think basically that's all that is needed, for example I imagine a pubsub trigger that will subscribe to messages coming on the pubsub queue and fire "Asset" event when a message is received.
>> > > >> > > > > Not much controversy there - I am not sure about the polling thing, because I've always believed that when an "asyncio-native" Trigger is run in the asyncio event loop, we do not "poll" every second or so (but maybe this is just coming from some specific triggers that actually do such regular poll). But yes - there are polls like running select on the DB that cannot be easily "async-ed" so having a configurable polling time would be good there (but I am not sure, maybe it's even possible today). I think this would be really great if we have that option, because it makes it much easier to set up the authorization for Airflow users - rather than setting up authorization and REST calls coming from an external system, we can utilize Connections of Airflow to authorize such a Trigger to subscribe to events.
>> > > >> > > > >
>> > > >> > > > > For the push proposal - as I read the proposal, the main point behind it is, rather than users having to write the "Airflow" way of triggering events and configuring authentication (using REST API) to generate asset events, to make Airflow natively understand external ways of pushing - and effectively authorizing and mapping such incoming unauthorized requests into an event that could be generated by an API REST call.
>> > > >> > > > > I am not really sure honestly if this is something that we want as "running" in Airflow as an endpoint. I'd say such an unauthorised endpoint is probably not a good idea - for a variety of reasons, mostly security. And as I understand the goal is that users can easily point a "3rd-party" notification at Airflow and get the event generated.
>> > > >> > > > >
>> > > >> > > > > My feeling is that while this is needed - it should be externalised from the Airflow webserver. The authorization has to be set up anyway additionally - unlike in the "poll" case - we cannot use Connections for authorizing (because it's not Airflow that authorizes in an external system - it's the other way round). So we have to anyhow set up "something extra" in Airflow to authorize the external system. Which could be what we have now - a user that allows us to trigger the event. Which means that our REST API could potentially be used the same way it is now, but we will need "something" (library, lambda function etc.) that users could easily set up in the external system to map whatever trigger they generate natively (say S3 file created) to the Airflow REST API.
>> > > >> > > > >
>> > > >> > > > > As I see it - this is quite often used (and very practical): you deploy a cloud function or lambda that subscribes to the event received when an S3/GCS file is created.
>> > > >> > > > > So it would be on the user to deploy such a lambda - but we **could** provide a library of those: say s3 lambda, gcp cloud function in respective providers - with documentation how to set them up, and how to configure authorization and we would be generally "done". I am just not sure if we need a new entity in Airflow for that (Event receiver). It feels like it asks Airflow to take more responsibility, when we all think on what to "remove" from Airflow rather than "add" to it - especially when it comes to external integrations. It feels to me that Airflow should make it easy to be triggered by such an external system and make it easy to "map" to the way we expect to get events triggered, but this should be done outside of Airflow. If the users can easily find in our docs when they search "what do I do to externally trigger Airflow on S3 change": either a) configure polling in airflow using s3 Connection, or b) "create a user + deploy this lambda with those parameters" - that is "easy enough" and very practical as well.
>> > > >> > > > >
>> > > >> > > > > But maybe I am not seeing the whole picture and the real problem it's solving - so take it as a "first review pass" and "gut feeling".
>> > > >> > > > >
>> > > >> > > > > J.
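To make Jarek's "create a user + deploy this lambda" option a bit more concrete, here is a minimal sketch of what such a function could look like. It is not taken from the AIP. It assumes that the stable REST API exposes the create dataset event endpoint at /api/v1/datasets/events (taking `dataset_uri` and `extra`), that basic auth is enabled for a dedicated user, and that the DAG is scheduled on a dataset whose URI is passed in as `dataset_uri`. The base URL, user name, password and function names are all hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumption: path of the "create dataset event" endpoint in the stable REST API.
DATASET_EVENTS_PATH = "/api/v1/datasets/events"


def s3_records_to_dataset_events(s3_event, dataset_uri):
    """Map S3 'ObjectCreated' notification records to dataset event payloads."""
    payloads = []
    for record in s3_event.get("Records", []):
        event_name = record.get("eventName", "")
        if not event_name.startswith("ObjectCreated"):
            continue  # skip deletions and any other notification types
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        payloads.append({
            "dataset_uri": dataset_uri,
            "extra": {"bucket": bucket, "key": key, "event_name": event_name},
        })
    return payloads


def post_dataset_event(base_url, payload, username, password):
    """POST one payload to Airflow with basic auth and return the HTTP status."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        base_url.rstrip("/") + DATASET_EVENTS_PATH,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def handler(event, context, base_url="https://airflow.example.com",
            dataset_uri="s3://my-bucket/incoming",
            username="event-bot", password="change-me"):
    """Lambda-style entry point (all defaults are placeholders)."""
    return [
        post_dataset_event(base_url, payload, username, password)
        for payload in s3_records_to_dataset_events(event, dataset_uri)
    ]
```

The mapping is kept separate from the HTTP call so it can be checked without a running Airflow. In a real deployment the credentials would come from a secret store rather than default arguments.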
>> > > >> > > > >
>> > > >> > > > > On Thu, Jul 25, 2024 at 10:55 PM Beck, Vincent <vincb...@amazon.com.invalid> wrote:
>> > > >> > > > >
>> > > >> > > > > > Hello everyone,
>> > > >> > > > > >
>> > > >> > > > > > I created a draft AIP regarding "External event driven scheduling in Airflow". This proposal is about adding capability in Airflow to schedule DAGs based on external events. Here are some examples of such external events:
>> > > >> > > > > > - A user signs up to one of the user pools defined in my cloud provider
>> > > >> > > > > > - One of the databases used in my company has been updated
>> > > >> > > > > > - A job in my cloud provider has been executed successfully
>> > > >> > > > > >
>> > > >> > > > > > The intent of this AIP is to leverage datasets (which will soon be assets) and update them based on external events.
>> > > >> > > > > > I would like to propose this AIP for discussion and more importantly, hear some feedback from you :)
>> > > >> > > > > >
>> > > >> > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
>> > > >> > > > > >
>> > > >> > > > > > Vincent
>> > > >> >
>> > > >> > ---------------------------------------------------------------------
>> > > >> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
>> > > >> > For additional commands, e-mail: dev-h...@airflow.apache.org