Thanks Vincent. I took a look and I have a general comment. I strongly think externally driven scheduling is really needed - in particular, it should be much easier for a user to "plug in" such an external event into Airflow. And there are two parts to it - as correctly stated there - pull and push.
For the pull - I think it would be great to have a kind of specialized Trigger that is started when the DAG is parsed and that generates events for DAGs. I think that is basically all that is needed: for example, I imagine a Pub/Sub trigger that subscribes to messages coming in on the Pub/Sub queue and fires an "Asset" event when a message is received. Not much controversy there. I am not sure about the polling thing, because I have always believed that when an "asyncio-native" Trigger runs in the asyncio event loop, we do not "poll" every second or so (but maybe that impression just comes from some specific triggers that actually do such a regular poll). But yes - there are polls, like running a SELECT on the DB, that cannot easily be "async-ed", so having a configurable polling time would be good there (though I am not sure - maybe it is even possible today). A rough sketch of such a trigger follows below. I think it would be really great if we had that option, because it makes it much easier to set up authorization for Airflow users - rather than setting up authorization for REST calls coming from an external system, we can utilize Airflow Connections to authorize such a Trigger to subscribe to events.

For the push proposal - as I read it, the main point is that rather than users having to write the "Airflow" way of triggering events and configuring authentication (using the REST API) to generate asset events, Airflow would natively understand external ways of pushing - effectively authorizing and mapping such incoming unauthorized requests into events that could otherwise be generated by a REST API call. I am honestly not sure this is something we want running in Airflow as an endpoint. I'd say such an unauthorized endpoint is probably not a good idea - for a variety of reasons, mostly security. As I understand it, the goal is that users can easily point a "3rd-party" notification at Airflow and get the event generated. My feeling is that while this is needed, it should be externalized from the Airflow webserver. The authorization has to be set up additionally anyway - unlike in the "pull" case, we cannot use Connections for authorizing (because it is not Airflow that authorizes in an external system - it is the other way round). So we have to set up "something extra" in Airflow to authorize the external system regardless. That could be what we have now - a user that is allowed to trigger the event. Which means that our REST API could potentially be used the same way it is now, but we would need "something" (a library, a lambda function etc.) that users could easily set up in the external system to map whatever trigger it generates natively (say, "S3 file created") to an Airflow REST API call.

As I see it, this pattern is quite common (and very practical): you deploy a cloud function or lambda that subscribes to the event emitted when an S3/GCS object is created. So it would be on the user to deploy such a lambda - but we **could** provide a library of those (say, an S3 lambda and a GCP cloud function in the respective providers), with documentation on how to set them up and how to configure authorization, and we would be generally "done" (a sketch of such a lambda also follows below). I am just not sure we need a new entity in Airflow for that (Event receiver). It feels like it asks Airflow to take on more responsibility, when we are all thinking about what to "remove" from Airflow rather than "add" to it - especially when it comes to external integrations.
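To make the pull idea a bit more concrete, here is a very rough sketch of such an asyncio-native Trigger. The names (QueueMessageTrigger, fetch_messages) are made up for illustration - a real one would use the provider's async client and take its credentials from an Airflow Connection:

```python
import asyncio
from typing import Any, AsyncIterator

from airflow.triggers.base import BaseTrigger, TriggerEvent


async def fetch_messages(subscription: str) -> list[dict[str, Any]]:
    """Hypothetical placeholder for an async pull against a message queue."""
    await asyncio.sleep(0)  # a real implementation would await the client here
    return []


class QueueMessageTrigger(BaseTrigger):
    """Yields a TriggerEvent whenever a message arrives on a subscription."""

    def __init__(self, subscription: str, poll_interval: float = 10.0):
        super().__init__()
        self.subscription = subscription
        self.poll_interval = poll_interval

    def serialize(self) -> tuple[str, dict[str, Any]]:
        # Triggers must be serializable so the triggerer can re-create them.
        return (
            "my_provider.triggers.QueueMessageTrigger",
            {"subscription": self.subscription, "poll_interval": self.poll_interval},
        )

    async def run(self) -> AsyncIterator[TriggerEvent]:
        while True:
            for message in await fetch_messages(self.subscription):
                # In the AIP-82 world this would be mapped to an Asset event
                # that schedules the DAGs depending on it.
                yield TriggerEvent(
                    {"subscription": self.subscription, "message": message}
                )
            # The configurable sleep is the "polling time" discussed above;
            # a truly push-based async client would not need it at all.
            await asyncio.sleep(self.poll_interval)
```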
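And for the push side, here is a rough sketch of the kind of lambda I mean: it maps an S3 "object created" notification to an asset event via our existing REST API. I am assuming the "create dataset event" endpoint available in recent Airflow versions, and AIRFLOW_URL / AIRFLOW_USER / AIRFLOW_PASSWORD are just placeholder environment variables for this example:

```python
import base64
import json
import os
import urllib.request


def lambda_handler(event, context):
    """Forward S3 "object created" notifications to Airflow as dataset events."""
    airflow_url = os.environ["AIRFLOW_URL"]  # e.g. https://airflow.example.com
    credentials = f"{os.environ['AIRFLOW_USER']}:{os.environ['AIRFLOW_PASSWORD']}"
    auth_header = "Basic " + base64.b64encode(credentials.encode()).decode()

    # S3 notifications can batch several records into one lambda invocation.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        payload = json.dumps(
            {"dataset_uri": f"s3://{bucket}/{key}", "extra": {"source": "s3-lambda"}}
        ).encode()
        request = urllib.request.Request(
            f"{airflow_url}/api/v1/datasets/events",
            data=payload,
            headers={"Content-Type": "application/json", "Authorization": auth_header},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            print(f"Airflow responded {response.status} for s3://{bucket}/{key}")
```

The URI would of course have to be registered as a dataset in Airflow already, and an auth backend accepting those credentials would have to be enabled - which is exactly the "create a user" part above, and the only "something extra" needed on the Airflow side.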
It feels to me that Airflow should make it easy to be triggered by such an external system and make it easy to "map" to the way we expect to get events triggered, but this mapping should be done outside of Airflow. If users searching our docs for "what do I do to externally trigger Airflow on an S3 change" can easily find either: a) configure polling in Airflow using an S3 Connection, or b) "create a user + deploy this lambda with those parameters" - that is "easy enough" and very practical as well. But maybe I am not seeing the whole picture and the real problem it is solving - so take it as a "first review pass" and "gut feeling".

J.

On Thu, Jul 25, 2024 at 10:55 PM Beck, Vincent <vincb...@amazon.com.invalid> wrote:

> Hello everyone,
>
> I created a draft AIP regarding "External event driven scheduling in
> Airflow". This proposal is about adding capability in Airflow to schedule
> DAGs based on external events. Here are some examples of such external
> events:
> - A user signs up to one of the user pool defined in my cloud provider
> - One of the databases used in my company has been updated
> - A job in my cloud provider has been executed successfully
>
> The intent of this AIP is to leverage datasets (which will be soon assets)
> and update them based on external events. I would like to propose this AIP
> for discussion and more importantly, hear some feedbacks from you :)
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
>
> Vincent