Vincent,

Thanks for writing this up. The overview looks really good!

I will leave my comments in the AIP as well, but at a high level both of
them focus on the "how" rather than the "what".
With respect to the pull / polling approach, I completely agree that some
incarnation of this is needed.
I am less certain about the "how" on this part. The bespoke triggerer
approach makes complete sense for the long tail here, but the question in
my mind is whether we can do better for the 20% of scenarios which cover
well over 80% of usage. Or are you thinking of those as being covered in
the "push" model?

Which leads to the "push" model approach.
I am struggling with the same question that Jarek raised about whether we
need a new Airflow entity over and beyond the existing REST API.
I am also concerned about this becoming a vector of attack on Airflow.
I see that this is a hot topic of discussion in the Confluence doc, but I
wanted to summarize it here so it doesn't get lost in the threads of
comments.
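
For context, here is the existing REST API path in question - a minimal
sketch, not a recommendation (the host, credentials, and dataset URI are
made up; it assumes an Airflow 2.9+ deployment where the "create dataset
event" endpoint and basic auth are available):

    import requests

    # Hypothetical deployment details - adjust host, auth, and URI.
    AIRFLOW_API = "https://airflow.example.com/api/v1"

    resp = requests.post(
        f"{AIRFLOW_API}/datasets/events",
        auth=("external-events-user", "password"),
        json={
            "dataset_uri": "s3://my-bucket/my-key",
            "extra": {"source": "external-system"},
        },
    )
    resp.raise_for_status()

Anything new - an entity, an endpoint - should justify itself against
that baseline.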

Best regards,
Vikram


On Fri, Jul 26, 2024 at 5:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Thanks Vincent. I took a look and I have a general comment. I
> strongly think externally driven scheduling is really needed - in
> particular, it should be much easier for a user to "plug" such an
> external event into Airflow. And there are two parts to it - as correctly
> stated there - pull and push.
>
> For the pull - I think it would be great to have a kind of specialized
> Trigger that is started when a DAG is parsed - and those Triggers could
> generate the events for DAGs. I think that's basically all that is
> needed. For example, I imagine a pubsub trigger that subscribes to
> messages coming on the pubsub queue and fires an "Asset" event when a
> message is received. Not much controversy there. I am not sure about the
> polling thing, because I've always believed that when an "asyncio-native"
> Trigger is run in the asyncio event loop, we do not "poll" every second
> or so (but maybe this impression just comes from some specific triggers
> that actually do such regular polls). But yes - there are polls, like
> running a select on the DB, that cannot be easily "async-ed", so having a
> configurable polling time would be good there (though I am not sure -
> maybe it's even possible today). I think it would be really great if we
> had that option, because it makes it much easier to set up the
> authorization for Airflow users - rather than setting up authorization
> and REST calls coming from an external system, we can utilize Airflow
> Connections to authorize such a Trigger to subscribe to events.
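>
> To make that concrete, a rough sketch of the kind of Trigger I have in
> mind (the pubsub client is a stand-in and the classpath is made up; only
> BaseTrigger/TriggerEvent and serialize()/run() are the real interfaces):
>
>     import asyncio
>
>     from airflow.triggers.base import BaseTrigger, TriggerEvent
>
>     class PubSubMessageTrigger(BaseTrigger):
>         """Fires an event for each message on a subscription."""
>
>         def __init__(self, subscription: str, poll_interval: float = 5.0):
>             super().__init__()
>             self.subscription = subscription
>             self.poll_interval = poll_interval
>
>         def serialize(self):
>             # Triggers are serialized so the triggerer can re-create them.
>             return (
>                 "my_provider.triggers.PubSubMessageTrigger",
>                 {
>                     "subscription": self.subscription,
>                     "poll_interval": self.poll_interval,
>                 },
>             )
>
>         async def _pull_messages(self):
>             # Stub - a real trigger would use an async pubsub client here,
>             # authorized via an Airflow Connection.
>             return []
>
>         async def run(self):
>             while True:
>                 for message in await self._pull_messages():
>                     yield TriggerEvent({"message": message})
>                 await asyncio.sleep(self.poll_interval)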
>
> For the push proposal - as I read it, the main point is that, rather
> than users having to write the "Airflow" way of triggering events and
> configure authentication (using the REST API) to generate asset events,
> Airflow should natively understand external ways of pushing -
> effectively authorizing and mapping such incoming unauthorized requests
> into events that could otherwise be generated by a REST API call.
> Honestly, I am not really sure if this is something that we want running
> in Airflow as an endpoint. I'd say such an unauthorized endpoint is
> probably not a good idea - for a variety of reasons, mostly security.
> And as I understand it, the goal is that users can easily point a
> "3rd-party" notification at Airflow and get the event generated.
>
> My feeling is that while this is needed, it should be externalized from
> the Airflow webserver. The authorization has to be set up additionally
> anyway - unlike in the "poll" case, we cannot use Connections for
> authorizing (because it's not Airflow that authorizes in an external
> system - it's the other way round). So we have to set up "something
> extra" in Airflow to authorize the external system anyhow. That could be
> what we have now - a user that allows us to trigger the event. Which
> means that our REST API could potentially be used the same way it is
> now, but we will need "something" (a library, a lambda function, etc.)
> that users could easily set up in the external system to map whatever
> trigger it generates natively (say, an S3 file created) to the Airflow
> REST API.
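>
> For illustration, such a lambda could be as small as this (the handler
> shape follows the standard S3 notification event; the Airflow URL, the
> dedicated user's credentials, and the dataset URI mapping are made up):
>
>     import json
>     import urllib.request
>
>     AIRFLOW_EVENTS_URL = "https://airflow.example.com/api/v1/datasets/events"
>
>     def handler(event, context):
>         # Map each S3 "object created" record to an Airflow dataset event.
>         for record in event.get("Records", []):
>             bucket = record["s3"]["bucket"]["name"]
>             key = record["s3"]["object"]["key"]
>             body = json.dumps({"dataset_uri": f"s3://{bucket}/{key}"})
>             request = urllib.request.Request(
>                 AIRFLOW_EVENTS_URL,
>                 data=body.encode(),
>                 headers={
>                     "Content-Type": "application/json",
>                     # Credentials of the dedicated Airflow user.
>                     "Authorization": "Basic <base64-credentials>",
>                 },
>             )
>             urllib.request.urlopen(request)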
>
> As I see it, this pattern is quite often used (and very practical): you
> deploy a cloud function or lambda that subscribes to the event received
> when an S3/GCS object is created. So it would be on the user to deploy
> such a lambda - but we **could** provide a library of those, say an S3
> lambda and a GCP cloud function in the respective providers, with
> documentation on how to set them up and how to configure authorization,
> and we would be generally "done". I am just not sure if we need a new
> entity in Airflow for that (Event receiver). It feels like it asks
> Airflow to take on more responsibility, when we are all thinking about
> what to "remove" from Airflow rather than "add" to it - especially when
> it comes to external integrations. It feels to me that Airflow should
> make it easy to be triggered by such an external system and easy to
> "map" that to the way we expect to get events triggered, but this should
> be done outside of Airflow. If users searching our docs for "what do I
> do to externally trigger Airflow on S3 change" can easily find either a)
> configure polling in Airflow using an S3 Connection, or b) "create a
> user + deploy this lambda with those parameters" - that is "easy enough"
> and very practical as well.
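>
> And option a) is already expressible today - a rough sketch using a
> deferrable sensor polling through an S3 Connection (the bucket, key
> pattern, and DAG name are made up):
>
>     from datetime import datetime
>
>     from airflow import DAG
>     from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
>
>     with DAG("wait_for_s3_file", start_date=datetime(2024, 1, 1),
>              schedule=None):
>         # deferrable=True hands the waiting over to the triggerer
>         # instead of occupying a worker slot; auth comes from the
>         # "aws_default" Connection.
>         S3KeySensor(
>             task_id="wait_for_file",
>             bucket_name="my-bucket",
>             bucket_key="incoming/*.csv",
>             wildcard_match=True,
>             aws_conn_id="aws_default",
>             deferrable=True,
>             poke_interval=60,
>         )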
>
> But maybe I am not seeing the whole picture and the real problem it's
> solving - so take it as a "first review pass" and a "gut feeling".
>
> J.
>
>
>
>
> On Thu, Jul 25, 2024 at 10:55 PM Beck, Vincent <vincb...@amazon.com.invalid>
> wrote:
>
> > Hello everyone,
> >
> > I created a draft AIP regarding "External event driven scheduling in
> > Airflow". This proposal is about adding the capability to Airflow to
> > schedule DAGs based on external events. Here are some examples of such
> > external events:
> > - A user signs up to one of the user pools defined in my cloud provider
> > - One of the databases used in my company has been updated
> > - A job in my cloud provider has been executed successfully
> >
> > The intent of this AIP is to leverage datasets (which will soon be
> > assets) and update them based on external events. I would like to
> > propose this AIP for discussion and, more importantly, hear some
> > feedback from you :)
> >
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
> >
> > Vincent
> >
>
