Thanks Vincent. I took a look and I have a general comment. I
strongly think externally driven scheduling is really needed -
especially, it should be much easier for a user to "plug" such an
external event into Airflow. And there are two parts to it - as
correctly stated there - pull and push.

For the pull - I think it would be great to have a kind of specialized
Trigger that is started when the DAG is parsed - and those Triggers
could generate the events for DAGs. I think that's basically all that
is needed. For example, I imagine a pubsub trigger that subscribes to
messages coming on the pubsub queue and fires an "Asset" event when a
message is received. Not much controversy there.

I am not sure about the polling thing, because I've always believed
that when an "asyncio-native" Trigger runs in the asyncio event loop,
we do not "poll" every second or so (but maybe this impression just
comes from some specific triggers that actually do such regular
polls). But yes - there are polls, like running a select on the DB,
that cannot be easily "async-ed", so having a configurable polling
interval would be good there (though I am not sure - maybe it's even
possible today). I think it would be really great if we had that
option, because it makes it much easier to set up the authorization
for Airflow users - rather than setting up authorization and REST
calls coming from an external system, we can utilize Airflow
Connections to authorize such a Trigger to subscribe to events.
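
To make it concrete - a rough sketch of what such a trigger could look
like, reusing today's BaseTrigger contract. The pubsub client here
(make_async_pubsub_client) is a hypothetical stand-in, and the wiring
that would turn the yielded event into an "Asset" event is exactly the
part the AIP would have to define:

    from airflow.triggers.base import BaseTrigger, TriggerEvent


    class PubSubSubscriptionTrigger(BaseTrigger):
        """Sketch: stay subscribed to a pubsub subscription and emit
        one event per message received."""

        def __init__(self, subscription: str, conn_id: str = "pubsub_default"):
            super().__init__()
            self.subscription = subscription
            self.conn_id = conn_id

        def serialize(self):
            # Standard trigger contract: classpath + kwargs, so the
            # triggerer can recreate the instance after a restart.
            return (
                "example.triggers.PubSubSubscriptionTrigger",
                {"subscription": self.subscription, "conn_id": self.conn_id},
            )

        async def run(self):
            # Hypothetical async client - a real one would come from a
            # provider hook and be authorized via the Airflow
            # Connection, so no extra credentials setup is needed.
            client = await make_async_pubsub_client(self.conn_id)
            async for message in client.subscribe(self.subscription):
                # Under the AIP, this is where an asset event would be
                # fired for the DAGs that depend on it.
                yield TriggerEvent({"message_id": message.message_id})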

For the push proposal - as I read it, the main point is that rather
than users having to learn the "Airflow" way of triggering events and
configuring authentication (using the REST API) to generate asset
events, Airflow would natively understand external ways of pushing -
effectively authorizing and mapping such incoming, unauthorized
requests into events that could otherwise be generated by a REST API
call.
I am honestly not sure this is something that we want "running" in
Airflow as an endpoint. I'd say such an unauthorized endpoint is
probably not a good idea - for a variety of reasons, mostly security.
And as I understand it, the goal is that users can easily point a
"3rd-party" notification at Airflow and get the event generated.

My feeling is that while this is needed - it should be externalized
from the Airflow webserver. The authorization has to be set up
additionally anyway - unlike in the "pull" case, we cannot use
Connections for authorizing (because it's not Airflow that authorizes
in an external system - it's the other way round). So we have to set
up "something extra" in Airflow anyhow to authorize the external
system. That could be what we have now - a user that allows us to
trigger the event. Which means that our REST API could potentially be
used the same way it is now, but we would need "something" (library,
lambda function etc.) that users could easily set up in the external
system to map whatever trigger they generate natively (say, an S3
file being created) to an Airflow REST API call.
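
For illustration - assuming the "create dataset event" endpoint that
recent Airflow versions expose (POST /api/v1/datasets/events), the
call such a mapper would have to make is roughly this (URL,
credentials and URI are placeholders):

    import requests

    # Authorized by the dedicated user mentioned above.
    response = requests.post(
        "https://airflow.example.com/api/v1/datasets/events",
        auth=("event-publisher", "REPLACE_ME"),
        json={"dataset_uri": "s3://my-bucket/my-key"},
    )
    response.raise_for_status()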

As I see it, this pattern is quite common (and very practical): you
deploy a cloud function or lambda that subscribes to the event
received when an S3/GCS object is created. So it would be on the user
to deploy such a lambda - but we **could** provide a library of those
(say an S3 lambda and a GCP cloud function in the respective
providers) with documentation on how to set them up and how to
configure authorization, and we would be generally "done". I am just
not sure if we need a new entity in Airflow for that (Event
receiver). It feels like it asks Airflow to take on more
responsibility, at a time when we are all thinking about what to
"remove" from Airflow rather than "add" to it - especially when it
comes to external integrations. It feels to me that Airflow should
make it easy to be triggered by such an external system and make it
easy to "map" to the way we expect events to be triggered, but this
mapping should live outside of Airflow. If users can easily find in
our docs, when they search "what do I do to externally trigger
Airflow on S3 change", either a) configure polling in Airflow using
an S3 Connection, or b) "create a user + deploy this lambda with
those parameters" - that is "easy enough" and very practical as well.
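
To illustrate b) - a minimal sketch of such an S3-notification lambda,
assuming basic auth and the same dataset events endpoint as above. The
environment variables are placeholders for exactly the parameters a
packaged version in the provider would expose:

    import json
    import os
    import urllib.request


    def handler(event, context):
        """Map a native S3 "object created" notification to an
        Airflow dataset event via the REST API."""
        for record in event["Records"]:
            # Standard S3 notification payload: one record per object.
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            request = urllib.request.Request(
                os.environ["AIRFLOW_URL"] + "/api/v1/datasets/events",
                data=json.dumps({"dataset_uri": f"s3://{bucket}/{key}"}).encode(),
                headers={
                    "Content-Type": "application/json",
                    # Pre-encoded "user:password" of the dedicated user.
                    "Authorization": "Basic " + os.environ["AIRFLOW_BASIC_AUTH"],
                },
            )
            urllib.request.urlopen(request)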

But maybe I am not seeing the whole picture and the real problem it
is solving - so take it as a "first review pass" and a "gut feeling".

J.




On Thu, Jul 25, 2024 at 10:55 PM Beck, Vincent <vincb...@amazon.com.invalid>
wrote:

> Hello everyone,
>
> I created a draft AIP regarding "External event driven scheduling in
> Airflow". This proposal is about adding capability in Airflow to schedule
> DAGs based on external events. Here are some examples of such external
> events:
> - A user signs up to one of the user pools defined in my cloud provider
> - One of the databases used in my company has been updated
> - A job in my cloud provider has been executed successfully
>
> The intent of this AIP is to leverage datasets (which will soon be
> assets) and update them based on external events. I would like to
> propose this AIP for discussion and, more importantly, hear some
> feedback from you :)
>
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
>
> Vincent
>
