[
https://issues.apache.org/jira/browse/AIRFLOW-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779369#comment-16779369
]
gsemet commented on AIRFLOW-1847:
---------------------------------
Yes, but since the web server is already a web server, i would like it (or a
minibackend behind the extension) to provide a webhook for that.
Also, bust treatment need to be adressed somehow, so we would need a kind of
queue. Overall this very classic scheme should be handled by the webhook sensor
proposal.
> Webhook Sensor
> --------------
>
> Key: AIRFLOW-1847
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1847
> Project: Apache Airflow
> Issue Type: Improvement
> Components: core, operators
> Reporter: gsemet
> Assignee: gsemet
> Priority: Minor
> Labels: api, sensors, webhook
> Attachments: airflow-webhook-proposal.png
>
>
> h1. Webhook sensor
> May require a hook in the experimental API
> Register an api endpoint and wait for input on each.
> It is different than the {{dag_runs}} api in that the format is not airflow
> specific, it is just a callback web url called by an external system on some
> even with its application specific content. The content in really important
> and need to be sent to the dag (as XCom?)
> Use Case:
> - A Dag registers a WebHook sensor named {{<webhookname>}}
> - An custom endpoint is exposed at
> {{http://myairflow.server/api/experimental/webhook/<webhookname>}}.
> - I set this URL in the external system I wish to use the webhook from. Ex:
> github/gitlab project webhook
> - when the external application performs a request to this URL, this is
> automatically sent to the WebHook sensor. For simplicity, we can have a
> JsonWebHookSensor that would be able to carry any kind of json content.
> - sensor only job would be normally to trigger the exection of a DAG,
> providing it with the json content as xcom.
> If there are several requests at the same time, the system should be scalable
> enough to not die or not slow down the webui. It is also possible to
> instantiate an independant flask/gunicorn server to split the load. It would
> mean it runs on another port, but this could be just an option in the
> configuration file or even a complete independant application ({{airflow
> webhookserver}}). I saw recent changes integrated gunicorn in airflow core,
> guess it can help this use case.
> To support the charge, I think it is good that the part in the API just post
> the received request in an internal queue so the Sensor can handle them later
> without risk of missing one.
> Documentation would be updated to describe the classic scheme to implement
> this use case, which would look like:
> !airflow-webhook-proposal.png!
> I think it is good to split it into 2 DAGs, one for linear handling of the
> messages and triggering new DAG, and the processing DAG that might be
> executed in parallel.
> h2. Example usage in Sensor DAG: trigger a DAG on GitHub Push Event
> {code}
> sensor = JsonWebHookSensor(
> task_id='my_task_id',
> name="on_github_push"
> )
> .. user is responsible to triggering the processing DAG himself.
> {code}
> In my github project, I register the following URL in webhook page:
> {code}
> http://airflow.myserver.com/api/experimental/webhook/on_github_push
> {code}
> From now on, on push, github will send a [json with this
> format|https://developer.github.com/v3/activity/events/types/#pushevent] to
> the previous URL.
> The {{JsonWebHookSensor}} receives the payload, and a new dag is triggered in
> this Sensing Dag.
> h2. Documenation update
> - add new item in the [scheduling
> documentation|https://pythonhosted.org/airflow/scheduler.html] about how to
> trigger a DAG using a webhook
> - describe the sensing dag + processing dag scheme and provide the github use
> case as real life example
> h2. Possible evolutions
> - use an external queue (redis, amqp) to handle lot of events
> - subscribe in a pub/sub system such as WAMP?
> - allow batch processing (trigger processing DAG on n events or after a
> timeout, gathering n messages alltogether)
> - for higher throughput, kafka?
> - Security, authentication and other related subject might be adresses in
> another ticket.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)