Hi Everyone,

I have an enhancement proposal for the REST API service. It is based on
the observation that Airflow users want to access Airflow more easily as a
platform service.

The motivation comes from the following use cases:
1. Users like data scientists want to iterate over data quickly with
interactive feedback in minutes, e.g. managing data pipelines inside a
Jupyter Notebook while executing them in a remote Airflow cluster.
2. Services targeting specific audiences can generate DAGs based on inputs
like user command or external triggers, and they want to be able to submit
DAGs programmatically without manual intervention.

I believe supporting such use cases would improve Airflow's usability and
broaden its adoption. The existing DAG repo model brings considerable
overhead for these scenarios: a shared repo requires offline processes and
can be slow to roll out.

The proposal aims to provide an alternative where a DAG can be transmitted
over the network. Here are the key points:
1. A DAG is packaged individually so that it is distributable over the
network. For example, a DAG may be a serialized binary or a zip file.
2. The Airflow REST API is the ideal place to talk with the external world.
The API would provide a generic interface to accept DAG artifacts and
should be extensible to support different artifact formats if needed.
3. DAG persistence needs to be implemented since these DAGs are not part of
the DAG repository.
4. DAGs submitted through the API behave the same as those defined in the
repo, i.e. users write DAGs with the same syntax, and scheduling,
execution, and the web server UI work the same way.

Since DAGs are written as code, running arbitrary code inside Airflow may
pose high security risks. Here are a few proposals to mitigate them:
1. Accept DAGs only from trusted parties. Airflow already supports
pluggable authentication modules where strong authentication such as
Kerberos can be used.
2. Execute DAG code as the API identity, i.e. a DAG created through the API
service will have run_as_user set to the API identity.
3. To enforce data access control on DAGs, the API identity should also be
used to access the data warehouse.
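As an illustration of points 2 and 3, here is a minimal pure-Python sketch
of stamping the authenticated API identity onto a submitted DAG's
configuration. The helper name and config shape are hypothetical, not
actual Airflow internals:

```python
def stamp_api_identity(dag_config: dict, api_identity: str) -> dict:
    """Force run_as_user to the authenticated API identity (hypothetical helper).

    Any caller-supplied run_as_user is overwritten, so DAG tasks execute --
    and access the data warehouse -- only with the submitter's own
    privileges.
    """
    stamped = dict(dag_config)
    stamped["run_as_user"] = api_identity  # ignore any caller-supplied value
    return stamped

# Example: a DAG submitted by user "alice" runs as "alice", even if the
# payload tried to claim another identity.
config = stamp_api_identity({"dag_id": "demo", "run_as_user": "root"}, "alice")
print(config["run_as_user"])  # -> alice
```

The key design choice is that the identity comes from the authentication
layer, never from the submitted payload, so the API cannot be used to
escalate privileges.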

We shared a demo based on a prototype implementation at the summit; some
details are described in this slide deck
<https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>.
We would love to get feedback and comments from the community about this
initiative.

thanks
Mocheng
