Hi Mocheng,

Please allow me to ask a question first: in your proposal, the API is
still accepting an Airflow DAG as the payload (just serialized or
compressed), right?

If that's the case, I am not fully convinced: the objectives in your
proposal are automation and programmatic submission of DAGs. These can
already be achieved efficiently through CI/CD practices plus a
centralized place to manage your DAGs (e.g. a Git repo hosting the DAG
files).
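
To make that concrete, the CI/CD path I have in mind is roughly the
following sketch (the paths, function name, and compile check are all made
up for illustration — the point is that a reviewed Git checkout, not an API
call, is what lands in the DAGs folder):

```python
# Sketch of a CI deploy step: copy reviewed DAG files from a Git checkout
# into the Airflow DAGs folder. Paths and the compile check are illustrative.
import pathlib
import py_compile
import shutil

def deploy_dags(checkout_dir: str, dags_folder: str) -> list:
    """Copy every .py file from the repo checkout into the DAGs folder,
    refusing files that do not even compile."""
    deployed = []
    dst = pathlib.Path(dags_folder)
    dst.mkdir(parents=True, exist_ok=True)
    for src in pathlib.Path(checkout_dir).glob("*.py"):
        py_compile.compile(str(src), doraise=True)  # fail the CI job on syntax errors
        shutil.copy2(src, dst / src.name)
        deployed.append(src.name)
    return sorted(deployed)
```

A CI job runs this on every merge to the DAG repo, so rollout is just a
merge plus the scheduler's normal DAG-folder scan.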

As you are already aware, allowing this via API adds an additional
security concern, and I doubt that the trade-off "breaks even".

Kindly let me know if I have missed anything or misunderstood your
proposal. Thanks.


Regards,
XD
----------------------------------------------------------------
(This is not a contribution)

On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:

> Hi Everyone,
>
> I have an enhancement proposal for the REST API service. This is based on
> the observation that Airflow users want to be able to access Airflow more
> easily as a platform service.
>
> The motivation comes from the following use cases:
> 1. Users like data scientists want to iterate over data quickly, with
> interactive feedback in minutes, e.g. managing data pipelines inside a
> Jupyter notebook while executing them on a remote Airflow cluster.
> 2. Services targeting specific audiences can generate DAGs based on inputs
> like user commands or external triggers, and they want to be able to submit
> DAGs programmatically without manual intervention.
>
> I believe such use cases would improve Airflow's usability and broaden its
> adoption. The existing DAG repo model brings considerable overhead for
> such scenarios: a shared repo requires offline processes and can be slow
> to roll out.
>
> The proposal aims to provide an alternative where a DAG can be submitted
> online. Here are the key points:
> 1. A DAG is packaged individually so that it can be distributed over the
> network. For example, a DAG may be a serialized binary or a zip file.
> 2. The Airflow REST API is the ideal place to talk with the external
> world. The API would provide a generic interface to accept DAG artifacts
> and should be extensible to support different artifact formats if needed.
> 3. DAG persistence needs to be implemented, since such DAGs are not part
> of the DAG repository.
> 4. DAGs submitted via the API behave the same as those defined in the
> repo, i.e. users write DAGs in the same syntax, and scheduling, execution,
> and the web server UI should behave the same way.
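
(For illustration only — a client-side sketch of what points 1 and 2 might
look like; the /api/v1/dags/upload path and the form field names are
invented for this sketch, not an existing Airflow endpoint:)

```python
# Sketch of a client packaging a DAG file and submitting it to a
# hypothetical DAG-submission REST endpoint. The /api/v1/dags/upload
# path and the "artifact" field name are invented for illustration.
import io
import zipfile

def package_dag(dag_filename: str, dag_source: str) -> bytes:
    """Zip a single DAG file so it can be shipped over the network."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(dag_filename, dag_source)
    return buf.getvalue()

def submit_dag(session, base_url: str, artifact: bytes):
    """POST the artifact; `session` would be e.g. a requests.Session with
    Kerberos or basic auth already configured."""
    resp = session.post(
        f"{base_url}/api/v1/dags/upload",
        files={"artifact": ("dag.zip", artifact, "application/zip")},
    )
    resp.raise_for_status()
    return resp
```

Packaging one zip per DAG keeps each artifact self-contained, and the same
endpoint could accept other artifact formats later by content type.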
>
> Since DAGs are written as code, running arbitrary code inside Airflow may
> pose high security risks. Here are a few proposals to mitigate them:
> 1. Accept DAGs only from trusted parties. Airflow already supports
> pluggable authentication modules where strong authentication such as
> Kerberos can be used.
> 2. Execute DAG code as the API identity, i.e. a DAG created through the
> API service will have run_as_user set to the API identity.
> 3. To enforce data access control on DAGs, the API identity should also be
> used to access the data warehouse.
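
(Again for illustration — one way a server-side handler could enforce
point 2 by stamping the authenticated caller's identity into the DAG's
default_args; the helper below is an assumption for this sketch, not
existing Airflow code, though run_as_user and owner are standard
default_args keys:)

```python
# Sketch: a submission handler pins the DAG's execution identity to the
# authenticated API caller, overriding whatever the submitted DAG set.
def bind_api_identity(default_args: dict, api_identity: str) -> dict:
    """Return default_args with run_as_user forced to the API identity."""
    bound = dict(default_args)
    bound["run_as_user"] = api_identity  # tasks execute under this OS user
    bound["owner"] = api_identity        # surface the identity in the UI
    return bound
```

Forcing the override server-side means a submitted DAG cannot escalate by
declaring its own run_as_user.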
>
> We shared a demo based on a prototype implementation at the summit; some
> details are described in this ppt
> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
> and would love to get feedback and comments from the community about this
> initiative.
>
> thanks
> Mocheng
>
