This has been discussed several times, and I think you should take a look at
the proposals that already address it:

* https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-20+DAG+manifest
*
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher

Both proposals are intended to address the various caveats involved in
submitting Python DAGs via an API.

* The DAG manifest was a proposal to attach metadata to a DAG so that
Airflow knows what other resources the DAG needs in order to run.
* The Remote DAG Fetcher, on the other hand, would let a user choose among
various mechanisms to "pull" DAG code (and if you develop your fetcher in a
way that allows push, a push model could also work once we add some async
notifications there).
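To make the pull model concrete, here is a minimal sketch of what a fetcher
abstraction could look like. This is purely illustrative: AIP-5 has no
finalised interface, and the class names (`DagFetcher`, `InMemoryFetcher`)
and the `fetch()` signature are my own invention, not anything from Airflow.

```python
import abc
import tempfile
from pathlib import Path


class DagFetcher(abc.ABC):
    """Hypothetical base class for a remote DAG fetcher (pull model).

    A concrete subclass would pull DAG files from some remote source
    (git, S3, an artifact store, ...) into a local destination folder
    that the DAG file processor then parses as usual.
    """

    @abc.abstractmethod
    def fetch(self, destination: Path) -> list[Path]:
        """Pull DAG files into ``destination`` and return the written paths."""


class InMemoryFetcher(DagFetcher):
    """Toy fetcher that 'pulls' DAG sources from an in-memory mapping."""

    def __init__(self, store: dict[str, str]):
        self.store = store  # filename -> DAG source code

    def fetch(self, destination: Path) -> list[Path]:
        written = []
        for name, source in self.store.items():
            path = destination / name
            path.write_text(source)
            written.append(path)
        return written
```

The important property of this model is that the fetcher runs on the
scheduler/processor side; nothing executable ever travels through the
webserver or the REST API.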

I personally think using the REST API to submit DAGs is a bad idea because
it goes against Airflow's current security model, in which the webserver
(and the REST API) has not merely "READ ONLY" access to DAGs but actually
"NO ACCESS" to them whatsoever.
Currently, the webserver (i.e. the API server) has no physical access to
any location where executable DAG code could be read, let alone changed; it
only accesses the DB. Changing that would be a huge change to the security
model. It would actually reverse the changes we introduced in Airflow 1.10
and made the only option in Airflow 2 (DAG serialisation), where we
specifically put a lot of effort into removing the webserver's need to
access DAG files - and the security model we chose was the main driver for
that.
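The serialisation idea can be sketched in a few lines. This is not
Airflow's actual implementation (which uses the `SerializedDagModel` table
and its own JSON schema); it is just a toy illustration, using an in-memory
SQLite table with made-up column names, of the point that the webserver
renders DAGs from a serialized DB representation and never imports or
executes DAG code:

```python
import json
import sqlite3

# Toy stand-in for the metadata database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")

# The DAG file processor (NOT the webserver) parses the Python DAG file
# and stores a serialized, non-executable representation of it:
dag_repr = {"dag_id": "example", "tasks": [{"task_id": "start"}]}
conn.execute(
    "INSERT INTO serialized_dag VALUES (?, ?)",
    ("example", json.dumps(dag_repr)),
)

# The webserver renders the UI purely from this row - plain data, no code:
row = conn.execute(
    "SELECT data FROM serialized_dag WHERE dag_id = ?", ("example",)
).fetchone()
loaded = json.loads(row[0])
```

An API that accepted executable DAG artifacts would punch a hole straight
through this separation.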

Making it possible to submit new executable code via Airflow's REST API
would significantly increase the danger of exposing the API and make it an
order of magnitude more attractive point of attack. You would essentially
be allowing anyone with API access to submit executable code that will be
run by the DAG file processor and the workers.
Because of this, I don't think using the REST API for that is a good idea -
for me it is a no-go.

However, both AIP-5 and AIP-20 (once discussed, approved, and implemented)
should nicely address the user requirement you have without compromising
the security of the APIs - so I'd heartily recommend taking a look there
and seeing whether you could take the lead in those discussions and
finalise them. Currently no one is actively working on those two AIPs, but
I think there are at least a few people who would like to be involved if
someone leads the effort (myself included).

J.


On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:

> Hi Everyone,
>
> I have an enhancement proposal for the REST API service. This is based on
> the observations that Airflow users want to be able to access Airflow more
> easily as a platform service.
>
> The motivation comes from the following use cases:
> 1. Users like data scientists want to iterate over data quickly with
> interactive feedback in minutes, e.g. managing data pipelines inside
> Jupyter Notebook while executing them in a remote airflow cluster.
> 2. Services targeting specific audiences can generate DAGs based on inputs
> like user command or external triggers, and they want to be able to submit
> DAGs programmatically without manual intervention.
>
> I believe such use cases would improve Airflow's usability and help it
> gain popularity. The existing DAG repo adds considerable overhead for
> such scenarios: a shared repo requires offline processes and can be slow
> to roll out.
>
> The proposal aims to provide an alternative where a DAG can be transmitted
> online and here are some key points:
> 1. A DAG is packaged individually so that it can be distributable over the
> network. For example, a DAG may be a serialized binary or a zip file.
> 2. The Airflow REST API is the ideal place to talk with the external
> world. The API would provide a generic interface to accept DAG artifacts
> and should be extensible to support different artifact formats if needed.
> 3. DAG persistence needs to be implemented since they are not part of the
> DAG repository.
> 4. DAGs submitted via the API should behave the same as those defined in
> the repo, i.e. users write DAGs in the same syntax, and their scheduling,
> execution, and web server UI should behave the same way.
>
> Since DAGs are written as code, running arbitrary code inside Airflow may
> pose high security risks. Here are a few proposals to prevent security
> breaches:
> 1. Accept DAGs only from trusted parties. Airflow already supports
> pluggable authentication modules where strong authentication such as
> Kerberos can be used.
> 2. Execute DAG code as the API identity, i.e. A DAG created through the
> API service will have run_as_user set to be the API identity.
> 3. To enforce data access control on DAGs, the API identity should also be
> used to access the data warehouse.
>
> We shared a demo based on a prototype implementation at the summit; some
> details are described in this ppt
> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
> and we would love to get feedback and comments from the community about
> this initiative.
>
> thanks
> Mocheng
>
