> For the security concern, is it true that "access DAG files" means loading DAG code? If that's correct, the proposal will not introduce it inside the API/webserver; the DAG could be serialized in the API client, and DAG code/files sent through the API would be handled as a blob, but that blob needs to be persisted and metadata inside the DB needs to be updated. For task execution in the worker, it can be better isolated with the current internal API initiative. I may have missed the discussion in AIP-5, and maybe you could help educate me here: what are the security differences between DAGs pushed to the API vs DAGs pulled from remote repositories in AIP-5?
Opening up submission via the API actually changes a lot. What you are essentially proposing is that the Airflow API allow a person who is authenticated via the Airflow API/webserver to submit code that will be executed elsewhere (DagFileProcessor/worker). This is not possible today, not because authentication or the webserver implementation prevents it, but because when Airflow is deployed there is no physical possibility to submit such code. The Airflow webserver simply cannot change the code that is executed, because it does not have that code mounted. The code displayed in the webserver is a small subset of the code (just the DAG) from the DB, and it is never executed; it is not even a serialized blob, it is the source code of the DAG file the DAG came from. The best you can do is modify that source code in the DB, but no code in the DB ever gets executed (if your deployment is done well and you do not allow pickling).

Effectively, this aspect, the responsibility, and the security perimeter are in the hands of the people who do the deployment (i.e. users), not application developers (people who submit code to Airflow). Your proposal completely changes that responsibility. Instead of the "users" who deploy Airflow, this security would now be in the hands of application developers (i.e. people who commit code to Airflow). We deliberately decided to take that responsibility off our shoulders and pass it to the users; your proposal is really an attempt to put it back on our shoulders.

The webserver is the only component that is potentially available to users, and it is a security gateway that is exposed at least to the internal "users" on the internal network. Opening it up to accept code that will be executed by design is simply not a good idea IMHO. Only the worker and the DAG file processor should ever have the possibility of executing user-provided code. This is what we have now.
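To make the read-only model concrete, here is a minimal sketch of the idea that the webserver treats DAG source purely as data to display, never as a program to run. This is not Airflow's actual implementation; the table and column names are illustrative, and an in-memory SQLite DB stands in for the metadata database:

```python
import sqlite3

# Illustrative schema: DAG source text stored in the metadata DB.
# The real Airflow model is more elaborate; the shape here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_code (fileloc TEXT PRIMARY KEY, source_code TEXT)")
conn.execute(
    "INSERT INTO dag_code VALUES (?, ?)",
    ("/dags/example.py", "from airflow import DAG\n# ... dag definition ..."),
)

def get_dag_source(fileloc: str) -> str:
    """Return DAG source as plain text for display.

    Note there is no exec()/import anywhere on this path: the code is
    data to render in the UI, never a program to run.
    """
    row = conn.execute(
        "SELECT source_code FROM dag_code WHERE fileloc = ?", (fileloc,)
    ).fetchone()
    return row[0] if row else ""

print(get_dag_source("/dags/example.py").splitlines()[0])
```

The point Jarek makes follows directly: even someone who manages to modify the stored source cannot make the webserver execute it, because nothing on the read path evaluates the text.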
With AIP-43, implemented in 2.3, even the scheduler no longer needs access to DAG files or to execute DAG code. You might have separate DagFileProcessors to do that.

> Besides security, one difference between AIP-5/AIP-20 and API is that AIP-5/AIP-20 design is only about reading and does not manage DAG creation inside Airflow, I understand if this is currently by design to keep Airflow off the storage responsibility and instead rely on external service/process to manage and supply DAG repo, but it brings extra complexity, for example, this external service/process needs to understand Airflow and prevent duplicate dag_id.

Correct. This makes deployment more complex, and it is a deliberate design decision. It will not be changed, I am afraid, precisely for the reasons described above.

> The API proposal could support it natively with access to the DB, and can synchronously return status to a client. If this can be alternatively included inside AIP-5 that'd be great.

Once AIP-5/AIP-20 are implemented, you will be able to implement your own API if you want to submit DAGs this way, no problem with that. While Airflow components might pull the data, this should be completely decoupled from the public API of Airflow. The public API serves a different purpose, but there is nothing to prevent you from implementing your own API. In fact, if you look closely, this is already happening, and it is possible even now via various deployments.
Those are different APIs, deployment specific, and while there is indeed no "synchronous" triggering of the DAG, you can submit the code even now via various mechanisms:

* Git push (if Airflow workers/DAG processors use GitSync)
* Copy files to a shared volume (if they use NFS/EFS etc.)
* Push files to S3/GCS (if S3/GCS filesystems are used)

Those are APIs, not REST APIs, but they are still APIs, and all of them are vastly superior to a REST API for sending a bunch of Python files (because this is what they have been designed for). The only problem (which a REST API does not solve on its own) is the lack of synchronous waiting for the DAG to become eligible to run. This happens asynchronously now, and it is not something that can be changed easily even if you use a REST API.

IMHO, what you are really looking for is a better-integrated way of submitting a DAG and waiting for it to be ready to run. But you do not need a REST API for submitting the code at all. A naive implementation of code submission via REST API in the current architecture will not "magically" return when the DAG is ready to run. There is a lot more happening in the scheduler to make a DAG ready, and a REST API to submit the code does not solve that at all.

Maybe a better solution (and that could be part of the DAG fetcher AIP-5) is to expose some async/WebSocket API where you could subscribe to be notified when a DAG is ready to run? I think AIP-5 is very far from being complete, so this might definitely be part of it. I encourage you to propose it there if you think it might be a good idea. But just don't ask us to take security responsibility for code submitted via Airflow's REST API. This is not something we, I think, would like to do (or at least it is a responsibility we got rid of recently and deliberately).

J.
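The "wait until the DAG is ready" flow discussed above can be approximated today on the client side by polling Airflow 2's stable read-only REST API (GET /api/v1/dags/{dag_id} returns 404 until the DAG has been parsed). A minimal sketch, with the HTTP call abstracted behind an injected `fetch` callable so the example stays self-contained; the function name and the 404-to-None mapping are assumptions of this sketch:

```python
import time

def wait_until_dag_registered(dag_id, fetch, timeout=60.0, interval=0.5):
    """Poll until the DagFileProcessor/scheduler has parsed the DAG.

    `fetch(dag_id)` should return the DAG's metadata dict, or None if the
    DAG is not (yet) known -- e.g. a thin wrapper around
    GET /api/v1/dags/{dag_id} that maps HTTP 404 to None.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        dag = fetch(dag_id)
        if dag is not None:
            return dag
        time.sleep(interval)
    raise TimeoutError(f"DAG {dag_id!r} not registered within {timeout}s")

# Fake fetcher standing in for the HTTP call, purely for illustration:
calls = {"n": 0}
def fake_fetch(dag_id):
    calls["n"] += 1
    return {"dag_id": dag_id} if calls["n"] >= 3 else None

print(wait_until_dag_registered("example", fake_fetch, interval=0.0)["dag_id"])  # -> example
```

This only confirms the DAG is parsed and visible; as the email notes, a lot more happens in the scheduler before a DAG is truly ready, which is why a subscription-style async API would be the cleaner long-term answer.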
On Thu, Aug 11, 2022 at 03:19 Mocheng Guo <gmca...@gmail.com> wrote:

> Hi Jarek, thanks a lot for the feedback and I understand security is a
> major concern and I would like to discuss more here. AIP-5/AIP-20 share the
> same goal to be able to ship DAGs individually but there are some
> differences and I'd be happy to align them together if that is possible.
>
> For the security concern, is it true that "access DAG files" means loading
> DAG code? If that's correct, the proposal will not introduce it inside the
> api/web server, the DAG could be serialized in API client and DAG
> code/files through the API would be handled as a blob but it needs to be
> persisted and meta data inside DB needs to be updated. For task execution
> in worker, it can be better isolated with current internal API initiative,
> and I have missed discussion in AIP-5 and maybe you could help educate me
> here, what are the security differences between DAGs pushed to API vs DAGs
> pulled from remote repositories in AIP-5?
>
> Besides security, one difference between AIP-5/AIP-20 and API is that
> AIP-5/AIP-20 design is only about reading and does not manage DAG creation
> inside Airflow, I understand if this is currently by design to keep Airflow
> off the storage responsibility and instead rely on external service/process
> to manage and supply DAG repo, but it brings extra complexity, for example,
> this external service/process needs to understand Airflow and prevent
> duplicate dag_id. The API proposal could support it natively with access to
> the DB, and can synchronously return status to client. If this can be
> alternatively included inside AIP-5 that'd be great.
>
> thanks
> Mocheng
>
>
> On Wed, Aug 10, 2022 at 5:27 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> This has been discussed several times, and I think you should rather take
>> a look and focus on those proposals already there:
>>
>> * https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-20+DAG+manifest
>>
>> * https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher
>>
>> Both proposals are supposed to address various caveats connected with
>> trying to submit Python DAG via API.
>>
>> * DAG manifest was a proposal on how to add meta-data to "limit" dag to
>> know what kind of other resources are needed for it to run
>> * Remote DAG Fetcher on the other hand would allow a user to use various
>> mechanisms to "Pull the data" (or if you develop your fetcher in the way to
>> allow Push, it would also allow Push model to work if we add some async
>> notifications there).
>>
>> I personally think using rest API to submit DAGs is a bad idea because it
>> is against the current security model of Airflow where Webserver (and also
>> REST API) has not only "READ ONLY" access to DAGs, but also has actually
>> "NO ACCESS" to DAGs whatsoever.
>> Currently, the webserver (i.e. API server) has no physical access to any
>> resources where executable DAG code is accessed and can be not only changed
>> but read directly. It only accesses the DB. Changing that is a huge change
>> in the security model. Actually it goes backwards to the changes we've
>> implemented in Airflow 1.10 initially and leaving that as the only option
>> in Airflow 2 (introducing DAG serialisation) where we specifically put a
>> lot of effort to remove the need for the webserver to access DAG files -
>> and security model we chose was the main driver for that.
>>
>> Making it possible to submit a new executable code via REST API of
>> Airflow would significantly increase dangers of exposing the API and make
>> it an order of magnitude more serious point of attack for an attacker.
>> Basically you are allowing the person who has access to API to submit an
>> executable code that should be executable by DAG file processor and worker.
>> Due to this - I don't think using REST API for that is a good idea and
>> for me this is no-go.
>>
>> However both AIP-5 and AIP-20 (when discussed, approved and implemented)
>> should nicely address the user requirement you have, without compromising
>> the security of the APIs - so I'd heartily recommend you to take a look
>> there and see if maybe you could take a lead in those discussions and
>> finalising them. Currently there is no-one actively working on those two
>> AIPs, but I think there are at least a few people who would like to be
>> involved if there is someone who will lead this effort (myself included).
>>
>> J.
>>
>>
>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I have an enhancement proposal for the REST API service. This is based
>>> on the observations that Airflow users want to be able to access Airflow
>>> more easily as a platform service.
>>>
>>> The motivation comes from the following use cases:
>>> 1. Users like data scientists want to iterate over data quickly with
>>> interactive feedback in minutes, e.g. managing data pipelines inside
>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>> 2. Services targeting specific audiences can generate DAGs based on
>>> inputs like user command or external triggers, and they want to be able to
>>> submit DAGs programmatically without manual intervention.
>>>
>>> I believe such use cases would help promote Airflow usability and gain
>>> more customer popularity. The existing DAG repo brings considerable
>>> overhead for such scenarios, a shared repo requires offline processes and
>>> can be slow to rollout.
>>>
>>> The proposal aims to provide an alternative where a DAG can be
>>> transmitted online and here are some key points:
>>> 1. A DAG is packaged individually so that it can be distributable over
>>> the network. For example, a DAG may be a serialized binary or a zip file.
>>> 2. The Airflow REST API is the ideal place to talk with the external
>>> world. The API would provide a generic interface to accept DAG artifacts
>>> and should be extensible to support different artifact formats if needed.
>>> 3. DAG persistence needs to be implemented since they are not part of
>>> the DAG repository.
>>> 4. Same behavior for DAGs supported in API vs those defined in the repo,
>>> i.e. users write DAGs in the same syntax, and its scheduling, execution,
>>> and web server UI should behave the same way.
>>>
>>> Since DAGs are written as code, running arbitrary code inside Airflow
>>> may pose high security risks. Here are a few proposals to stop the security
>>> breach:
>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>> pluggable authentication modules where strong authentication such as
>>> Kerberos can be used.
>>> 2. Execute DAG code as the API identity, i.e. a DAG created through the
>>> API service will have run_as_user set to be the API identity.
>>> 3. To enforce data access control on DAGs, the API identity should also
>>> be used to access the data warehouse.
>>>
>>> We shared a demo based on a prototype implementation in the summit and
>>> some details are described in this ppt
>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>> and would love to get feedback and comments from the community about this
>>> initiative.
>>>
>>> thanks
>>> Mocheng
>>>
>>
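For readers following the thread, the submission flow Mocheng proposes (DAG shipped as an opaque artifact, persisted with metadata, run_as_user pinned to the authenticated API identity, duplicate dag_id rejected) could be sketched roughly as below. Everything here is hypothetical: it is the design under discussion, not existing Airflow code, and all names, the trusted-identity set, and the in-memory store are illustrative stand-ins:

```python
import hashlib

# Hypothetical server-side handler for the proposed "submit DAG via API"
# endpoint. It illustrates the three security points of the proposal:
# trusted identities only, run_as_user pinning, and treating the DAG as
# an opaque blob that is persisted but never executed by this component.

TRUSTED_IDENTITIES = {"alice@corp.example", "bob@corp.example"}  # assumed strong auth upstream
dag_store = {}                                                   # stand-in for DB persistence

def submit_dag(identity: str, dag_id: str, artifact: bytes) -> dict:
    if identity not in TRUSTED_IDENTITIES:
        raise PermissionError(f"{identity} is not allowed to submit DAGs")
    existing = dag_store.get(dag_id)
    if existing is not None and existing["run_as_user"] != identity:
        raise ValueError(f"dag_id {dag_id!r} already owned by another identity")
    record = {
        "run_as_user": identity,  # the DAG would later execute as this user
        "sha256": hashlib.sha256(artifact).hexdigest(),
        "artifact": artifact,     # opaque bytes: stored, never imported here
    }
    dag_store[dag_id] = record
    return {"dag_id": dag_id, "sha256": record["sha256"]}

receipt = submit_dag("alice@corp.example", "demo_dag", b"print('dag code')")
print(receipt["dag_id"])  # -> demo_dag
```

Note that this sketch does nothing to answer Jarek's core objection: the component accepting the blob becomes part of the code-delivery path, which is exactly the responsibility the current architecture keeps out of the webserver.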