None of those requirements are supported by Airflow. And opening up the REST API does not solve the authentication use case you mentioned.
This is a completely new requirement you have - basically what you want is workflow identity, and it should be rather independent of the way the DAG is submitted. It would require attaching some kind of identity and signature, and some way of making sure that the DAG has not been tampered with, so that the worker could use the identity when executing the workload and be sure that no one else modified the DAG - including any of the files that the DAG uses. This is an interesting case, but it has nothing to do with using or not using the REST API. The REST API alone will not give you the user identity guarantees that you need here. The distributed nature of Airflow basically requires that such workflow identity be provided by cryptographic signatures and by verifying the integrity of the DAG, rather than basing it on REST API authentication.

BTW. We already support Kerberos authentication for some of our operators, but that identity is necessarily per instance executing the workload - not per user submitting the DAG.

This could be one of the improvement proposals that could in the future become a sub-AIP of AIP-1 (Improve Airflow Security). If you are interested in leading and proposing such an AIP, I will soon (in a month or so) be re-establishing #sig-multitenancy meetings (see AIP-1 for recordings and minutes of previous meetings). We already have AIP-43 and AIP-44 approved there (AIP-43 is close to completion), and the next steps should be introducing a fine-grained security layer for executing the workloads. Adding workload identity might be part of it. If you would like to work on that - you are most welcome. It means preparing and discussing proposals, getting consensus of the involved parties, leading it to a vote, and finally implementing it.

J

On Thu, Aug 18, 2022, 02:44 Mocheng Guo <gmca...@gmail.com> wrote:

> >> Could you please elaborate why this would be a problem to use those
> >> (really good for file pushing) APIs?
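[Editor's note: the workflow-identity idea above - binding a submitter identity to the DAG and every file it uses via a cryptographic signature the worker can verify - could be sketched roughly as below. This is a hypothetical illustration, not an Airflow API; in practice an asymmetric key pair would likely replace the shared secret.]

```python
import hmac
import hashlib
from pathlib import Path

# Hypothetical sketch (not an Airflow API): sign a DAG file and all files it
# uses together with the submitter's identity, so a worker can verify that
# nothing was tampered with between submission and execution.
SECRET_KEY = b"per-deployment-secret"  # assumption; real systems would use asymmetric keys

def sign_dag(files: list[Path], submitter: str, key: bytes = SECRET_KEY) -> str:
    """Produce a signature binding the submitter identity to the file contents."""
    digest = hmac.new(key, submitter.encode(), hashlib.sha256)
    for path in sorted(files):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()

def verify_dag(files: list[Path], submitter: str, signature: str,
               key: bytes = SECRET_KEY) -> bool:
    """Worker-side check: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_dag(files, submitter, key), signature)
```

Any change to any file - or to the claimed submitter - invalidates the signature, which is what lets the worker trust the identity without trusting the transport that delivered the DAG.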
> Submitting DAGs directly to a cloud storage API does help for some part of the use case requirement, but cloud storage does not provide the security a data warehouse needs. A typical auth model supported in a data warehouse is Kerberos, and a data warehouse provides a limited view to a Kerberos user via authorization rules. We need users to submit DAGs with identities supported by the data warehouse, so that Apache Spark jobs will be executed as the Kerberos user who submitted the DAG, which in turn decides what data can be processed. There may also be a need to handle impersonation, so there needs to be an additional layer to handle data warehouse auth, e.g. Kerberos.
>
> Assuming DAGs are already inside cloud storage, I think AIP-5/20 would work better than the current mono-repo model if it could support better flexibility and less latency, and I would be very interested to be part of the design and implementation.
>
> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> First, I appreciate all of your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples lead me to think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted or on a process itself. To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API. There are certainly cases where DAGs may not be safe, e.g. the API service on a public cloud with shared tenants, with no knowledge of how DAGs are generated; in such cases the API service can access-control the identity or even reject all calls when considered unsafe.
>> Please let me know if the example makes sense, and if there is common interest; having an Airflow-native write path would benefit the community instead of each building its own solution.
>>
>> > You seem to repeat more of the same. This is exactly what we want to avoid. IF you can push code over the API, you can push ANY code. And precisely the "access control" you mentioned, or rejecting the call when "considering code unsafe" - those are decisions we already deliberately decided we do not want the Airflow REST API to make. Whether the code is generated or not does not matter, because Airflow has no idea whatsoever whether it has been manipulated between the time it was generated and pushed. The only way Airflow can know that the code has not been manipulated is when it generates the DAG code on its own based on a declarative input. The limit is to push declarative information only. You CANNOT push code via the REST API. This is out of the question. The case is closed.
>>
>> The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating how to automate the changes inside a notebook and programmatically submit DAGs through git+CI/CD while giving the user quick feedback? I understand git+CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD is a slow offline process with a large repo.
>>
>> Case 2 is actually (if you read the article I posted above - it's written there) the case where a shared volume could still be used, and is better.
>> This is why it's great that Airflow supports multiple DAG syncing solutions: your "middle" environment does not have to use git sync, as it is not "production" (unless you want to mix development with testing, that is, which is a terrible, terrible idea).
>>
>> Your data scientist in the middle ground does:
>>
>> a) cp my_dag.py "/my_middle_volume_shared_and_mounted_locally" - if you use a shared volume of some sort (NFS/EFS etc.)
>> b) aws s3 cp my_dag.py "s3://my-middle-testing-bucket/" - if your DAGs are on S3 and synced using s3-sync
>> c) gsutil cp my_dag.py "gs://my-bucket" - if your DAGs are on GCS and synced using GCS sync
>>
>> Those are excellent "file push" APIs. They do the job. I cannot imagine why the middle-loop person would have a problem with using them. All of that can also be fully automated - they all have nice Python and other language APIs, so you can even make the IDE run those commands automatically on every save if you want.
>>
>> Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?
>>
>> J.
>>
>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gmca...@gmail.com> wrote:
>>
>>> First, I appreciate all of your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples lead me to think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted or on a process itself. To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API.
>>> There are certainly cases where DAGs may not be safe, e.g. the API service on a public cloud with shared tenants, with no knowledge of how DAGs are generated; in such cases the API service can access-control the identity or even reject all calls when considered unsafe. Please let me know if the example makes sense, and if there is common interest; having an Airflow-native write path would benefit the community instead of each building its own solution.
>>>
>>> Hi Xiaodong/Jarek, on your suggestion, let me elaborate on a use case. Here are the three loops a data scientist goes through to develop a machine learning model:
>>> - inner loop: iterate on the model locally.
>>> - middle loop: iterate on the model on a remote cluster with production data; say it uses Airflow DAGs behind the scenes.
>>> - outer loop: done iterating; publish the model to production.
>>>
>>> The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating how to automate the changes inside a notebook and programmatically submit DAGs through git+CI/CD while giving the user quick feedback? I understand git+CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD is a slow offline process with a large repo.
>>>
>>> Such a use case is pretty common for data scientists, and a better **online** service model would help open up more possibilities for Airflow and its users, as additional layers providing more value (like Constance mentioned, enabling users with no engineering or Airflow domain knowledge to use Airflow) could be built on top of Airflow, which remains a lower-level orchestration engine.
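[Editor's note: Jarek's "make the IDE run those commands automatically on every save" suggestion could be automated with a small watcher like the hypothetical sketch below. The local directory copy stands in for `aws s3 cp` / `gsutil cp` (or their Python SDKs, e.g. boto3); all names here are illustrative, not part of Airflow.]

```python
import shutil
import time
from pathlib import Path

# Hypothetical sketch of "push on every save": poll a DAG file's mtime and
# re-push it whenever it changes. The local copy below is a stand-in for the
# real cloud upload (`aws s3 cp`, `gsutil cp`, or an SDK call).

def sync_if_changed(src: Path, dst_dir: Path, last_mtime: float) -> tuple[float, bool]:
    """Copy `src` into `dst_dir` if it changed since `last_mtime`.

    Returns the current mtime and whether a push happened.
    """
    mtime = src.stat().st_mtime
    if mtime > last_mtime:
        shutil.copy2(src, dst_dir / src.name)  # stand-in for the cloud upload
        return mtime, True
    return mtime, False

def watch(src: Path, dst_dir: Path, interval: float = 1.0) -> None:
    """Poll forever, pushing on every save (Ctrl+C to stop)."""
    last = 0.0
    while True:
        last, pushed = sync_if_changed(src, dst_dir, last)
        if pushed:
            print(f"pushed {src.name} at {time.strftime('%X')}")
        time.sleep(interval)
```

Combined with git-sync or s3-sync on the Airflow side, this closes the notebook feedback loop without any change to Airflow itself - which is essentially the point Jarek is making.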
>>> thanks,
>>> Mocheng
>>>
>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> I really like the idea of Tomek.
>>>>
>>>> If we ever go for (which is not unlikely) some "standard" declarative way of describing DAGs, all my security and packaging concerns are gone - and submitting such a declarative DAG via API is quite viable. Simply submitting Python code this way is a no-go for me :). Such a declarative DAG could be just stored in the DB and scheduled and executed using only the "declaration" from the DB - without ever touching the DAG "folder" and without allowing the user to submit any executable code this way. All the code to execute would already have to be in Airflow in this case.
>>>>
>>>> And I very much agree that this case can be solved with Git. I think we are generally undervaluing the role Git plays in DAG distribution for Airflow.
>>>>
>>>> I think when the user feels the need (I very much understand the need, Constance) to submit the DAG via API, rather than adding the option of submitting DAG code via the "Airflow REST API", we should simply answer this:
>>>>
>>>> *Use Git and git sync. "Git push" then becomes the standard "API" you wanted for pushing the code.*
>>>>
>>>> This has all the flexibility you need; it has integration with pull requests and CI workflows, it keeps history, etc., etc. When we tell people "use Git" - we get ALL of that and more for free. Standing on the shoulders of giants. If we start thinking about integrating code push via our own API - we basically start the journey of rewriting Git, as eventually we will have to support those cases. This makes absolutely no sense to me.
>>>>
>>>> I even start to think that we should make "git sync" a separate (and much more viable) option that is pretty much the "main recommendation" for Airflow,
>>>> rather than "yet another option among shared folders and baked-in DAGs".
>>>>
>>>> I recently wrote up my thoughts about this in the post "Shared Volumes in Airflow - the good, the bad and the ugly": https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca which has much more detail on why I think so.
>>>>
>>>> J.
>>>>
>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
>>>>
>>>>> I understand the security concerns, and generally agree, but as a regular user I always wished we could upload DAG files via an API. It opens the door to having an upload button, which would be nice. It would make Airflow a lot more accessible to non-engineering types.
>>>>>
>>>>> I love the idea: implementing a manual review option in conjunction with some sort of hook (similar to Airflow cluster policies) would be a good middle ground. An administrator could use that hook to run checks against DAGs or run security scanners, and decide whether or not to impose a review requirement.
>>>>>
>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org> wrote:
>>>>>
>>>>>> In general I second what XD said. CI/CD feels better than sending DAG files over an API, and the security issues arising from accepting "any Python file" are probably quite big.
>>>>>>
>>>>>> However, I think this proposal can be tightly related to "declarative DAGs". Instead of sending a DAG file, the user would send the DAG definition (operators, inputs, relations) in a predefined format that is not code. This of course has some limitations, like the inability to define custom macros or callbacks on the fly, but it may be a good compromise.
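[Editor's note: the "declarative DAG" idea Tomasz describes might look roughly like the hypothetical sketch below - this is not an existing Airflow feature, and the operator names and spec shape are invented for illustration. The key property is that the payload is pure data: the server instantiates only pre-registered operator types, so no user code is ever executed.]

```python
# Hypothetical sketch of server-side validation for a declarative DAG payload:
# tasks reference an allow-list of operator types, and relations may only
# reference declared tasks. Everything else is rejected.

ALLOWED_OPERATORS = {"bash", "sql", "spark_submit"}  # assumed allow-list

def validate_declarative_dag(spec: dict) -> list[str]:
    """Return a list of validation errors (an empty list means the spec is acceptable)."""
    errors = []
    tasks = spec.get("tasks", {})
    for task_id, task in tasks.items():
        if task.get("operator") not in ALLOWED_OPERATORS:
            errors.append(f"{task_id}: operator {task.get('operator')!r} not allowed")
    for upstream, downstream in spec.get("relations", []):
        for t in (upstream, downstream):
            if t not in tasks:
                errors.append(f"relation references unknown task {t!r}")
    return errors
```

A spec like `{"tasks": {"extract": {"operator": "sql"}, "load": {"operator": "spark_submit"}}, "relations": [["extract", "load"]]}` passes, while anything that tries to smuggle in a callable fails the allow-list check - which is exactly the compromise (no custom macros or callbacks on the fly) Tomasz mentions.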
>>>>>> Another thought - if we implement something like "DAG via API", then we should consider adding an option to review DAGs (an approval queue etc.) to reduce the security issues that are otherwise mitigated by, for example, deploying DAGs from git (where we have code review, security scanners etc.).
>>>>>>
>>>>>> Cheers,
>>>>>> Tomek
>>>>>>
>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Mocheng,
>>>>>>>
>>>>>>> Please allow me to share a question first: in your proposal, the API is still accepting an Airflow DAG as the payload (just binarized or compressed), right?
>>>>>>>
>>>>>>> If that's the case, I may not be fully convinced: the objectives in your proposal are about automation & programmatically submitting DAGs. These can already be achieved in an efficient way through CI/CD practices + a centralized place to manage your DAGs (e.g. a Git repo to host the DAG files).
>>>>>>>
>>>>>>> As you are already aware, allowing this via the API adds an additional security concern, and I would doubt whether that "breaks even".
>>>>>>>
>>>>>>> Kindly let me know if I have missed anything or misunderstood your proposal. Thanks.
>>>>>>>
>>>>>>> Regards,
>>>>>>> XD
>>>>>>> ----------------------------------------------------------------
>>>>>>> (This is not a contribution)
>>>>>>>
>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I have an enhancement proposal for the REST API service. This is based on the observation that Airflow users want to be able to access Airflow more easily as a platform service.
>>>>>>>>
>>>>>>>> The motivation comes from the following use cases:
>>>>>>>> 1. Users like data scientists want to iterate over data quickly with interactive feedback in minutes, e.g.
>>>>>>>> managing data pipelines inside a Jupyter Notebook while executing them in a remote Airflow cluster.
>>>>>>>> 2. Services targeting specific audiences can generate DAGs based on inputs like user commands or external triggers, and they want to be able to submit DAGs programmatically without manual intervention.
>>>>>>>>
>>>>>>>> I believe such use cases would help improve Airflow's usability and win it more popularity with users. The existing DAG repo brings considerable overhead for such scenarios: a shared repo requires offline processes and can be slow to roll out.
>>>>>>>>
>>>>>>>> The proposal aims to provide an alternative where a DAG can be transmitted online. Here are some key points:
>>>>>>>> 1. A DAG is packaged individually so that it can be distributed over the network. For example, a DAG may be a serialized binary or a zip file.
>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the external world. The API would provide a generic interface to accept DAG artifacts, and should be extensible to support different artifact formats if needed.
>>>>>>>> 3. DAG persistence needs to be implemented, since these DAGs are not part of the DAG repository.
>>>>>>>> 4. DAGs submitted via the API should behave the same as those defined in the repo, i.e. users write DAGs in the same syntax, and scheduling, execution, and the web server UI should behave the same way.
>>>>>>>>
>>>>>>>> Since DAGs are written as code, running arbitrary code inside Airflow may pose high security risks. Here are a few proposals to mitigate a security breach:
>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports pluggable authentication modules, where strong authentication such as Kerberos can be used.
>>>>>>>> 2. Execute DAG code as the API identity, i.e.
>>>>>>>> a DAG created through the API service will have run_as_user set to the API identity.
>>>>>>>> 3. To enforce data access control on DAGs, the API identity should also be used to access the data warehouse.
>>>>>>>>
>>>>>>>> We shared a demo based on a prototype implementation at the summit, and some details are described in this ppt <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>; we would love to get feedback and comments from the community about this initiative.
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> Mocheng

>>>>> --
>>>>> Constance Martineau
>>>>> Product Manager
>>>>>
>>>>> Email: consta...@astronomer.io
>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>
>>>>> <https://www.astronomer.io/>