Hi, I have also felt the need at times for creating DAGs through the REST API, but I understand the security concerns associated with such an implementation.
If not submission through the REST API, then at least some sort of *development mode* in the Airflow interface to create and edit DAGs for a user session. Not sure if this was brought up previously.

Thanks,
Nishant

On Thu, Aug 25, 2022 at 6:32 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Just in case - please watch the devlist for the announcement of the "SIG multitenancy" group, in case it slips my mind.
>
> On Thu, Aug 25, 2022 at 1:31 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Cool. I will make sure to include you! I think this is something that will happen in September; the holiday period is not the best time to organize it.
>>
>> On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>
>>> My use case needs automation and security: those are the two key requirements, and it does not have to be the REST API if there is another way that DAGs could be submitted to cloud storage securely. Sure, I would appreciate it if you could include me when organizing AIP-1 related meetings. Kerberos is a ticket-based system in which a ticket has a limited lifetime. Using Kerberos, a workload could be authenticated before persistence so that Airflow uses its Kerberos keytab to execute, which is similar to the current implementation in the worker; another possible scenario is that a persisted workload needs to include a renewable Kerberos TGT to be used by the Airflow worker, but this is more complex and I would be happy to discuss it further in meetings. I will draft a more detailed document for review.
>>>
>>> thanks
>>> Mocheng
>>>
>>> On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> None of those requirements are supported by Airflow. And opening the REST API does not solve the authentication use case you mentioned.
>>>>
>>>> This is a completely new requirement you have - basically what you want is workflow identity, and it should be rather independent of the way the DAG is submitted.
>>>> It would require attaching some kind of identity and a signature, and some way of making sure that the DAG has not been tampered with, in such a way that the worker could use the identity when executing the workload and be sure that no one else modified the DAG - including any of the files that the DAG uses. This is an interesting case, but it has nothing to do with using or not using the REST API. The REST API alone will not give you the user identity guarantees that you need here. The distributed nature of Airflow basically requires that such workflow identity be provided by cryptographic signatures and by verifying the integrity of the DAG, rather than basing it on REST API authentication.
>>>>
>>>> BTW. We already support Kerberos authentication for some of our operators, but the identity is necessarily per instance executing the workload - not per user submitting the DAG.
>>>>
>>>> This could be one of the improvement proposals that could in the future become a sub-AIP of AIP-1 (Improve Airflow Security). If you are interested in leading and proposing such an AIP, I will soon (in a month or so) be re-establishing #sig-multitenancy meetings (see AIP-1 for recordings and minutes of previous meetings). We already have AIP-43 and AIP-44 approved there (with AIP-43 close to completion), and the next steps should be introducing a fine-grained security layer for executing the workloads. Adding workload identity might be part of it. If you would like to work on that - you are most welcome. It means preparing and discussing proposals, getting consensus of the involved parties, leading it to a vote and finally implementing it.
>>>>
>>>> J
>>>>
>>>> On Thu, Aug 18, 2022 at 02:44 Mocheng Guo <gmca...@gmail.com> wrote:
>>>>
>>>>> >> Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?
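The workflow-identity idea Jarek sketches above - binding a submitter's identity to the exact DAG bytes with a cryptographic signature, and verifying integrity on the worker - can be illustrated in a few lines. This is an editorial sketch only: no such mechanism exists in Airflow today, the per-user key store is invented, and a real deployment would likely use asymmetric signatures rather than shared-secret HMAC:

```python
# Illustration only - Airflow does not ship DAG signing today.
# A submitter signs the DAG file with a key tied to their identity;
# the worker verifies both the identity and the file integrity
# before executing anything.
import hashlib
import hmac

SECRET_KEYS = {"alice": b"alice-secret-key"}  # hypothetical per-user keys

def sign_dag(identity: str, dag_source: bytes) -> str:
    """Produce a signature binding the identity to the exact DAG bytes."""
    return hmac.new(SECRET_KEYS[identity], dag_source, hashlib.sha256).hexdigest()

def verify_dag(identity: str, dag_source: bytes, signature: str) -> bool:
    """Reject the DAG if it was tampered with after signing."""
    expected = sign_dag(identity, dag_source)
    return hmac.compare_digest(expected, signature)

dag = b"from airflow import DAG  # ... dag definition ..."
sig = sign_dag("alice", dag)
assert verify_dag("alice", dag, sig)
assert not verify_dag("alice", dag + b"# tampered", sig)
```

Any modification to the DAG file (or, with a manifest, any file it uses) invalidates the signature, which is exactly the property REST-API authentication alone cannot give.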
>>>>>
>>>>> Submitting DAGs directly to a cloud storage API does help with some part of the use case requirement, but cloud storage does not provide the security a data warehouse needs. A typical auth model supported in a data warehouse is Kerberos, and a data warehouse provides a limited view to a Kerberos user through authorization rules. We need users to submit DAGs with identities supported by the data warehouse, so that Apache Spark jobs will be executed as the Kerberos user who submits a DAG, which in turn decides what data can be processed. There may also be a need to handle impersonation, so there needs to be an additional layer to handle data warehouse auth, e.g. Kerberos.
>>>>>
>>>>> Assuming DAGs are already inside the cloud storage, I think AIP-5/20 would work better than the current mono-repo model if it could support better flexibility and less latency, and I would be very interested to be part of the design and implementation.
>>>>>
>>>>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>> First, I appreciate all of you for your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples let me think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted, or on a process itself. To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API.
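The "validated input → generated DAG" pattern described above can be sketched as follows. This is a hypothetical illustration: the validation rule and template are invented, and the operator import inside the template is only example text that is never executed here - the point is that the user supplies data (SQL), not code:

```python
# Hypothetical sketch of a DAG-generating service: user input is
# validated first, and the emitted DAG is rendered from a fixed
# template using only vetted operators, so no arbitrary user code
# ever reaches Airflow.
ALLOWED_PREFIXES = ("SELECT",)  # illustrative validation rule

DAG_TEMPLATE = '''\
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(dag_id={dag_id!r}, schedule=None) as dag:
    SQLExecuteQueryOperator(task_id="run_query", sql={sql!r})
'''

def generate_dag(dag_id: str, user_sql: str) -> str:
    """Validate the input, then render a DAG from the fixed template."""
    if not user_sql.lstrip().upper().startswith(ALLOWED_PREFIXES):
        raise ValueError("only SELECT statements are accepted")
    return DAG_TEMPLATE.format(dag_id=dag_id, sql=user_sql)

source = generate_dag("daily_report", "SELECT * FROM sales")
assert "SELECT * FROM sales" in source
```

Because the template is fixed, the generated file is only as trustworthy as the path between generation and ingestion - which is Jarek's tampering objection below.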
>>>>>> There are certainly cases where DAGs may not be safe, e.g. an API service on a public cloud with shared tenants and no knowledge of how the DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when they are considered unsafe. Please let me know if the example makes sense, and if there is common interest - having an Airflow-native write path would benefit the community instead of everyone building their own solution.
>>>>>>
>>>>>> > You seem to repeat more of the same. This is exactly what we want to avoid. IF you can push code over the API, you can push Any Code. And precisely the "access control" you mentioned, or rejecting the call when "considering code unsafe" - those are the decisions we already deliberately decided we do not want the Airflow REST API to make. Whether the code is generated or not does not matter, because Airflow has no idea whatsoever whether it has been tampered with between the time it was generated and pushed. The only way Airflow can know that the code has not been tampered with is when it generates the DAG code on its own, based on a declarative input. The limit is to push declarative information only. You CANNOT push code via the REST API. This is out of the question. The case is closed.
>>>>>>
>>>>>> The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating on how to automate the changes inside a notebook and programmatically submit DAGs through git + CI/CD while giving the user quick feedback? I understand git + CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g.
>>>>>> a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD runs are slow offline processes with a large repo.
>>>>>>
>>>>>> Case 2 is actually (if you attempt to read my article I posted above, it's written there) the case where shared volumes could still be used, and they are better. This is why it's great that Airflow supports multiple DAG syncing solutions, because your "middle" environment does not have to have git sync, as it is not "production" (unless you want to mix development with testing, that is, which is a terrible, terrible idea).
>>>>>>
>>>>>> Your data scientist in the middle loop does:
>>>>>>
>>>>>> a) cp my_dag.py "/my_middle_volume_shared_and_mounted_locally" - if you use a shared volume of some sort (NFS/EFS etc.)
>>>>>> b) aws s3 cp my_dag.py "s3://my-middle-testing-bucket/" - if your DAGs are on S3 and synced using s3-sync
>>>>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your DAGs are on GCS and synced using gcs-sync
>>>>>>
>>>>>> Those are excellent "file push" APIs. They do the job. I cannot imagine why the middle-loop person might have a problem with using them. All of that can also be fully automated - they all have nice Python and other-language APIs, so you can even make the IDE run those commands automatically on every save if you want.
>>>>>>
>>>>>> Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?
>>>>>>
>>>>>> J.
>>>>>>
>>>>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gmca...@gmail.com> wrote:
>>>>>>
>>>>>>> First, I appreciate all of you for your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples let me think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted, or on a process itself.
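The a)/b)/c) push commands above are easy to wrap in a small helper so a notebook or IDE can run them on every save. A minimal sketch, with illustrative target names - the actual copying is delegated to the cp/aws/gsutil tools Jarek mentions:

```python
# Minimal sketch: pick the right "file push" command for wherever
# the middle-loop environment syncs its DAGs from. Bucket and volume
# names are illustrative; the real copy is done by cp/aws/gsutil.
import subprocess

def push_command(dag_file: str, target: str) -> list[str]:
    """Build the copy command for a shared volume, S3, or GCS target."""
    if target.startswith("s3://"):
        return ["aws", "s3", "cp", dag_file, target]
    if target.startswith("gs://"):
        return ["gsutil", "cp", dag_file, target]
    return ["cp", dag_file, target]  # plain shared volume (NFS/EFS etc.)

def push(dag_file: str, target: str) -> None:
    """Run the push; requires the relevant CLI tool on PATH."""
    subprocess.run(push_command(dag_file, target), check=True)

assert push_command("my_dag.py", "s3://my-middle-testing-bucket/")[:3] == ["aws", "s3", "cp"]
assert push_command("my_dag.py", "/mnt/shared-dags")[0] == "cp"
```

Hooked into an on-save callback, this gives the "save in notebook → DAG appears in the middle environment" loop without any new Airflow API.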
>>>>>>> To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API. There are certainly cases where DAGs may not be safe, e.g. an API service on a public cloud with shared tenants and no knowledge of how the DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when they are considered unsafe. Please let me know if the example makes sense, and if there is common interest - having an Airflow-native write path would benefit the community instead of everyone building their own solution.
>>>>>>>
>>>>>>> Hi Xiaodong/Jarek, regarding your suggestion, let me elaborate on a use case. Here are the three loops a data scientist goes through to develop a machine learning model:
>>>>>>> - inner loop: iterates on the model locally.
>>>>>>> - middle loop: iterates the model on a remote cluster with production data; say it's using Airflow DAGs behind the scenes.
>>>>>>> - outer loop: done with iteration, publishes the model to production.
>>>>>>> The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating on how to automate the changes inside a notebook and programmatically submit DAGs through git + CI/CD while giving the user quick feedback? I understand git + CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD runs are slow offline processes with a large repo.
>>>>>>>
>>>>>>> Such a use case is pretty common for data scientists, and a better **online** service model would help open up more possibilities for Airflow and its users, as additional layers providing more value (like Constance mentioned, enabling users with no engineering or Airflow domain knowledge to use Airflow) could be built on top of Airflow, which remains a lower-level orchestration engine.
>>>>>>>
>>>>>>> thanks
>>>>>>> Mocheng
>>>>>>>
>>>>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>>>
>>>>>>>> I really like the idea of Tomek.
>>>>>>>>
>>>>>>>> If we ever go for (which is not unlikely) some "standard" declarative way of describing DAGs, all my security and packaging concerns are gone - and submitting such a declarative DAG via an API is quite viable. Simply submitting Python code this way is a no-go for me :). Such a declarative DAG could be just stored in the DB, and scheduled and executed using only the "declaration" from the DB - without ever touching the DAG "folder" and without allowing the user to submit any executable code this way. All the code to execute would have to be in Airflow already in this case.
>>>>>>>>
>>>>>>>> And I very much agree also that this case can be solved with Git. I think we are generally undervaluing the role Git plays for DAG distribution in Airflow.
>>>>>>>>
>>>>>>>> I think when a user feels the need (I very much understand the need, Constance) to submit the DAG via an API, rather than adding the option of submitting DAG code via the "Airflow REST API", we should simply answer this:
>>>>>>>>
>>>>>>>> *Use Git and git sync.
>>>>>>>> "git push" then becomes the standard "API" you wanted for pushing the code.*
>>>>>>>>
>>>>>>>> This has all the flexibility you need; it integrates with pull requests and CI workflows, keeps history, etc. When we tell people "use Git", we get ALL of that and more for free - standing on the shoulders of giants. If we start thinking about integrating code push via our own API, we basically start the journey of rewriting Git, as eventually we will have to support those cases. This makes absolutely no sense to me.
>>>>>>>>
>>>>>>>> I even start to think that we should make "git sync" a separate (and much more visible) option that is pretty much the "main recommendation" for Airflow, rather than "yet another option among shared folders and baked-in DAGs".
>>>>>>>>
>>>>>>>> I recently wrote up my thoughts about this in the post "Shared Volumes in Airflow - the good, the bad and the ugly": https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca which has much more detail on why I think so.
>>>>>>>>
>>>>>>>> J.
>>>>>>>>
>>>>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
>>>>>>>>
>>>>>>>>> I understand the security concerns, and generally agree, but as a regular user I always wished we could upload DAG files via an API. It opens the door to having an upload button, which would be nice. It would make Airflow a lot more accessible to non-engineering types.
>>>>>>>>>
>>>>>>>>> I love the idea: implementing a manual review option in conjunction with some sort of hook (similar to Airflow cluster policies) would be a good middle ground.
>>>>>>>>> An administrator could use that hook to run checks or security scanners against DAGs, and decide whether or not to implement a review requirement.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> In general I second what XD said. CI/CD feels better than sending DAG files over an API, and the security issues arising from accepting "any Python file" are probably quite big.
>>>>>>>>>>
>>>>>>>>>> However, I think this proposal can be tightly related to "declarative DAGs". Instead of sending a DAG file, the user would send the DAG definition (operators, inputs, relations) in a predefined format that is not code. This of course has some limitations, like the inability to define custom macros or callbacks on the fly, but it may be a good compromise.
>>>>>>>>>>
>>>>>>>>>> Other thought - if we implement something like "DAG via API", then we should consider adding an option to review DAGs (approval queue etc.) to reduce the security issues that are otherwise mitigated by, for example, deploying DAGs from git (where we have code review, security scanners etc.).
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Tomek
>>>>>>>>>>
>>>>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mocheng,
>>>>>>>>>>>
>>>>>>>>>>> Please allow me to share a question first: in your proposal, the API is still accepting an Airflow DAG as the payload (just binarized or compressed), right?
>>>>>>>>>>>
>>>>>>>>>>> If that's the case, I may not be fully convinced: the objectives in your proposal are automation and programmatically submitting DAGs.
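The hook-plus-review-queue idea Constance and Tomasz describe above might look roughly like this. It is an illustration only - no such hook exists in Airflow, and the function name and deny-list are invented - but it follows the spirit of cluster policies: an admin-supplied callable that inspects a submitted DAG and routes it to auto-approval or a manual review queue:

```python
# Illustration of a review hook (not an existing Airflow API):
# an admin-supplied callable, similar in spirit to cluster policies,
# inspects a submitted DAG's source and decides whether it can be
# auto-approved or must wait in a manual review queue.
import re

# Invented deny-list: patterns an administrator might flag for review.
DISALLOWED = [r"\bsubprocess\b", r"\bos\.system\b", r"\beval\(", r"\bexec\("]

def review_hook(dag_source: str) -> str:
    """Return 'approved' or 'needs-review' for a submitted DAG."""
    for pattern in DISALLOWED:
        if re.search(pattern, dag_source):
            return "needs-review"
    return "approved"

assert review_hook("from airflow import DAG\n") == "approved"
assert review_hook("import subprocess\n") == "needs-review"
```

A real deployment would pair such a hook with external scanners and an approval UI; static pattern matching alone cannot make Python code safe, which is why the thread treats review as a mitigation rather than a guarantee.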
>>>>>>>>>>> These can already be achieved in an efficient way through CI/CD practice plus a centralized place to manage your DAGs (e.g. a Git repo to host the DAG files).
>>>>>>>>>>>
>>>>>>>>>>> As you are already aware, allowing this via an API adds additional security concerns, and I doubt it would "break even".
>>>>>>>>>>>
>>>>>>>>>>> Kindly let me know if I have missed anything or misunderstood your proposal. Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> XD
>>>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>>>> (This is not a contribution)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I have an enhancement proposal for the REST API service. This is based on the observation that Airflow users want to be able to access Airflow more easily as a platform service.
>>>>>>>>>>>>
>>>>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>>>>> 1. Users like data scientists want to iterate over data quickly with interactive feedback in minutes, e.g. managing data pipelines inside a Jupyter notebook while executing them in a remote Airflow cluster.
>>>>>>>>>>>> 2. Services targeting specific audiences can generate DAGs based on inputs like user commands or external triggers, and they want to be able to submit DAGs programmatically without manual intervention.
>>>>>>>>>>>>
>>>>>>>>>>>> I believe such use cases would help improve Airflow's usability and gain it more popularity. The existing DAG repo brings considerable overhead for such scenarios; a shared repo requires offline processes and can be slow to roll out.
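Tomasz's "declarative DAG" suggestion quoted above - sending a definition (operators, inputs, relations) in a predefined non-code format - could be sketched like this. The payload format and operator allow-list are invented for illustration; the point is that the server validates pure data and builds tasks only from operators it already ships:

```python
# Invented illustration of a "declarative DAG" payload: the client
# sends data (not code), and the server accepts tasks only from an
# allow-list of operator kinds that Airflow itself already provides.
ALLOWED_OPERATORS = {"bash", "sql"}  # hypothetical vetted operator kinds

def validate_declarative_dag(spec: dict) -> list[str]:
    """Check a declarative spec; return the task ids it defines."""
    tasks = spec["tasks"]
    for task in tasks:
        if task["operator"] not in ALLOWED_OPERATORS:
            raise ValueError(f"operator {task['operator']!r} not allowed")
    ids = {t["id"] for t in tasks}
    for task in tasks:
        for upstream in task.get("after", []):  # declared relations
            if upstream not in ids:
                raise ValueError(f"unknown upstream task {upstream!r}")
    return [t["id"] for t in tasks]

spec = {
    "dag_id": "declarative_example",
    "tasks": [
        {"id": "extract", "operator": "sql", "sql": "SELECT 1"},
        {"id": "report", "operator": "bash", "command": "echo done", "after": ["extract"]},
    ],
}
assert validate_declarative_dag(spec) == ["extract", "report"]
```

Since the payload is data rather than code, it can be stored in the DB and materialized by Airflow itself - which is what makes Jarek's tampering objection moot for this variant.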
>>>>>>>>>>>>
>>>>>>>>>>>> The proposal aims to provide an alternative where a DAG can be transmitted online. Here are some key points:
>>>>>>>>>>>> 1. A DAG is packaged individually so that it can be distributed over the network. For example, a DAG may be a serialized binary or a zip file.
>>>>>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the external world. The API would provide a generic interface to accept DAG artifacts, and should be extensible to support different artifact formats if needed.
>>>>>>>>>>>> 3. DAG persistence needs to be implemented, since these DAGs are not part of the DAG repository.
>>>>>>>>>>>> 4. Same behavior for DAGs submitted via the API vs. those defined in the repo, i.e. users write DAGs in the same syntax, and their scheduling, execution, and web server UI should behave the same way.
>>>>>>>>>>>>
>>>>>>>>>>>> Since DAGs are written as code, running arbitrary code inside Airflow may pose high security risks. Here are a few proposals to mitigate them:
>>>>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports pluggable authentication modules where strong authentication such as Kerberos can be used.
>>>>>>>>>>>> 2. Execute DAG code as the API identity, i.e. a DAG created through the API service will have run_as_user set to the API identity.
>>>>>>>>>>>> 3. To enforce data access control on DAGs, the API identity should also be used to access the data warehouse.
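Points 1 and 2 of the packaging proposal above, together with the run_as_user idea from the security list, might be sketched like this. The helper names are hypothetical and no such upload endpoint exists in Airflow; only run_as_user itself is an existing Airflow task argument:

```python
# Sketch of the proposal's packaging and identity-stamping steps
# (hypothetical helpers - Airflow has no DAG-upload endpoint):
# package a DAG file as a zip artifact, and record the authenticated
# API identity so tasks later run as that user.
import io
import zipfile

def package_dag(filename: str, dag_source: str) -> bytes:
    """Zip a single DAG file into a network-distributable artifact."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(filename, dag_source)
    return buf.getvalue()

def default_args_for(identity: str) -> dict:
    """Stamp the submitting identity onto default_args, so tasks run
    as that user via Airflow's run_as_user task argument."""
    return {"owner": identity, "run_as_user": identity}

artifact = package_dag("my_dag.py", "from airflow import DAG\n")
names = zipfile.ZipFile(io.BytesIO(artifact)).namelist()
assert names == ["my_dag.py"]
assert default_args_for("alice")["run_as_user"] == "alice"
```

The zip format matches what Airflow's DAG loader already accepts for packaged DAGs; the open question the thread raises is how the receiving side verifies that the artifact and identity were not tampered with in transit.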
>>>>>>>>>>>>
>>>>>>>>>>>> We shared a demo based on a prototype implementation at the summit, and some details are described in this ppt <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>. We would love to get feedback and comments from the community about this initiative.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks
>>>>>>>>>>>> Mocheng
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Constance Martineau
>>>>>>>>> Product Manager
>>>>>>>>>
>>>>>>>>> Email: consta...@astronomer.io
>>>>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>>>>
>>>>>>>>> <https://www.astronomer.io/>