Cool. I will make sure to include you! I think this is something that will happen in September; the holiday period is not the best time to organize it.
On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gmca...@gmail.com> wrote:

My use case needs automation and security: those are the two key requirements, and it does not have to be the REST API if there is another way that DAGs could be submitted to cloud storage securely. Sure, I would appreciate it if you could include me when organizing AIP-1 related meetings.

Kerberos is a ticket-based system in which a ticket has a limited lifetime. Using Kerberos, a workload could be authenticated before persistence so that Airflow executes it with its own Kerberos keytab, which is similar to the current implementation in the worker. Another possible scenario is that a persisted workload needs to include a renewable Kerberos TGT to be used by the Airflow worker, but this is more complex and I would be happy to discuss it further in the meetings. I will draft a more detailed document for review.

thanks
Mocheng

On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:

None of those requirements are supported by Airflow, and opening up the REST API does not solve the authentication use case you mentioned.

This is a completely new requirement you have. Basically, what you want is workflow identity, and it should be rather independent from the way the DAG is submitted. It would require attaching some kind of identity and signature, plus some way of making sure that the DAG has not been tampered with, so that the worker could use the identity when executing the workload and be sure that no one else modified the DAG, including any of the files that the DAG uses. This is an interesting case, but it has nothing to do with using or not using the REST API. The REST API alone will not give you the user identity guarantees that you need here. The distributed nature of Airflow basically requires that such workflow identity be provided by cryptographic signatures and by verifying the integrity of the DAG, rather than basing it on REST API authentication.

BTW, we already support Kerberos authentication for some of our operators, but the identity is necessarily per instance executing the workload, not per user submitting the DAG.

This could be one of the improvement proposals that could in the future become a sub-AIP of AIP-1 (Improve Airflow Security). If you are interested in leading and proposing such an AIP, I will soon (in a month or so) be re-establishing the #sig-multitenancy meetings (see AIP-1 for recordings and minutes of previous meetings). We already have AIP-43 and AIP-44 approved there (with AIP-43 close to completion), and the next steps should be introducing a fine-grained security layer for executing the workloads. Adding workload identity might be part of it. If you would like to work on that, you are most welcome. It means preparing and discussing proposals, getting consensus among the involved parties, leading it to a vote, and finally implementing it.

J
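[To make the "workflow identity" idea above concrete, here is a minimal sketch of what signing and verifying a DAG could look like, using Ed25519 keys from the `cryptography` package. The function names and the idea of shipping a signature alongside the DAG bytes are illustrative assumptions, not an existing Airflow API: the submitter signs the DAG contents with a key tied to their identity, and the worker verifies the signature against the registered public key before executing anything.]

```python
# Illustrative sketch only; Airflow has no such signing API today.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def sign_dag(dag_bytes: bytes, private_key: Ed25519PrivateKey) -> bytes:
    """Submitter side: bind the DAG contents to the submitter's identity."""
    return private_key.sign(dag_bytes)


def verify_dag(dag_bytes: bytes, signature: bytes, public_key: Ed25519PublicKey) -> bool:
    """Worker side: refuse to execute a DAG whose contents were tampered with."""
    try:
        public_key.verify(signature, dag_bytes)
        return True
    except InvalidSignature:
        return False


# Example with a freshly generated key pair standing in for a registered identity.
key = Ed25519PrivateKey.generate()
dag_source = b"from airflow import DAG\n# ... dag definition ..."
signature = sign_dag(dag_source, key)
assert verify_dag(dag_source, signature, key.public_key())
assert not verify_dag(dag_source + b"# tampered", signature, key.public_key())
```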
On Thu, Aug 18, 2022 at 02:44, Mocheng Guo <gmca...@gmail.com> wrote:

> Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?

Submitting DAGs directly to a cloud storage API does help with part of the use case, but cloud storage does not provide the security a data warehouse needs. A typical auth model supported in a data warehouse is Kerberos, and a data warehouse presents only a limited view to a Kerberos user through authorization rules. We need users to submit DAGs with identities supported by the data warehouse, so that Apache Spark jobs are executed as the Kerberos user who submitted the DAG, which in turn decides what data can be processed. There may also be a need to handle impersonation, so an additional layer is needed to handle data warehouse auth, e.g. Kerberos.

Assuming DAGs are already inside cloud storage, I think AIP-5/20 would work better than the current mono-repo model if it could support more flexibility and lower latency, and I would be very interested to be part of the design and implementation.
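[For a rough picture of the per-user execution described above, the Spark provider's SparkSubmitOperator already exposes Kerberos and impersonation arguments (keytab, principal, proxy_user, mapping to the corresponding spark-submit flags). The DAG id, principal, paths, and connection id below are made-up examples; whether these knobs cover the full requirement is an open question.]

```python
# Hypothetical DAG; principal, keytab path, application path, and conn_id are made up.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="per_user_spark_job",
    start_date=datetime(2022, 8, 1),
    schedule_interval=None,
) as dag:
    run_as_submitter = SparkSubmitOperator(
        task_id="run_as_submitter",
        conn_id="spark_default",
        application="/opt/jobs/feature_pipeline.py",
        # Authenticate the job to the warehouse as the submitting user ...
        principal="alice@EXAMPLE.COM",
        keytab="/etc/security/keytabs/alice.keytab",
        # ... or have a platform identity impersonate that user instead.
        proxy_user="alice",
    )
```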
On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> First, I appreciate everyone's valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples make me think that the security judgement should be about the actual DAGs rather than about how DAGs are accepted or about the process itself. To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API. There are certainly cases where DAGs may not be safe, e.g. the API service runs on a public cloud with shared tenants and has no knowledge of how DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when they are considered unsafe. Please let me know if the example makes sense; if there is common interest, having an Airflow-native write path would benefit the community instead of everyone building their own solution.

You seem to repeat more of the same. This is exactly what we want to avoid. If you can push code over the API, you can push ANY code. And precisely the "access control" you mentioned, or rejecting the call when "considering the code unsafe", are decisions we already deliberately decided we do not want the Airflow REST API to make. Whether the code is generated or not does not matter, because Airflow has no idea whatsoever whether it has been manipulated between the time it was generated and the time it was pushed. The only way Airflow can know that the code has not been manipulated is when it generates the DAG code on its own based on a declarative input. The limit is to push declarative information only. You CANNOT push code via the REST API. This is out of the question. The case is closed.

> The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating on how to automate the changes inside a notebook and programmatically submit DAGs through git + CI/CD while giving the user quick feedback? I understand git + CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD is a slow offline process with a large repo.

Case 2 is actually (if you read the article I posted above, it's written there) the case where a shared volume could still be used and is better. This is why it's great that Airflow supports multiple DAG syncing solutions: your "middle" environment does not have to use git sync, as it is not "production" (unless you want to mix development with testing, which is a terrible, terrible idea).

Your data scientist in the middle loop does:

a) cp my_dag.py "/my_middle_volume_shared_and_mounted_locally" - if you use a shared volume of some sort (NFS/EFS etc.)
b) aws s3 cp my_dag.py "s3://my-middle-testing-bucket/" - if your DAGs are on S3 and synced using S3 sync
c) gsutil cp my_dag.py "gs://my-bucket" - if your DAGs are on GCS and synced using GCS sync

Those are excellent "file push" APIs. They do the job. I cannot imagine why the middle-loop person would have a problem with using them. All of that can also be fully automated: they all have nice Python and other language APIs, so you can even make the IDE run those commands automatically on every save if you want.

Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?

J.
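[As one hedged example of that automation, a small helper could push a saved DAG file to the bucket that the environment syncs from. A minimal sketch, assuming S3 and boto3; the bucket name, prefix, and file name are placeholders, and the same idea applies to a shared volume or GCS.]

```python
# Hypothetical push-on-save helper; the bucket, prefix, and file name are placeholders.
from pathlib import Path

import boto3


def push_dag(local_path: str, bucket: str = "my-middle-testing-bucket", prefix: str = "dags/") -> None:
    """Upload a single DAG file so the S3-synced environment picks it up."""
    dag_file = Path(local_path)
    s3 = boto3.client("s3")
    s3.upload_file(str(dag_file), bucket, prefix + dag_file.name)


if __name__ == "__main__":
    # Could be wired to an IDE "on save" hook or called from a notebook cell.
    push_dag("my_dag.py")
```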
On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gmca...@gmail.com> wrote:

First, I appreciate everyone's valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples make me think that the security judgement should be about the actual DAGs rather than about how DAGs are accepted or about the process itself. To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API. There are certainly cases where DAGs may not be safe, e.g. the API service runs on a public cloud with shared tenants and has no knowledge of how DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when they are considered unsafe. Please let me know if the example makes sense; if there is common interest, having an Airflow-native write path would benefit the community instead of everyone building their own solution.

Hi Xiaodong/Jarek, regarding your suggestion, let me elaborate on a use case. Here are the three loops a data scientist goes through to develop a machine learning model:
- inner loop: iterate on the model locally.
- middle loop: iterate on the model on a remote cluster with production data; say this uses Airflow DAGs behind the scenes.
- outer loop: done with iteration and publish the model to production.

The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating on how to automate the changes inside a notebook and programmatically submit DAGs through git + CI/CD while giving the user quick feedback? I understand git + CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD is a slow offline process with a large repo.

Such a use case is pretty common for data scientists, and a better **online** service model would help open up more possibilities for Airflow and its users, as additional layers providing more value (like Constance mentioned, enabling users with no engineering or Airflow domain knowledge to use Airflow) could be built on top of Airflow, which remains a lower-level orchestration engine.

thanks
Mocheng

On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:

I really like Tomek's idea.

If we ever go for some "standard" declarative way of describing DAGs (which is not unlikely), all my security and packaging concerns are gone, and submitting such a declarative DAG via the API is quite viable. Simply submitting Python code this way is a no-go for me :). Such a declarative DAG could just be stored in the DB and scheduled and executed using only the "declaration" from the DB, without ever touching the DAG "folder" and without allowing the user to submit any executable code this way. All the code to execute would already have to be in Airflow in this case.

And I very much agree that this case can also be solved with Git. I think we generally undervalue the role Git plays in DAG distribution for Airflow.

I think when a user feels the need (and I very much understand that need, Constance) to submit a DAG via an API, rather than adding the option of submitting DAG code via the Airflow REST API, we should simply answer this:

*Use Git and git sync. "git push" then becomes the standard "API" you wanted for pushing the code.*

This has all the flexibility you need; it has integration with pull requests and CI workflows, keeps history, etc. When we tell people "use Git", we get ALL of that and more for free: standing on the shoulders of giants. If we start thinking about integrating code push via our own API, we basically start the journey of rewriting Git, as eventually we will have to support those cases. That makes absolutely no sense to me.

I am even starting to think that we should make "git sync" a separate (and much more viable) option that is pretty much the "main recommendation" for Airflow, rather than "yet another option" among shared folders and baked-in DAGs.

I recently wrote up my thoughts about this in the post "Shared Volumes in Airflow - the good, the bad and the ugly": https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca which has many more details on why I think so.

J.
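[Purely as an illustration of "git push as the API", a notebook or IDE save hook could drive git directly. A minimal sketch; the repository location, branch, and dags/ layout are assumptions, and any review or CI gates configured on the repository still apply.]

```python
# Illustrative only; the repository path, branch, and dags/ layout are assumptions.
import subprocess
from pathlib import Path


def push_dag_via_git(dag_path: str, repo: str = "~/airflow-dags", branch: str = "main") -> None:
    """Copy a DAG into the repo, commit, and push; git-sync on the Airflow side does the rest."""
    repo_dir = Path(repo).expanduser()
    target = repo_dir / "dags" / Path(dag_path).name
    target.write_bytes(Path(dag_path).read_bytes())

    for cmd in (
        ["git", "add", str(target)],
        ["git", "commit", "-m", f"Update {target.name}"],
        ["git", "push", "origin", branch],
    ):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```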
On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:

I understand the security concerns, and I generally agree, but as a regular user I always wished we could upload DAG files via an API. It opens the door to having an upload button, which would be nice. It would make Airflow a lot more accessible to non-engineering types.

I love the idea of implementing a manual review option; in conjunction with some sort of hook (similar to Airflow cluster policies), it would be a good middle ground. An administrator could use that hook to run checks against DAGs or run security scanners, and decide whether or not to enforce a review requirement.

--
Constance Martineau
Product Manager
Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
<https://www.astronomer.io/>

On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org> wrote:

In general I second what XD said. CI/CD feels better than sending DAG files over an API, and the security issues arising from accepting "any Python file" are probably quite big.

However, I think this proposal can be tightly related to "declarative DAGs". Instead of sending a DAG file, the user would send the DAG definition (operators, inputs, relations) in a predefined format that is not code. This of course has some limitations, like the inability to define custom macros or callbacks on the fly, but it may be a good compromise.

Another thought: if we implement something like "DAG via API", then we should consider adding an option to review DAGs (approval queue etc.) to reduce the security issues that are otherwise mitigated by, for example, deploying DAGs from git (where we have code review, security scanners etc.).

Cheers,
Tomek
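[As a sketch of what such a non-code submission might look like: a plain-data spec naming only operators the platform has registered, which Airflow itself would turn into a DAG. The field names and the registry are invented for illustration; no such format exists in Airflow today.]

```python
# Invented format for illustration; Airflow has no declarative submission spec today.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Only operators registered by the platform are allowed; users submit data, not code.
ALLOWED_OPERATORS = {"bash": BashOperator}

dag_spec = {
    "dag_id": "declarative_example",
    "start_date": "2022-08-01",
    "tasks": [
        {"task_id": "extract", "operator": "bash", "params": {"bash_command": "echo extract"}},
        {"task_id": "load", "operator": "bash", "params": {"bash_command": "echo load"}},
    ],
    "dependencies": [["extract", "load"]],  # upstream -> downstream pairs
}


def build_dag(spec: dict) -> DAG:
    """Turn a validated, declarative spec into a real DAG object on the Airflow side."""
    dag = DAG(
        dag_id=spec["dag_id"],
        start_date=datetime.fromisoformat(spec["start_date"]),
        schedule_interval=None,
    )
    tasks = {}
    for t in spec["tasks"]:
        operator_cls = ALLOWED_OPERATORS[t["operator"]]
        tasks[t["task_id"]] = operator_cls(task_id=t["task_id"], dag=dag, **t["params"])
    for upstream, downstream in spec["dependencies"]:
        tasks[upstream] >> tasks[downstream]
    return dag


dag = build_dag(dag_spec)
```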
On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:

Hi Mocheng,

Please allow me to share a question first: in your proposal, the API is still accepting an Airflow DAG as the payload (just binarized or compressed), right?

If that's the case, I may not be fully convinced: the objectives in your proposal are about automation and programmatically submitting DAGs. These can already be achieved efficiently through CI/CD practices plus a centralized place to manage your DAGs (e.g. a Git repo hosting the DAG files).

As you are already aware, allowing this via the API adds additional security concerns, and I doubt that it "breaks even".

Kindly let me know if I have missed anything or misunderstood your proposal. Thanks.

Regards,
XD
(This is not a contribution)

On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:

Hi Everyone,

I have an enhancement proposal for the REST API service. This is based on the observation that Airflow users want to be able to access Airflow more easily as a platform service.

The motivation comes from the following use cases:
1. Users like data scientists want to iterate over data quickly with interactive feedback in minutes, e.g. managing data pipelines inside a Jupyter notebook while executing them on a remote Airflow cluster.
2. Services targeting specific audiences can generate DAGs based on inputs like user commands or external triggers, and they want to be able to submit DAGs programmatically without manual intervention.

I believe such use cases would improve Airflow's usability and help it gain more popularity with users. The existing DAG repo brings considerable overhead for such scenarios: a shared repo requires offline processes and can be slow to roll out.

The proposal aims to provide an alternative where a DAG can be transmitted online. Here are some key points:
1. A DAG is packaged individually so that it is distributable over the network. For example, a DAG may be a serialized binary or a zip file.
2. The Airflow REST API is the ideal place to talk to the external world. The API would provide a generic interface to accept DAG artifacts and should be extensible to support different artifact formats if needed.
3. DAG persistence needs to be implemented, since such DAGs are not part of the DAG repository.
4. DAGs submitted via the API behave the same as DAGs defined in the repo, i.e. users write DAGs in the same syntax, and scheduling, execution, and the web server UI behave the same way.

Since DAGs are written as code, running arbitrary code inside Airflow may pose high security risks. Here are a few proposals to mitigate them:
1. Accept DAGs only from trusted parties. Airflow already supports pluggable authentication modules, where strong authentication such as Kerberos can be used.
2. Execute DAG code as the API identity, i.e. a DAG created through the API service will have run_as_user set to the API identity.
3. To enforce data access control on DAGs, the API identity should also be used to access the data warehouse.

We shared a demo based on a prototype implementation at the summit; some details are described in this ppt <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>, and we would love to get feedback and comments from the community about this initiative.

thanks
Mocheng
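[To ground points 1 and 2 of the proposal, a client submission under this model might look roughly like the sketch below; the endpoint path, authentication header, and artifact layout are hypothetical, since no such API exists in Airflow.]

```python
# Hypothetical client; the endpoint, token, and file names are placeholders.
import io
import zipfile

import requests


def submit_dag(dag_file: str, airflow_url: str, token: str) -> None:
    """Zip a single DAG file and POST it to a (hypothetical) DAG artifact endpoint."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.write(dag_file)
    buf.seek(0)

    resp = requests.post(
        f"{airflow_url}/api/v1/dags/artifacts",  # not a real Airflow endpoint
        headers={"Authorization": f"Bearer {token}"},
        files={"artifact": ("my_dag.zip", buf, "application/zip")},
        timeout=30,
    )
    resp.raise_for_status()
```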