Just in case - please watch the devlist for the announcement of the "SIG multitenancy" group if it slips my mind.
On Thu, Aug 25, 2022 at 1:31 PM Jarek Potiuk <ja...@potiuk.com> wrote:

Cool. I will make sure to include you! I think this is something that will happen in September; the holiday period is not the best time to organize it.

On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gmca...@gmail.com> wrote:

My use case needs automation and security: those are the two key requirements, and it does not have to be the REST API if there is another way that DAGs could be submitted to cloud storage securely. Sure, I would appreciate it if you could include me when organizing AIP-1 related meetings.

Kerberos is a ticket-based system in which a ticket has a limited lifetime. Using Kerberos, a workload could be authenticated before persistence so that Airflow uses its own Kerberos keytab to execute it, which is similar to the current implementation in the worker. Another possible scenario is that a persisted workload needs to include a renewable Kerberos TGT to be used by the Airflow worker, but this is more complex and I would be happy to discuss it further in the meetings. I will draft a more detailed document for review.

thanks
Mocheng

On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:

None of those requirements are supported by Airflow. And opening the REST API does not solve the authentication use case you mentioned.

This is a completely new requirement you have - basically what you want is workflow identity, and it should be rather independent from the way the DAG is submitted. It would require attaching some kind of identity and signature, and some way of making sure that the DAG has not been tampered with, in a way that the worker could use the identity when executing the workload and be sure that no one else modified the DAG - including any of the files that the DAG uses. This is an interesting case, but it has nothing to do with using or not using the REST API. The REST API alone will not give you the user identity guarantees that you need here. The distributed nature of Airflow basically requires that such workflow identity be provided by cryptographic signatures and by verifying the integrity of the DAG, rather than basing it on REST API authentication.

BTW. We do already support Kerberos authentication for some of our operators, but the identity is necessarily per instance executing the workload - not per user submitting the DAG.

This could be one of the improvement proposals that could in the future become a sub-AIP of AIP-1 (Improve Airflow Security). If you are interested in leading and proposing such an AIP, I will soon (in a month or so) be re-establishing the #sig-multitenancy meetings (see AIP-1 for recordings and minutes of previous meetings). We already have AIP-43 and AIP-44 approved there (and AIP-43 is close to completion), and the next steps should be introducing a fine-grained security layer for executing the workloads. Adding workload identity might be part of it. If you would like to work on that - you are most welcome. It means preparing and discussing proposals, getting consensus of the involved parties, leading it to a vote and finally implementing it.

J
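(For illustration only - a minimal Python sketch of the kind of integrity check described above, assuming a signing key fetched from some secrets backend and a detached ".sig" file next to each DAG; none of this exists in Airflow today, and a shared-key HMAC only proves integrity - real per-user workflow identity would need asymmetric, per-user signatures.)

    import hashlib
    import hmac
    from pathlib import Path

    SIGNING_KEY = b"fetch-me-from-a-secrets-backend"  # placeholder assumption

    def sign_dag(dag_path: str) -> None:
        # Written by the submitting side right after the DAG file is produced.
        digest = hmac.new(SIGNING_KEY, Path(dag_path).read_bytes(), hashlib.sha256)
        Path(dag_path + ".sig").write_text(digest.hexdigest())

    def verify_dag(dag_path: str) -> bool:
        # Checked by the worker before executing anything from the DAG file;
        # a mismatch means the file changed after it was signed.
        expected = Path(dag_path + ".sig").read_text().strip()
        actual = hmac.new(SIGNING_KEY, Path(dag_path).read_bytes(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, actual)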
On Thu, Aug 18, 2022 at 02:44, Mocheng Guo <gmca...@gmail.com> wrote:

> Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?

Submitting DAGs directly to a cloud storage API does help for some part of the use case requirement, but cloud storage does not provide the security a data warehouse needs. A typical auth model supported in a data warehouse is Kerberos, and a data warehouse provides a limited view to a Kerberos user based on authorization rules. We need users to submit DAGs with identities supported by the data warehouse, so that Apache Spark jobs will be executed as the Kerberos user who submits a DAG, which in turn decides what data can be processed. There may also be a need to handle impersonation, so there needs to be an additional layer to handle data warehouse auth, e.g. Kerberos.

Assuming DAGs are already inside cloud storage, I think AIP-5/20 would work better than the current mono-repo model if it could support better flexibility and less latency, and I would be very interested in being part of the design and implementation.
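(For context, the worker-side keytab flow referred to here boils down to obtaining a ticket non-interactively before the task touches the warehouse - Airflow's own ticket renewer, the `airflow kerberos` command, does essentially this for the worker's keytab. A minimal sketch, with a made-up principal and keytab path:)

    import subprocess

    def kinit_from_keytab(principal: str, keytab_path: str) -> None:
        # "kinit -kt <keytab> <principal>" obtains a TGT without prompting for a
        # password; the ticket then expires after its configured lifetime.
        subprocess.run(["kinit", "-kt", keytab_path, principal], check=True)

    kinit_from_keytab("analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab")

The open question in the thread is precisely whose principal this should be: the worker's (what exists today) or the submitting user's (what is being asked for).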
On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> First appreciate all for your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples let me think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted or on the process itself. To expand a little bit more on another example, say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API. There are certainly cases where DAGs may not be safe, e.g. an API service on public cloud with shared tenants and no knowledge of how the DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when considered unsafe. Please let me know if the example makes sense, and if there is a common interest, having an Airflow-native write path would benefit the community instead of each party building its own solution.

You seem to repeat more of the same. This is exactly what we want to avoid. IF you can push code over the API you can push ANY code. And precisely the "access control" you mentioned, or rejecting the call when "considering code unsafe" - those are decisions we already deliberately decided we do not want the Airflow REST API to make. Whether the code is generated or not does not matter, because Airflow has no idea whatsoever whether it has been manipulated between the time it was generated and the time it was pushed. The only way Airflow can know that the code has not been manipulated is when it generates the DAG code on its own based on a declarative input. The limit is to push declarative information only. You CANNOT push code via the REST API. This is out of the question. The case is closed.

> The middle loop usually happens on a Jupyter notebook, it needs to change data/features used by the model frequently, which in turn leads to Airflow DAG updates. Do you mind elaborating how to automate the changes inside a notebook and programmatically submit DAGs through git+CI/CD while giving the user quick feedback? I understand git+CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for other alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD are slow offline processes with a large repo.

Case 2 is actually (if you attempt to read my article I posted above, it's written there) the case where shared volumes could still be used and are better. This is why it's great that Airflow supports multiple DAG-syncing solutions, because your "middle" environment does not have to use git sync, as it is not "production" (unless you want to mix development with testing, that is, which is a terrible, terrible idea).

Your data scientist in the middle loop does:

a) cp my_dag.py "/my_middle_volume_shared_and_mounted_locally" - if you use a shared volume of some sort (NFS/EFS etc.)
b) aws s3 cp my_dag.py "s3://my-middle-testing-bucket/" - if your DAGs are on S3 and synced using s3-sync
c) gsutil cp my_dag.py "gs://my-bucket" - if your DAGs are on GCS and synced using GCS sync

Those are excellent "file push" APIs. They do the job. I cannot imagine why the middle-loop person might have a problem with using them. All of that can also be fully automated - they all have nice Python and other-language APIs, so you can even make the IDE run those commands automatically on every save if you want.
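(A minimal sketch of what that automation could look like in Python for the S3 case, using boto3; the bucket name and key prefix are placeholders, and the same idea works with the GCS or local-volume variants above:)

    import boto3

    def push_dag(local_path: str, bucket: str = "my-middle-testing-bucket") -> None:
        # Equivalent of "aws s3 cp my_dag.py s3://<bucket>/dags/"; an s3-sync
        # process on the Airflow side then picks the file up like any other DAG.
        s3 = boto3.client("s3")
        s3.upload_file(local_path, bucket, "dags/" + local_path.rsplit("/", 1)[-1])

    push_dag("my_dag.py")  # could be wired to an on-save hook in the IDE or notebook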
Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?

J.

On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gmca...@gmail.com> wrote:

First appreciate all for your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples let me think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted or on the process itself. To expand a little bit more on another example, say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API. There are certainly cases where DAGs may not be safe, e.g. an API service on public cloud with shared tenants and no knowledge of how the DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when considered unsafe. Please let me know if the example makes sense, and if there is a common interest, having an Airflow-native write path would benefit the community instead of each party building its own solution.

Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case. Here are the three loops a data scientist goes through to develop a machine learning model:
- inner loop: iterate on the model locally.
- middle loop: iterate on the model on a remote cluster with production data; say it uses Airflow DAGs behind the scenes.
- outer loop: done with iteration and publish the model to production.

The middle loop usually happens on a Jupyter notebook, it needs to change data/features used by the model frequently, which in turn leads to Airflow DAG updates. Do you mind elaborating how to automate the changes inside a notebook and programmatically submit DAGs through git+CI/CD while giving the user quick feedback? I understand git+CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for other alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD are slow offline processes with a large repo.

Such a use case is pretty common for data scientists, and a better **online** service model would help open up more possibilities for Airflow and its users, as additional layers providing more value (like Constance mentioned, enabling users with no engineering or Airflow domain knowledge to use Airflow) could be built on top of Airflow, which remains a lower-level orchestration engine.

thanks
Mocheng

On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:

I really like the idea of Tomek.

If we ever go for (which is not unlikely) some "standard" declarative way of describing DAGs, all my security and packaging concerns are gone - and submitting such a declarative DAG via API is quite viable. Simply submitting Python code this way is a no-go for me :). Such a declarative DAG could be just stored in the DB and scheduled and executed using only the "declaration" from the DB - without ever touching the DAG "folder" and without allowing the user to submit any executable code this way. All the code to execute would already have to be in Airflow in this case.

And I very much agree also that this case can be solved with Git. I think we are generally undervaluing the role Git plays for DAG distribution in Airflow.

I think when the user feels the need (I very much understand the need, Constance) to submit the DAG via API, rather than adding the option of submitting the DAG code via the "Airflow REST API", we should simply answer this:

*Use Git and git sync. "git push" then becomes the standard "API" you wanted for pushing the code.*

This has all the flexibility you need; it has integration with Pull Requests, CI workflows, keeps history, etc. etc. When we tell people "use Git" - we have ALL of that and more for free. Standing on the shoulders of giants. If we start thinking about integrating code push via our own API - we basically start the journey of rewriting Git, as eventually we will have to support those cases. This makes absolutely no sense to me.

I even start to think that we should make "git sync" a separate (and much more viable) option that is pretty much the "main recommendation" for Airflow, rather than "yet another option" among the shared-folder and baked-in-DAGs cases.

I recently even wrote my thoughts about it in this post: "Shared Volumes in Airflow - the good, the bad and the ugly": https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca which has much more detail on why I think so.

J.
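(To make the "declarative only" idea concrete - a purely hypothetical payload sketched in Python; the schema is invented for this example and does not exist in Airflow. The point is that it names operators already installed on the Airflow side and carries no executable code:)

    # A declarative DAG submission as plain data, not code.
    declarative_dag = {
        "dag_id": "daily_sales_report",
        "schedule": "0 6 * * *",
        "tasks": [
            {
                "task_id": "extract",
                "operator": "SQLExecuteQueryOperator",   # resolved server-side
                "params": {"conn_id": "warehouse", "sql": "SELECT ..."},
            },
            {
                "task_id": "notify",
                "operator": "EmailOperator",             # resolved server-side
                "params": {"to": "team@example.com", "subject": "Report ready"},
            },
        ],
        "dependencies": [["extract", "notify"]],
    }
    # Stored in the DB as-is; the scheduler builds the real DAG from it using
    # only operators that are already part of the Airflow deployment.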
On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:

I understand the security concerns, and generally agree, but as a regular user I always wished we could upload DAG files via an API. It opens the door to having an upload button, which would be nice. It would make Airflow a lot more accessible to non-engineering types.

I love the idea of implementing a manual review option; in conjunction with some sort of hook (similar to Airflow cluster policies) it would be a good middle ground. An administrator could use that hook to run checks against DAGs or run security scanners, and decide whether or not to implement a review requirement.

On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org> wrote:

In general I second what XD said. CI/CD feels better than sending DAG files over an API, and the security issues arising from accepting "any Python file" are probably quite big.

However, I think this proposal can be tightly related to "declarative DAGs". Instead of sending a DAG file, the user would send the DAG definition (operators, inputs, relations) in a predefined format that is not code. This of course has some limitations, like the inability to define custom macros or callbacks on the fly, but it may be a good compromise.

Other thought - if we implement something like "DAG via API" then we should consider adding an option to review DAGs (approval queue etc.) to reduce the security issues that are otherwise mitigated by, for example, deploying DAGs from git (where we have code review, security scanners etc.).

Cheers,
Tomek
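(The review hook Constance and Tomek mention could plug into the existing cluster-policy mechanism. A minimal sketch of a dag_policy in airflow_local_settings.py - the "api_submitted"/"reviewed" tag convention is invented here purely for illustration:)

    from airflow.exceptions import AirflowClusterPolicyViolation
    from airflow.models import DAG

    def dag_policy(dag: DAG) -> None:
        # Airflow calls dag_policy for every DAG at parse time; raising
        # AirflowClusterPolicyViolation rejects the DAG from the deployment.
        tags = set(dag.tags or [])
        if "api_submitted" in tags and "reviewed" not in tags:
            raise AirflowClusterPolicyViolation(
                f"DAG {dag.dag_id} was submitted via the API but has not been reviewed"
            )

An approval queue would then just be whatever process adds the "reviewed" marker.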
On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:

Hi Mocheng,

Please allow me to share a question first: so in your proposal, the API in your plan is still accepting an Airflow DAG as the payload (just binarized or compressed), right?

If that's the case, I may not be fully convinced: the objectives in your proposal are about automation & programmatically submitting DAGs. These can already be achieved in an efficient way through CI/CD practices + a centralized place to manage your DAGs (e.g. a Git repo to host the DAG files).

As you are already aware, allowing this via the API adds additional security concerns, and I would doubt whether that "breaks even".

Kindly let me know if I have missed anything or misunderstood your proposal. Thanks.

Regards,
XD
----------------------------------------------------------------
(This is not a contribution)

On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:

Hi Everyone,

I have an enhancement proposal for the REST API service. This is based on the observation that Airflow users want to be able to access Airflow more easily as a platform service.

The motivation comes from the following use cases:
1. Users like data scientists want to iterate over data quickly with interactive feedback in minutes, e.g. managing data pipelines inside a Jupyter Notebook while executing them in a remote Airflow cluster.
2. Services targeting specific audiences can generate DAGs based on inputs like user commands or external triggers, and they want to be able to submit DAGs programmatically without manual intervention.

I believe such use cases would help promote Airflow usability and gain more popularity with users. The existing DAG repo brings considerable overhead for such scenarios; a shared repo requires offline processes and can be slow to roll out.

The proposal aims to provide an alternative where a DAG can be transmitted online, and here are some key points:
1. A DAG is packaged individually so that it is distributable over the network. For example, a DAG may be a serialized binary or a zip file.
2. The Airflow REST API is the ideal place to talk with the external world. The API would provide a generic interface to accept DAG artifacts and should be extensible to support different artifact formats if needed.
3. DAG persistence needs to be implemented since such DAGs are not part of the DAG repository.
4. DAGs submitted via the API behave the same as those defined in the repo, i.e. users write DAGs in the same syntax, and their scheduling, execution, and web server UI should behave the same way.

Since DAGs are written as code, running arbitrary code inside Airflow may pose high security risks. Here are a few proposals to prevent a security breach:
1. Accept DAGs only from trusted parties. Airflow already supports pluggable authentication modules where strong authentication such as Kerberos can be used.
2. Execute DAG code as the API identity, i.e. a DAG created through the API service will have run_as_user set to the API identity.
3. To enforce data access control on DAGs, the API identity should also be used to access the data warehouse.

We shared a demo based on a prototype implementation at the summit; some details are described in this ppt <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>, and we would love to get feedback and comments from the community about this initiative.

thanks
Mocheng

--

Constance Martineau
Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)

<https://www.astronomer.io/>