Re: [Proposal] Creating DAG through the REST api

Jarek Potiuk Fri, 12 Aug 2022 10:56:44 -0700

> First appreciate all for your valuable feedback. Airflow by design has to
> accept code, both Tomasz and Constance's examples let me think that the
> security judgement should be on the actual DAGs rather than how DAGs are
> accepted or a process itself. To expand a little bit more on another
> example, say another service provides an API which can be invoked by its
> clients the service validates user inputs e.g. SQL and generates Airflow
> DAGs which use the validated operators/macros. Those DAGs are safe to be
> pushed through the API. There are certainly cases that DAGs may not be
> safe, e.g the API service on public cloud with shared tenants with no
> knowledge how DAGs are generated, in such cases the API service can access
> control the identity or even reject all calls when considered unsafe.
> Please let me know if the example makes sense, and if there is a common
> interest, having an Airflow native write path would benefit the community
> instead of each building its own solution.


You seem to repeat more of the same. This is exactly what we want to
avoid. If you can push code over the API, you can push ANY code. And
precisely the "access control" you mentioned, or rejecting the call when
"considering code unsafe", are decisions we have already deliberately
decided we do not want the Airflow REST API to make. Whether the code is
generated or not does not matter, because Airflow has no idea whatsoever
whether it has been tampered with between the time it was generated and
the time it was pushed. The only way Airflow can know that the code has
not been tampered with is when it generates the DAG code on its own, based
on a declarative input. The limit is to push declarative information only.
You CANNOT push code via the REST API. This is out of the question. The
case is closed.
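
To make "declarative information only" concrete, here is a minimal sketch
of what accepting such a payload could look like. It is NOT an existing
Airflow API: the JSON layout, the function name validate_declarative_dag
and the operator names in ALLOWED_OPERATORS are all hypothetical and only
illustrate the idea that the payload is parsed as data, checked against
code that is already installed, and never executed.

```python
import json

# Hypothetical allow-list: names of operators that are already installed
# in the Airflow deployment. These names are made up for illustration.
ALLOWED_OPERATORS = {"ValidatedSqlOperator", "ReportEmailOperator"}


def validate_declarative_dag(payload: str) -> dict:
    """Parse a JSON DAG declaration and reject anything not on the allow-list.

    The payload is treated purely as data: it is parsed with json.loads and
    inspected, but nothing in it is imported, evaluated or executed.
    """
    spec = json.loads(payload)  # raises ValueError on malformed JSON
    if not isinstance(spec, dict) or not isinstance(spec.get("dag_id"), str):
        raise ValueError("declaration must be an object with a string dag_id")

    tasks = spec.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("declaration must contain a non-empty task list")

    task_ids = set()
    for task in tasks:
        if not isinstance(task, dict) or not isinstance(task.get("task_id"), str):
            raise ValueError("every task needs a string task_id")
        if task["task_id"] in task_ids:
            raise ValueError(f"duplicate task_id: {task['task_id']}")
        if task.get("operator") not in ALLOWED_OPERATORS:
            raise ValueError(f"operator not allowed: {task.get('operator')!r}")
        task_ids.add(task["task_id"])

    # Relations may only point at tasks declared in the same payload.
    for task in tasks:
        for upstream in task.get("upstream", []):
            if upstream not in task_ids:
                raise ValueError(f"unknown upstream task: {upstream!r}")

    return spec
```

With something along these lines, the only thing a caller can influence
is which pre-installed operators run and in what order, which is the
property a "push any Python file" endpoint cannot give you.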

> The middle loop usually happens on a Jupyter notebook, it needs to change
> data/features used by model frequently which in turn leads to Airflow DAG
> updates, do you mind elaborate how to automate the changes inside a
> notebook and programmatically submitting DAGs through git+CI/CD while
> giving user quick feedback? I understand git+ci/cd is technically possible
> but the overhead involved is a major reason users rejecting Airflow for
> other alternative solutions, e.g. git repo requires manual approval even if
> DAGs can be programmatically submitted, and CI/CD are slow offline
> processes with large repo.

Case 2 is actually the case where a shared volume could still be used and
is better (it is written in the article I posted above, if you read it).
This is why it's great that Airflow supports multiple DAG syncing
solutions: your "middle" environment does not have to use git sync, as it
is not "production" (unless you want to mix development with testing,
which is a terrible, terrible idea).

Your data scientist in the middle loop does one of the following:

a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally" - if you use
a shared volume of some sort (NFS/EFS etc.)
b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your DAGs are
on S3 and synced using s3-sync
c) gsutil cp my_dag.py "gs://my-bucket" - if your DAGs are on GCS and
synced using the GCS equivalent (e.g. gsutil rsync)

Those are excellent "file push" APIs. They do the job. I cannot imagine why
the middle-loop person might have a problem with using them. All of that
can also be fully automated: they all have nice Python and other language
APIs, so you can even make the IDE run those commands automatically on
every save if you want.
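
As an illustration of option a), here is a minimal sketch of automating
the push from a notebook or an IDE hook using only the Python standard
library. The helper name push_dag and both paths are hypothetical. It
copies to a temporary name first and then renames, so the DAG folder
should never contain a half-written .py file (the rename is atomic on a
local POSIX filesystem; network filesystems may behave differently).

```python
import os
import shutil
import tempfile
from pathlib import Path


def push_dag(dag_file, dags_folder):
    """Copy a DAG file into a locally mounted DAG folder.

    Hypothetical helper, not part of Airflow. Returns the destination path.
    """
    src = Path(dag_file)
    dest_dir = Path(dags_folder)
    dest = dest_dir / src.name

    # Temporary file in the same directory, with a name that does not end
    # in .py, so a scan for DAG files should not pick it up mid-copy.
    fd, tmp_name = tempfile.mkstemp(dir=dest_dir, prefix=".", suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp_name)
        os.replace(tmp_name, dest)  # replaces an older version if present
    except BaseException:
        # Clean up the partial copy before re-raising the original error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

Calling push_dag("my_dag.py", "/my_midle_volume_shared_and_mounted_locally")
from a notebook cell would then play the role of the "cp" command above;
for S3 or GCS the same pattern applies with the respective client library
instead of shutil.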

Could you please elaborate on why using those APIs (which are really good
for file pushing) would be a problem?

J.




On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <[email protected]> wrote:

> First appreciate all for your valuable feedback. Airflow by design has to
> accept code, both Tomasz and Constance's examples let me think that the
> security judgement should be on the actual DAGs rather than how DAGs are
> accepted or a process itself. To expand a little bit more on another
> example, say another service provides an API which can be invoked by its
> clients the service validates user inputs e.g. SQL and generates Airflow
> DAGs which use the validated operators/macros. Those DAGs are safe to be
> pushed through the API. There are certainly cases that DAGs may not be
> safe, e.g the API service on public cloud with shared tenants with no
> knowledge how DAGs are generated, in such cases the API service can access
> control the identity or even reject all calls when considered unsafe.
> Please let me know if the example makes sense, and if there is a common
> interest, having an Airflow native write path would benefit the community
> instead of each building its own solution.
>
> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case,
> here are three loops a data scientist is doing to develop a machine
> learning model:
> - inner loop: iterates on the model locally.
> - middle loop: iterate the model on a remote cluster with production data,
> say it's using Airflow DAGs behind the scenes.
> - outer loop: done with iteration and publish the model on production.
> The middle loop usually happens on a Jupyter notebook, it needs to change
> data/features used by model frequently which in turn leads to Airflow DAG
> updates, do you mind elaborate how to automate the changes inside a
> notebook and programmatically submitting DAGs through git+CI/CD while
> giving user quick feedback? I understand git+ci/cd is technically possible
> but the overhead involved is a major reason users rejecting Airflow for
> other alternative solutions, e.g. git repo requires manual approval even if
> DAGs can be programmatically submitted, and CI/CD are slow offline
> processes with large repo.
>
> Such use case is pretty common for data scientists, and a better
> **online** service model would help open up more possibilities for Airflow
> and its users, as additional layers providing more values(like Constance
> mentioned enable users with no engineering or airflow domain knowledge to
> use Airflow) could be built on top of Airflow which remains as a lower
> level orchestration engine.
>
> thanks
> Mocheng
>
>
> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <[email protected]> wrote:
>
>> I really like the Idea of Tomek.
>>
>> If we ever go (which is not unlikely) - some "standard" declarative way
>> of describing DAGs, all my security, packaging concerns are gone - and
>> submitting such declarative DAG via API is quite viable. Simply submitting
>> a Python code this way is a no-go for me :). Such Declarative DAG could be
>> just stored in the DB and scheduled and executed using only "declaration"
>> from the DB - without ever touching the DAG "folder" and without allowing
>> the user to submit any executable code this way. All the code to execute
>> would already have to be in Airflow already in this case.
>>
>> And I very much agree also that this case can be solved with Git. I think
>> we are generally undervaluing the role Git plays for DAG distribution of
>> Airflow.
>>
>> I think when the user feels the need (I very much understand the need
>> Constance) to submit the DAG via API,  rather than adding the option of
>> submitting the DAG code via "Airflow REST API", we should simply answer
>> this:
>>
>> *Use Git and git sync. Then "Git Push" then becomes the standard "API"
>> you wanted to push the code.*
>>
>> This has all the flexibility you need, it has integration with Pull
>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>> - we have ALL of that and more for free. Standing on the shoulders of
>> giants.
>> If we start thinking about integration of code push via our own API - we
>> basically start the journey of rewriting Git as eventually we will have to
>> support those cases. This makes absolutely no sense for me.
>>
>> I even start to think that we should make "git sync" a separate (and much
>> more viable) option that is pretty much the "main recommendation" for
>> Airflow. rather than "yet another option among shared folders and baked in
>> DAGs" case.
>>
>> I recently even wrote my thoughts about it in this post: "Shared Volumes
>> in Airflow - the good, the bad and the ugly":
>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>> which has much more details on why I think so.
>>
>> J.
>>
>>
>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>> <[email protected]> wrote:
>>
>>> I understand the security concerns, and generally agree, but as a
>>> regular user I always wished we could upload DAG files via an API. It opens
>>> the door to have an upload button, which would be nice. It would make
>>> Airflow a lot more accessible to non-engineering types.
>>>
>>> I love the idea of implementing a manual review option in conjunction
>>> with some sort of hook (similar to Airflow cluster policies) would be a
>>> good middle ground. An administrator could use that hook to do checks
>>> against DAGs or run security scanners, and decide whether or not to
>>> implement a review requirement.
>>>
>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <[email protected]>
>>> wrote:
>>>
>>>> In general I second what XD said. CI/CD feels better than sending DAG
>>>> files over API and the security issues arising from accepting "any python
>>>> file" are probably quite big.
>>>>
>>>> However, I think this proposal can be tightly related to "declarative
>>>> DAGs". Instead of sending a DAG file, the user would send the DAG
>>>> definition (operators, inputs, relations) in a predefined format that is
>>>> not a code. This of course has some limitations like inability to define
>>>> custom macros, callbacks on the fly but it may be a good compromise.
>>>>
>>>> Other thought - if we implement something like "DAG via API" then we
>>>> should consider adding an option to review DAGs (approval queue etc) to
>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>> from git (where we have code review, security scanners etc).
>>>>
>>>> Cheers,
>>>> Tomek
>>>>
>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <[email protected]> wrote:
>>>>
>>>>> Hi Mocheng,
>>>>>
>>>>> Please allow me to share a question first: so in your proposal, the
>>>>> API in your plan is still accepting an Airflow DAG as the payload (just
>>>>> binarized or compressed), right?
>>>>>
>>>>> If that's the case, I may not be fully convinced: the objectives in
>>>>> your proposal is about automation & programmatically submitting DAGs.
>>>>> These can already be achieved in an efficient way through CI/CD practice + a
>>>>> centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>> files).
>>>>>
>>>>> As you are already aware, allowing this via API adds additional
>>>>> security concern, and I would doubt if that "breaks even".
>>>>>
>>>>> Kindly let me know if I have missed anything or misunderstood your
>>>>> proposal. Thanks.
>>>>>
>>>>>
>>>>> Regards,
>>>>> XD
>>>>> ----------------------------------------------------------------
>>>>> (This is not a contribution)
>>>>>
>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <[email protected]> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I have an enhancement proposal for the REST API service. This is
>>>>>> based on the observations that Airflow users want to be able to access
>>>>>> Airflow more easily as a platform service.
>>>>>>
>>>>>> The motivation comes from the following use cases:
>>>>>> 1. Users like data scientists want to iterate over data quickly with
>>>>>> interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>> 2. Services targeting specific audiences can generate DAGs based on
>>>>>> inputs like user command or external triggers, and they want to be able
>>>>>> to submit DAGs programmatically without manual intervention.
>>>>>>
>>>>>> I believe such use cases would help promote Airflow usability and
>>>>>> gain more customer popularity. The existing DAG repo brings considerable
>>>>>> overhead for such scenarios, a shared repo requires offline processes and
>>>>>> can be slow to rollout.
>>>>>>
>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>> transmitted online and here are some key points:
>>>>>> 1. A DAG is packaged individually so that it can be distributable
>>>>>> over the network. For example, a DAG may be a serialized binary or a zip
>>>>>> file.
>>>>>> 2. The Airflow REST API is the ideal place to talk with the external
>>>>>> world. The API would provide a generic interface to accept DAG artifacts
>>>>>> and should be extensible to support different artifact formats if needed.
>>>>>> 3. DAG persistence needs to be implemented since they are not part of
>>>>>> the DAG repository.
>>>>>> 4. Same behavior for DAGs supported in API vs those defined in the
>>>>>> repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>> execution, and web server UI should behave the same way.
>>>>>>
>>>>>> Since DAGs are written as code, running arbitrary code inside Airflow
>>>>>> may pose high security risks. Here are a few proposals to stop the
>>>>>> security breach:
>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>>>>> pluggable authentication modules where strong authentication such as
>>>>>> Kerberos can be used.
>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created through
>>>>>> the API service will have run_as_user set to be the API identity.
>>>>>> 3. To enforce data access control on DAGs, the API identity should
>>>>>> also be used to access the data warehouse.
>>>>>>
>>>>>> We shared a demo based on a prototype implementation in the summit
>>>>>> and some details are described in this ppt
>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>> and would love to get feedback and comments from the community about this
>>>>>> initiative.
>>>>>>
>>>>>> thanks
>>>>>> Mocheng
>>>>>>
>>>>>
>>>
>>> --
>>>
>>> Constance Martineau
>>> Product Manager
>>>
>>> Email: [email protected]
>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>
>>>
>>> <https://www.astronomer.io/>
>>>
>>>
