I really like Tomek's idea. If we ever go (which is not unlikely) towards some "standard" declarative way of describing DAGs, all my security and packaging concerns are gone - and submitting such a declarative DAG via API is quite viable. Simply submitting Python code this way is a no-go for me :). Such a declarative DAG could be just stored in the DB and scheduled and executed using only the "declaration" from the DB - without ever touching the DAG "folder" and without allowing the user to submit any executable code this way. All the code to execute would already have to be present in Airflow in this case.
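To make the "declarative DAG" idea concrete - a minimal sketch, assuming a hypothetical dict-based schema (the field names and the allowlist are my invention; no agreed format exists). The point is that the client submits only data, never code, and every operator referenced must already exist in Airflow:

```python
# Hypothetical "declarative DAG" spec: pure data referencing operators by
# name. Nothing executable can be smuggled in, because the server only
# accepts declarations over an allowlist of pre-installed operators.

ALLOWED_OPERATORS = {"BashOperator", "SQLExecuteQueryOperator"}

dag_spec = {
    "dag_id": "example_declarative",
    "schedule": "@daily",
    "tasks": [
        {"task_id": "extract", "operator": "SQLExecuteQueryOperator",
         "params": {"sql": "SELECT 1"}},
        {"task_id": "notify", "operator": "BashOperator",
         "params": {"bash_command": "echo done"}},
    ],
    "dependencies": [["extract", "notify"]],  # [upstream, downstream]
}

def validate_spec(spec):
    """Reject anything that is not pure declaration over known operators."""
    task_ids = {t["task_id"] for t in spec["tasks"]}
    for task in spec["tasks"]:
        if task["operator"] not in ALLOWED_OPERATORS:
            raise ValueError(f"operator not allowed: {task['operator']}")
    for upstream, downstream in spec["dependencies"]:
        if upstream not in task_ids or downstream not in task_ids:
            raise ValueError("dependency references unknown task")
    return True

validate_spec(dag_spec)  # passes: declarations only, no executable code
```

A spec like this could be stored in the DB as-is and turned into a DAG object server-side, which is exactly why the security and packaging concerns disappear.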
And I very much agree that this case can be solved with Git. I think we are generally undervaluing the role Git plays in DAG distribution for Airflow. When a user feels the need (I very much understand the need, Constance) to submit a DAG via API, rather than adding an option to submit DAG code via the "Airflow REST API", we should simply answer: *Use Git and git sync. "Git push" then becomes the standard "API" you wanted for pushing the code.* This has all the flexibility you need, it has integration with Pull Requests and CI workflows, it keeps history, etc.

When we tell people "Use Git" - we get ALL of that and more for free. Standing on the shoulders of giants. If we start thinking about integrating code push via our own API - we basically start the journey of rewriting Git, as eventually we will have to support those cases. This makes absolutely no sense to me.

I am even starting to think that we should make "git sync" a separate (and much more viable) option that is pretty much the "main recommendation" for Airflow, rather than "yet another option among shared folders and baked-in DAGs". I recently wrote up my thoughts about it in this post: "Shared Volumes in Airflow - the good, the bad and the ugly": https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca which has much more detail on why I think so.

J.

On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:

> I understand the security concerns, and generally agree, but as a regular
> user I always wished we could upload DAG files via an API. It opens the
> door to having an upload button, which would be nice. It would make
> Airflow a lot more accessible to non-engineering types.
>
> I love the idea: implementing a manual review option in conjunction with
> some sort of hook (similar to Airflow cluster policies) would be a good
> middle ground.
> An administrator could use that hook to do checks against DAGs or run
> security scanners, and decide whether or not to implement a review
> requirement.
>
> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org> wrote:
>
>> In general I second what XD said. CI/CD feels better than sending DAG
>> files over the API, and the security issues arising from accepting "any
>> Python file" are probably quite big.
>>
>> However, I think this proposal can be tightly related to "declarative
>> DAGs". Instead of sending a DAG file, the user would send the DAG
>> definition (operators, inputs, relations) in a predefined format that is
>> not code. This of course has some limitations, like the inability to
>> define custom macros or callbacks on the fly, but it may be a good
>> compromise.
>>
>> Other thought - if we implement something like "DAG via API" then we
>> should consider adding an option to review DAGs (approval queue etc.) to
>> reduce the security issues that are otherwise mitigated by deploying
>> DAGs from git (where we have code review, security scanners, etc.).
>>
>> Cheers,
>> Tomek
>>
>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:
>>
>>> Hi Mocheng,
>>>
>>> Please allow me to share a question first: in your proposal, the API
>>> is still accepting an Airflow DAG as the payload (just binarized or
>>> compressed), right?
>>>
>>> If that's the case, I may not be fully convinced: the objectives in
>>> your proposal are about automation & programmatically submitting DAGs.
>>> These can already be achieved in an efficient way through CI/CD
>>> practice + a centralized place to manage your DAGs (e.g. a Git repo to
>>> host the DAG files).
>>>
>>> As you are already aware, allowing this via API adds an additional
>>> security concern, and I doubt that it "breaks even".
>>>
>>> Kindly let me know if I have missed anything or misunderstood your
>>> proposal. Thanks.
>>>
>>> Regards,
>>> XD
>>> ----------------------------------------------------------------
>>> (This is not a contribution)
>>>
>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> I have an enhancement proposal for the REST API service. This is
>>>> based on the observation that Airflow users want to be able to access
>>>> Airflow more easily as a platform service.
>>>>
>>>> The motivation comes from the following use cases:
>>>> 1. Users like data scientists want to iterate over data quickly with
>>>> interactive feedback in minutes, e.g. managing data pipelines inside
>>>> a Jupyter Notebook while executing them in a remote Airflow cluster.
>>>> 2. Services targeting specific audiences can generate DAGs based on
>>>> inputs like user commands or external triggers, and they want to be
>>>> able to submit DAGs programmatically without manual intervention.
>>>>
>>>> I believe such use cases would improve Airflow's usability and
>>>> popularity with users. The existing DAG repo brings considerable
>>>> overhead for such scenarios: a shared repo requires offline processes
>>>> and can be slow to roll out.
>>>>
>>>> The proposal aims to provide an alternative where a DAG can be
>>>> transmitted online, and here are some key points:
>>>> 1. A DAG is packaged individually so that it is distributable over
>>>> the network. For example, a DAG may be a serialized binary or a zip
>>>> file.
>>>> 2. The Airflow REST API is the ideal place to talk to the external
>>>> world. The API would provide a generic interface to accept DAG
>>>> artifacts and should be extensible to support different artifact
>>>> formats if needed.
>>>> 3. DAG persistence needs to be implemented, since such DAGs are not
>>>> part of the DAG repository.
>>>> 4. DAGs submitted via the API behave the same as those defined in the
>>>> repo, i.e.
>>>> users write DAGs in the same syntax, and their scheduling,
>>>> execution, and web server UI should behave the same way.
>>>>
>>>> Since DAGs are written as code, running arbitrary code inside Airflow
>>>> may pose high security risks. Here are a few proposals to mitigate
>>>> them:
>>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>>> pluggable authentication modules, where strong authentication such as
>>>> Kerberos can be used.
>>>> 2. Execute DAG code as the API identity, i.e. a DAG created through
>>>> the API service will have run_as_user set to the API identity.
>>>> 3. To enforce data access control on DAGs, the API identity should
>>>> also be used to access the data warehouse.
>>>>
>>>> We shared a demo based on a prototype implementation at the summit,
>>>> and some details are described in this ppt
>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>;
>>>> we would love to get feedback and comments from the community about
>>>> this initiative.
>>>>
>>>> thanks
>>>> Mocheng
>>>
>
> --
>
> Constance Martineau
> Product Manager
>
> Email: consta...@astronomer.io
> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>
> <https://www.astronomer.io/>
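To make point 1 of Mocheng's proposal concrete - a minimal sketch of what the client side of a "DAG as zip artifact" submission could look like, using only the standard library. The endpoint path and field names in the commented-out upload are hypothetical; nothing like this exists in the current Airflow REST API:

```python
# Sketch: package a single DAG file into an in-memory zip artifact that
# could be POSTed to a (hypothetical) DAG-submission endpoint.
import io
import zipfile

def package_dag(dag_filename: str, dag_source: str) -> bytes:
    """Bundle one DAG file into an in-memory zip artifact."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(dag_filename, dag_source)
    return buf.getvalue()

artifact = package_dag("my_dag.py", "# DAG code would go here\n")

# Hypothetical upload -- shown for shape only; the endpoint is invented:
# requests.post("https://airflow.example.com/api/v1/dags/artifacts",
#               headers={"Authorization": "Bearer <token>"},
#               files={"artifact": ("my_dag.zip", artifact)})
```

The server side would then need the persistence and run_as_user enforcement described in the proposal - which is exactly where the security discussion above (trusted parties, review hooks, or "just use Git") comes in.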