Hi, I have also felt the need at times for creating DAGs through the REST API, but I understand the security concerns associated with such an implementation.
If not submission through the REST API, then at least some sort of *development mode* in the Airflow interface to create and edit DAGs for a user session. Not sure if this was brought up previously.

Thanks,
Nishant

On Thu, Aug 25, 2022 at 6:32 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Just in case - please watch the devlist for the announcement of the "SIG multitenancy" group, in case it slips my mind.
>
> On Thu, Aug 25, 2022 at 1:31 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Cool. I will make sure to include you! I think this is something that will happen in September; the holiday period is not the best time to organize it.
>>
>> On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>
>>> My use case needs automation and security: those are the two key requirements, and it does not have to be the REST API if there is another way that DAGs could be submitted to cloud storage securely. Sure, I would appreciate it if you could include me when organizing AIP-1 related meetings. Kerberos is a ticket-based system in which a ticket has a limited lifetime. Using Kerberos, a workload could be authenticated before persistence so that Airflow uses its Kerberos keytab to execute, which is similar to the current implementation in the worker; another possible scenario is that a persisted workload needs to include a renewable Kerberos TGT to be used by the Airflow worker, but this is more complex and I would be happy to discuss it further in meetings. I will draft a more detailed document for review.
>>>
>>> thanks
>>> Mocheng
>>>
>>> On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> None of those requirements are supported by Airflow. And opening the REST API does not solve the authentication use case you mentioned.
>>>>
>>>> This is a completely new requirement you have - basically what you want is workflow identity, and it should be rather independent of the way the DAG is submitted.
>>>> It would require attaching some kind of identity and a signature, and some way of making sure that the DAG has not been tampered with, in such a way that the worker could use the identity when executing the workload and be sure that no one else modified the DAG - including any of the files that the DAG uses. This is an interesting case, but it has nothing to do with using or not using the REST API. The REST API alone will not give you the user identity guarantees that you need here. The distributed nature of Airflow basically requires that such workflow identity be provided by cryptographic signatures and by verifying the integrity of the DAG, rather than basing it on REST API authentication.
>>>>
>>>> BTW. We already support Kerberos authentication for some of our operators, but the identity is necessarily per instance executing the workload - not per user submitting the DAG.
>>>>
>>>> This could be one of the improvement proposals that could in the future become a sub-AIP of AIP-1 (Improve Airflow Security). If you are interested in leading and proposing such an AIP, I will soon (in a month or so) be re-establishing #sig-multitenancy meetings (see AIP-1 for recordings and minutes of previous meetings). We already have AIP-43 and AIP-44 approved there (with AIP-43 close to completion), and the next steps should be introducing a fine-grained security layer for executing the workloads. Adding workload identity might be part of it. If you would like to work on that - you are most welcome. It means preparing and discussing proposals, getting consensus of the involved parties, leading it to a vote and finally implementing it.
>>>>
>>>> J
>>>>
>>>> On Thu, Aug 18, 2022 at 02:44 Mocheng Guo <gmca...@gmail.com> wrote:
>>>>
>>>>> >> Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?
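The workflow-identity idea Jarek sketches above - binding a submitter's identity to the exact DAG bytes with a cryptographic signature, and verifying integrity on the worker - can be illustrated in a few lines. This is an editorial sketch only: no such mechanism exists in Airflow today, the per-user key store is invented, and a real deployment would likely use asymmetric signatures rather than shared-secret HMAC:

```python
# Illustration only - Airflow does not ship DAG signing today.
# A submitter signs the DAG file with a key tied to their identity;
# the worker verifies both the identity and the file integrity
# before executing anything.
import hashlib
import hmac

SECRET_KEYS = {"alice": b"alice-secret-key"}  # hypothetical per-user keys

def sign_dag(identity: str, dag_source: bytes) -> str:
    """Produce a signature binding the identity to the exact DAG bytes."""
    return hmac.new(SECRET_KEYS[identity], dag_source, hashlib.sha256).hexdigest()

def verify_dag(identity: str, dag_source: bytes, signature: str) -> bool:
    """Reject the DAG if it was tampered with after signing."""
    expected = sign_dag(identity, dag_source)
    return hmac.compare_digest(expected, signature)

dag = b"from airflow import DAG  # ... dag definition ..."
sig = sign_dag("alice", dag)
assert verify_dag("alice", dag, sig)
assert not verify_dag("alice", dag + b"# tampered", sig)
```

Any modification to the DAG file (or, with a manifest, any file it uses) invalidates the signature, which is exactly the property REST-API authentication alone cannot give.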
>>>>>
>>>>> Submitting DAGs directly to a cloud storage API does help with some part of the use case requirement, but cloud storage does not provide the security a data warehouse needs. A typical auth model supported in a data warehouse is Kerberos, and a data warehouse provides a limited view to a Kerberos user through authorization rules. We need users to submit DAGs with identities supported by the data warehouse, so that Apache Spark jobs will be executed as the Kerberos user who submits a DAG, which in turn decides what data can be processed. There may also be a need to handle impersonation, so there needs to be an additional layer to handle data warehouse auth, e.g. Kerberos.
>>>>>
>>>>> Assuming DAGs are already inside the cloud storage, I think AIP-5/20 would work better than the current mono-repo model if it could support better flexibility and less latency, and I would be very interested to be part of the design and implementation.
>>>>>
>>>>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>> First, I appreciate all of you for your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples let me think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted, or on a process itself. To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API.
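The "validated input → generated DAG" pattern described above can be sketched as follows. This is a hypothetical illustration: the validation rule and template are invented, and the operator import inside the template is only example text that is never executed here - the point is that the user supplies data (SQL), not code:

```python
# Hypothetical sketch of a DAG-generating service: user input is
# validated first, and the emitted DAG is rendered from a fixed
# template using only vetted operators, so no arbitrary user code
# ever reaches Airflow.
ALLOWED_PREFIXES = ("SELECT",)  # illustrative validation rule

DAG_TEMPLATE = '''\
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(dag_id={dag_id!r}, schedule=None) as dag:
    SQLExecuteQueryOperator(task_id="run_query", sql={sql!r})
'''

def generate_dag(dag_id: str, user_sql: str) -> str:
    """Validate the input, then render a DAG from the fixed template."""
    if not user_sql.lstrip().upper().startswith(ALLOWED_PREFIXES):
        raise ValueError("only SELECT statements are accepted")
    return DAG_TEMPLATE.format(dag_id=dag_id, sql=user_sql)

source = generate_dag("daily_report", "SELECT * FROM sales")
assert "SELECT * FROM sales" in source
```

Because the template is fixed, the generated file is only as trustworthy as the path between generation and ingestion - which is Jarek's tampering objection below.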
>>>>>> There are certainly cases where DAGs may not be safe, e.g. an API service on a public cloud with shared tenants and no knowledge of how the DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when they are considered unsafe. Please let me know if the example makes sense, and if there is common interest - having an Airflow-native write path would benefit the community instead of everyone building their own solution.
>>>>>>
>>>>>> > You seem to repeat more of the same. This is exactly what we want to avoid. IF you can push code over the API, you can push Any Code. And precisely the "access control" you mentioned, or rejecting the call when "considering code unsafe" - those are the decisions we already deliberately decided we do not want the Airflow REST API to make. Whether the code is generated or not does not matter, because Airflow has no idea whatsoever whether it has been tampered with between the time it was generated and pushed. The only way Airflow can know that the code has not been tampered with is when it generates the DAG code on its own, based on a declarative input. The limit is to push declarative information only. You CANNOT push code via the REST API. This is out of the question. The case is closed.
>>>>>>
>>>>>> The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating on how to automate the changes inside a notebook and programmatically submit DAGs through git + CI/CD while giving the user quick feedback? I understand git + CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g.
>>>>>> a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD runs are slow offline processes with a large repo.
>>>>>>
>>>>>> Case 2 is actually (if you attempt to read my article I posted above, it's written there) the case where shared volumes could still be used, and they are better. This is why it's great that Airflow supports multiple DAG syncing solutions, because your "middle" environment does not have to have git sync, as it is not "production" (unless you want to mix development with testing, that is, which is a terrible, terrible idea).
>>>>>>
>>>>>> Your data scientist in the middle loop does:
>>>>>>
>>>>>> a) cp my_dag.py "/my_middle_volume_shared_and_mounted_locally" - if you use a shared volume of some sort (NFS/EFS etc.)
>>>>>> b) aws s3 cp my_dag.py "s3://my-middle-testing-bucket/" - if your DAGs are on S3 and synced using s3-sync
>>>>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your DAGs are on GCS and synced using gcs-sync
>>>>>>
>>>>>> Those are excellent "file push" APIs. They do the job. I cannot imagine why the middle-loop person might have a problem with using them. All of that can also be fully automated - they all have nice Python and other-language APIs, so you can even make the IDE run those commands automatically on every save if you want.
>>>>>>
>>>>>> Could you please elaborate why this would be a problem to use those (really good for file pushing) APIs?
>>>>>>
>>>>>> J.
>>>>>>
>>>>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gmca...@gmail.com> wrote:
>>>>>>
>>>>>>> First, I appreciate all of you for your valuable feedback. Airflow by design has to accept code; both Tomasz's and Constance's examples let me think that the security judgement should be on the actual DAGs rather than on how DAGs are accepted, or on a process itself.
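The a)/b)/c) push commands above are easy to wrap in a small helper so a notebook or IDE can run them on every save. A minimal sketch, with illustrative target names - the actual copying is delegated to the cp/aws/gsutil tools Jarek mentions:

```python
# Minimal sketch: pick the right "file push" command for wherever
# the middle-loop environment syncs its DAGs from. Bucket and volume
# names are illustrative; the real copy is done by cp/aws/gsutil.
import subprocess

def push_command(dag_file: str, target: str) -> list[str]:
    """Build the copy command for a shared volume, S3, or GCS target."""
    if target.startswith("s3://"):
        return ["aws", "s3", "cp", dag_file, target]
    if target.startswith("gs://"):
        return ["gsutil", "cp", dag_file, target]
    return ["cp", dag_file, target]  # plain shared volume (NFS/EFS etc.)

def push(dag_file: str, target: str) -> None:
    """Run the push; requires the relevant CLI tool on PATH."""
    subprocess.run(push_command(dag_file, target), check=True)

assert push_command("my_dag.py", "s3://my-middle-testing-bucket/")[:3] == ["aws", "s3", "cp"]
assert push_command("my_dag.py", "/mnt/shared-dags")[0] == "cp"
```

Hooked into an on-save callback, this gives the "save in notebook → DAG appears in the middle environment" loop without any new Airflow API.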
>>>>>>> To expand a little bit more on another example: say another service provides an API which can be invoked by its clients; the service validates user inputs, e.g. SQL, and generates Airflow DAGs which use the validated operators/macros. Those DAGs are safe to be pushed through the API. There are certainly cases where DAGs may not be safe, e.g. an API service on a public cloud with shared tenants and no knowledge of how the DAGs are generated; in such cases the API service can apply access control to the identity or even reject all calls when they are considered unsafe. Please let me know if the example makes sense, and if there is common interest - having an Airflow-native write path would benefit the community instead of everyone building their own solution.
>>>>>>>
>>>>>>> Hi Xiaodong/Jarek, regarding your suggestion, let me elaborate on a use case. Here are the three loops a data scientist goes through to develop a machine learning model:
>>>>>>> - inner loop: iterates on the model locally.
>>>>>>> - middle loop: iterates the model on a remote cluster with production data; say it's using Airflow DAGs behind the scenes.
>>>>>>> - outer loop: done with iteration, publishes the model to production.
>>>>>>> The middle loop usually happens in a Jupyter notebook; it needs to change the data/features used by the model frequently, which in turn leads to Airflow DAG updates. Would you mind elaborating on how to automate the changes inside a notebook and programmatically submit DAGs through git + CI/CD while giving the user quick feedback? I understand git + CI/CD is technically possible, but the overhead involved is a major reason users reject Airflow for alternative solutions, e.g. a git repo requires manual approval even if DAGs can be programmatically submitted, and CI/CD runs are slow offline processes with a large repo.
>>>>>>>
>>>>>>> Such a use case is pretty common for data scientists, and a better **online** service model would help open up more possibilities for Airflow and its users, as additional layers providing more value (like Constance mentioned, enabling users with no engineering or Airflow domain knowledge to use Airflow) could be built on top of Airflow, which remains a lower-level orchestration engine.
>>>>>>>
>>>>>>> thanks
>>>>>>> Mocheng
>>>>>>>
>>>>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>>>
>>>>>>>> I really like the idea of Tomek.
>>>>>>>>
>>>>>>>> If we ever go for (which is not unlikely) some "standard" declarative way of describing DAGs, all my security and packaging concerns are gone - and submitting such a declarative DAG via an API is quite viable. Simply submitting Python code this way is a no-go for me :). Such a declarative DAG could be just stored in the DB, and scheduled and executed using only the "declaration" from the DB - without ever touching the DAG "folder" and without allowing the user to submit any executable code this way. All the code to execute would have to be in Airflow already in this case.
>>>>>>>>
>>>>>>>> And I very much agree also that this case can be solved with Git. I think we are generally undervaluing the role Git plays for DAG distribution in Airflow.
>>>>>>>>
>>>>>>>> I think when a user feels the need (I very much understand the need, Constance) to submit the DAG via an API, rather than adding the option of submitting DAG code via the "Airflow REST API", we should simply answer this:
>>>>>>>>
>>>>>>>> *Use Git and git sync.
>>>>>>>> "git push" then becomes the standard "API" you wanted for pushing the code.*
>>>>>>>>
>>>>>>>> This has all the flexibility you need; it integrates with pull requests and CI workflows, keeps history, etc. When we tell people "use Git", we get ALL of that and more for free - standing on the shoulders of giants. If we start thinking about integrating code push via our own API, we basically start the journey of rewriting Git, as eventually we will have to support those cases. This makes absolutely no sense to me.
>>>>>>>>
>>>>>>>> I even start to think that we should make "git sync" a separate (and much more visible) option that is pretty much the "main recommendation" for Airflow, rather than "yet another option among shared folders and baked-in DAGs".
>>>>>>>>
>>>>>>>> I recently wrote up my thoughts about this in the post "Shared Volumes in Airflow - the good, the bad and the ugly": https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca which has much more detail on why I think so.
>>>>>>>>
>>>>>>>> J.
>>>>>>>>
>>>>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
>>>>>>>>
>>>>>>>>> I understand the security concerns, and generally agree, but as a regular user I always wished we could upload DAG files via an API. It opens the door to having an upload button, which would be nice. It would make Airflow a lot more accessible to non-engineering types.
>>>>>>>>>
>>>>>>>>> I love the idea: implementing a manual review option in conjunction with some sort of hook (similar to Airflow cluster policies) would be a good middle ground.
>>>>>>>>> An administrator could use that hook to run checks or security scanners against DAGs, and decide whether or not to implement a review requirement.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <turbas...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> In general I second what XD said. CI/CD feels better than sending DAG files over an API, and the security issues arising from accepting "any Python file" are probably quite big.
>>>>>>>>>>
>>>>>>>>>> However, I think this proposal can be tightly related to "declarative DAGs". Instead of sending a DAG file, the user would send the DAG definition (operators, inputs, relations) in a predefined format that is not code. This of course has some limitations, like the inability to define custom macros or callbacks on the fly, but it may be a good compromise.
>>>>>>>>>>
>>>>>>>>>> Other thought - if we implement something like "DAG via API", then we should consider adding an option to review DAGs (approval queue etc.) to reduce the security issues that are otherwise mitigated by, for example, deploying DAGs from git (where we have code review, security scanners etc.).
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Tomek
>>>>>>>>>>
>>>>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mocheng,
>>>>>>>>>>>
>>>>>>>>>>> Please allow me to share a question first: in your proposal, the API is still accepting an Airflow DAG as the payload (just binarized or compressed), right?
>>>>>>>>>>>
>>>>>>>>>>> If that's the case, I may not be fully convinced: the objectives in your proposal are automation and programmatically submitting DAGs.
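The hook-plus-review-queue idea Constance and Tomasz describe above might look roughly like this. It is an illustration only - no such hook exists in Airflow, and the function name and deny-list are invented - but it follows the spirit of cluster policies: an admin-supplied callable that inspects a submitted DAG and routes it to auto-approval or a manual review queue:

```python
# Illustration of a review hook (not an existing Airflow API):
# an admin-supplied callable, similar in spirit to cluster policies,
# inspects a submitted DAG's source and decides whether it can be
# auto-approved or must wait in a manual review queue.
import re

# Invented deny-list: patterns an administrator might flag for review.
DISALLOWED = [r"\bsubprocess\b", r"\bos\.system\b", r"\beval\(", r"\bexec\("]

def review_hook(dag_source: str) -> str:
    """Return 'approved' or 'needs-review' for a submitted DAG."""
    for pattern in DISALLOWED:
        if re.search(pattern, dag_source):
            return "needs-review"
    return "approved"

assert review_hook("from airflow import DAG\n") == "approved"
assert review_hook("import subprocess\n") == "needs-review"
```

A real deployment would pair such a hook with external scanners and an approval UI; static pattern matching alone cannot make Python code safe, which is why the thread treats review as a mitigation rather than a guarantee.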
>>>>>>>>>>> These can already be achieved in an efficient way through CI/CD practice plus a centralized place to manage your DAGs (e.g. a Git repo to host the DAG files).
>>>>>>>>>>>
>>>>>>>>>>> As you are already aware, allowing this via an API adds additional security concerns, and I doubt it would "break even".
>>>>>>>>>>>
>>>>>>>>>>> Kindly let me know if I have missed anything or misunderstood your proposal. Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> XD
>>>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>>>> (This is not a contribution)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I have an enhancement proposal for the REST API service. This is based on the observation that Airflow users want to be able to access Airflow more easily as a platform service.
>>>>>>>>>>>>
>>>>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>>>>> 1. Users like data scientists want to iterate over data quickly with interactive feedback in minutes, e.g. managing data pipelines inside a Jupyter notebook while executing them in a remote Airflow cluster.
>>>>>>>>>>>> 2. Services targeting specific audiences can generate DAGs based on inputs like user commands or external triggers, and they want to be able to submit DAGs programmatically without manual intervention.
>>>>>>>>>>>>
>>>>>>>>>>>> I believe such use cases would help improve Airflow's usability and gain it more popularity. The existing DAG repo brings considerable overhead for such scenarios; a shared repo requires offline processes and can be slow to roll out.
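Tomasz's "declarative DAG" suggestion quoted above - sending a definition (operators, inputs, relations) in a predefined non-code format - could be sketched like this. The payload format and operator allow-list are invented for illustration; the point is that the server validates pure data and builds tasks only from operators it already ships:

```python
# Invented illustration of a "declarative DAG" payload: the client
# sends data (not code), and the server accepts tasks only from an
# allow-list of operator kinds that Airflow itself already provides.
ALLOWED_OPERATORS = {"bash", "sql"}  # hypothetical vetted operator kinds

def validate_declarative_dag(spec: dict) -> list[str]:
    """Check a declarative spec; return the task ids it defines."""
    tasks = spec["tasks"]
    for task in tasks:
        if task["operator"] not in ALLOWED_OPERATORS:
            raise ValueError(f"operator {task['operator']!r} not allowed")
    ids = {t["id"] for t in tasks}
    for task in tasks:
        for upstream in task.get("after", []):  # declared relations
            if upstream not in ids:
                raise ValueError(f"unknown upstream task {upstream!r}")
    return [t["id"] for t in tasks]

spec = {
    "dag_id": "declarative_example",
    "tasks": [
        {"id": "extract", "operator": "sql", "sql": "SELECT 1"},
        {"id": "report", "operator": "bash", "command": "echo done", "after": ["extract"]},
    ],
}
assert validate_declarative_dag(spec) == ["extract", "report"]
```

Since the payload is data rather than code, it can be stored in the DB and materialized by Airflow itself - which is what makes Jarek's tampering objection moot for this variant.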
>>>>>>>>>>>>
>>>>>>>>>>>> The proposal aims to provide an alternative where a DAG can be transmitted online. Here are some key points:
>>>>>>>>>>>> 1. A DAG is packaged individually so that it can be distributed over the network. For example, a DAG may be a serialized binary or a zip file.
>>>>>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the external world. The API would provide a generic interface to accept DAG artifacts, and should be extensible to support different artifact formats if needed.
>>>>>>>>>>>> 3. DAG persistence needs to be implemented, since these DAGs are not part of the DAG repository.
>>>>>>>>>>>> 4. Same behavior for DAGs submitted via the API vs. those defined in the repo, i.e. users write DAGs in the same syntax, and their scheduling, execution, and web server UI should behave the same way.
>>>>>>>>>>>>
>>>>>>>>>>>> Since DAGs are written as code, running arbitrary code inside Airflow may pose high security risks. Here are a few proposals to mitigate them:
>>>>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports pluggable authentication modules where strong authentication such as Kerberos can be used.
>>>>>>>>>>>> 2. Execute DAG code as the API identity, i.e. a DAG created through the API service will have run_as_user set to the API identity.
>>>>>>>>>>>> 3. To enforce data access control on DAGs, the API identity should also be used to access the data warehouse.
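Points 1 and 2 of the packaging proposal above, together with the run_as_user idea from the security list, might be sketched like this. The helper names are hypothetical and no such upload endpoint exists in Airflow; only run_as_user itself is an existing Airflow task argument:

```python
# Sketch of the proposal's packaging and identity-stamping steps
# (hypothetical helpers - Airflow has no DAG-upload endpoint):
# package a DAG file as a zip artifact, and record the authenticated
# API identity so tasks later run as that user.
import io
import zipfile

def package_dag(filename: str, dag_source: str) -> bytes:
    """Zip a single DAG file into a network-distributable artifact."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(filename, dag_source)
    return buf.getvalue()

def default_args_for(identity: str) -> dict:
    """Stamp the submitting identity onto default_args, so tasks run
    as that user via Airflow's run_as_user task argument."""
    return {"owner": identity, "run_as_user": identity}

artifact = package_dag("my_dag.py", "from airflow import DAG\n")
names = zipfile.ZipFile(io.BytesIO(artifact)).namelist()
assert names == ["my_dag.py"]
assert default_args_for("alice")["run_as_user"] == "alice"
```

The zip format matches what Airflow's DAG loader already accepts for packaged DAGs; the open question the thread raises is how the receiving side verifies that the artifact and identity were not tampered with in transit.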
>>>>>>>>>>>>
>>>>>>>>>>>> We shared a demo based on a prototype implementation at the summit, and some details are described in this ppt <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>. We would love to get feedback and comments from the community about this initiative.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks
>>>>>>>>>>>> Mocheng
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Constance Martineau
>>>>>>>>> Product Manager
>>>>>>>>>
>>>>>>>>> Email: consta...@astronomer.io
>>>>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>>>>
>>>>>>>>> <https://www.astronomer.io/>