My use case needs automation and security: those are the two key requirements. It does not have to be the REST API if there is another way for DAGs to be submitted to cloud storage securely. And yes, I would appreciate it if you could include me when organizing AIP-1 related meetings.

Kerberos is a ticket-based system in which a ticket has a limited lifetime. Using Kerberos, a workload could be authenticated before persistence, so that Airflow then executes it under its own Kerberos keytab; this is similar to the current implementation in the worker. Another possible scenario is that a persisted workload includes a renewable Kerberos TGT to be used by the Airflow worker, but this is more complex and I would be happy to discuss it further in the meetings. I will draft a more detailed document for review.
thanks
Mocheng

On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> None of those requirements are supported by Airflow. And opening the REST
> API does not solve the authentication use case you mentioned.
>
> This is a completely new requirement you have - basically what you want is
> workflow identity, and it should be rather independent from the way the
> DAG is submitted. It would require attaching some kind of identity and
> signature, and some way of making sure that the DAG has not been tampered
> with, in a way that the worker could use the identity when executing the
> workload and be sure that no one else modified the DAG - including any of
> the files that the DAG uses. This is an interesting case, but it has
> nothing to do with using or not using the REST API. The REST API alone
> will not give you the user identity guarantees that you need here. The
> distributed nature of Airflow basically requires that such workflow
> identity be provided by cryptographic signatures and by verifying the
> integrity of the DAG, rather than basing it on REST API authentication.
>
> BTW. We already support Kerberos authentication for some of our
> operators, but the identity is necessarily that of the instance executing
> the workload - not of the user submitting the DAG.
>
> This could be one of the improvement proposals that could in the future
> become a sub-AIP of AIP-1 (Improve Airflow Security). If you are
> interested in leading and proposing such an AIP, I will soon (in a month
> or so) be re-establishing the #sig-multitenancy meetings (see AIP-1 for
> recordings and minutes of previous meetings). We already have AIP-43 and
> AIP-44 approved there (with AIP-43 close to completion), and the next step
> should be introducing a fine-grained security layer for executing the
> workloads. Adding workload identity might be part of it. If you would like
> to work on that - you are most welcome.
> It means preparing and discussing proposals, getting consensus from the
> involved parties, leading it to a vote and finally implementing it.
>
> J
>
> On Thu, Aug 18, 2022 at 02:44, Mocheng Guo <gmca...@gmail.com> wrote:
>
>> Could you please elaborate why this would be a problem to use those
>> (really good for file pushing) APIs ?
>>
>> Submitting DAGs directly to a cloud storage API does help with some part
>> of the use case requirement, but cloud storage does not provide the
>> security a data warehouse needs. A typical auth model supported in a data
>> warehouse is Kerberos, and a data warehouse provides a limited view to a
>> Kerberos user via authorization rules. We need users to submit DAGs with
>> identities supported by the data warehouse, so that Apache Spark jobs are
>> executed as the Kerberos user who submitted the DAG, which in turn
>> decides what data can be processed. There may also be a need to handle
>> impersonation, so there needs to be an additional layer handling data
>> warehouse auth, e.g. Kerberos.
>>
>> Assuming DAGs are already inside the cloud storage, I think AIP-5/20
>> would work better than the current mono-repo model if it could support
>> more flexibility and less latency, and I would be very interested to be
>> part of the design and implementation.
>>
>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> First, I appreciate all of your valuable feedback. Airflow by design
>>> has to accept code; both Tomasz's and Constance's examples make me think
>>> that the security judgement should be on the actual DAGs rather than on
>>> how DAGs are accepted or on the process itself. To expand a little bit
>>> more with another example: say a service provides an API which can be
>>> invoked by its clients; the service validates user inputs (e.g. SQL) and
>>> generates Airflow DAGs which use the validated operators/macros. Those
>>> DAGs are safe to be pushed through the API.
>>> There are certainly cases where DAGs may not be safe, e.g. an API
>>> service on a public cloud with shared tenants and no knowledge of how
>>> the DAGs are generated; in such cases the API service can access-control
>>> the identity, or even reject all calls when they are considered unsafe.
>>> Please let me know if the example makes sense, and whether there is
>>> common interest - an Airflow-native write path would benefit the
>>> community instead of everyone building their own solution.
>>>
>>> > You seem to repeat more of the same. This is exactly what we want to
>>> avoid. IF you can push code over the API, you can push ANY code. And
>>> precisely the "access control" you mentioned, or rejecting the call when
>>> "considering the code unsafe" - those are decisions we already
>>> deliberately decided we do not want the Airflow REST API to make.
>>> Whether the code is generated or not does not matter, because Airflow
>>> has no idea whatsoever whether it has been manipulated between the time
>>> it was generated and the time it was pushed. The only way Airflow can
>>> know that the code has not been manipulated is when it generates the DAG
>>> code on its own, based on a declarative input. The limit is to push
>>> declarative information only. You CANNOT push code via the REST API.
>>> This is out of the question. The case is closed.
>>>
>>> The middle loop usually happens in a Jupyter notebook; it needs to
>>> change the data/features used by the model frequently, which in turn
>>> leads to Airflow DAG updates. Do you mind elaborating on how to automate
>>> the changes inside a notebook and programmatically submit DAGs through
>>> git+CI/CD while still giving the user quick feedback? I understand
>>> git+CI/CD is technically possible, but the overhead involved is a major
>>> reason users reject Airflow for alternative solutions: e.g. a git repo
>>> requires manual approval even if DAGs can be submitted programmatically,
>>> and CI/CD is a slow offline process with a large repo.
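For illustration, the kind of notebook-side automation asked about above can be a few lines once DAGs are synced from a shared location. A minimal sketch, assuming a mounted shared volume with hypothetical paths and a hypothetical helper name (for S3/GCS the copy would instead be an `aws s3 cp` / `gsutil cp` call or the equivalent client-library call):

```python
import shutil
from pathlib import Path

def push_dag(dag_file: str, synced_dir: str) -> Path:
    """Copy a DAG file into a directory that an Airflow DAG-sync
    mechanism (shared volume, S3/GCS sync) watches.

    Hypothetical helper for illustration; the directory layout is a
    placeholder, not a real deployment convention.
    """
    src = Path(dag_file)
    dest_dir = Path(synced_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copy2 preserves mtimes, which sync tools may key on
    return dest
```

From a Jupyter cell this is a single call, e.g. `push_dag("my_dag.py", "/mnt/middle-dags")`, and a notebook save-hook or IDE action could invoke it automatically on every save.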
>>> Case 2 is actually (if you attempt to read the article I posted above,
>>> it's written there) the case where a shared volume could still be used,
>>> and is better. This is why it's great that Airflow supports multiple DAG
>>> syncing solutions - your "middle" environment does not have to use git
>>> sync, as it is not "production" (unless you want to mix development with
>>> testing, that is, which is a terrible, terrible idea).
>>>
>>> Your data scientist in the middle loop does:
>>>
>>> a) cp my_dag.py "/my_middle_volume_shared_and_mounted_locally" - if you
>>> use a shared volume of some sort (NFS/EFS etc.)
>>> b) aws s3 cp my_dag.py "s3://my-middle-testing-bucket/" - if your DAGs
>>> are on S3 and synced using s3-sync
>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your DAGs are on GCS and
>>> synced using GCS sync
>>>
>>> Those are excellent "file push" APIs. They do the job. I cannot imagine
>>> why the middle-loop person would have a problem using them. All of that
>>> can also be fully automated - they all have nice Python (and other
>>> language) APIs, so you can even make the IDE run those commands
>>> automatically on every save if you want.
>>>
>>> Could you please elaborate why this would be a problem to use those
>>> (really good for file pushing) APIs ?
>>>
>>> J.
>>>
>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gmca...@gmail.com> wrote:
>>>
>>>> First, I appreciate all of your valuable feedback. Airflow by design
>>>> has to accept code; both Tomasz's and Constance's examples make me
>>>> think that the security judgement should be on the actual DAGs rather
>>>> than on how DAGs are accepted or on the process itself. To expand a
>>>> little bit more with another example: say a service provides an API
>>>> which can be invoked by its clients; the service validates user inputs
>>>> (e.g. SQL) and generates Airflow DAGs which use the validated
>>>> operators/macros. Those DAGs are safe to be pushed through the API.
>>>> There are certainly cases where DAGs may not be safe, e.g. an API
>>>> service on a public cloud with shared tenants and no knowledge of how
>>>> the DAGs are generated; in such cases the API service can
>>>> access-control the identity, or even reject all calls when they are
>>>> considered unsafe. Please let me know if the example makes sense, and
>>>> whether there is common interest - an Airflow-native write path would
>>>> benefit the community instead of everyone building their own solution.
>>>>
>>>> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case.
>>>> Here are the three loops a data scientist goes through to develop a
>>>> machine learning model:
>>>> - inner loop: iterate on the model locally.
>>>> - middle loop: iterate on the model on a remote cluster with
>>>> production data, say using Airflow DAGs behind the scenes.
>>>> - outer loop: done with iteration; publish the model to production.
>>>>
>>>> The middle loop usually happens in a Jupyter notebook; it needs to
>>>> change the data/features used by the model frequently, which in turn
>>>> leads to Airflow DAG updates. Do you mind elaborating on how to
>>>> automate the changes inside a notebook and programmatically submit DAGs
>>>> through git+CI/CD while still giving the user quick feedback? I
>>>> understand git+CI/CD is technically possible, but the overhead involved
>>>> is a major reason users reject Airflow for alternative solutions: e.g.
>>>> a git repo requires manual approval even if DAGs can be submitted
>>>> programmatically, and CI/CD is a slow offline process with a large
>>>> repo.
>>>>
>>>> Such a use case is pretty common for data scientists, and a better
>>>> **online** service model would open up more possibilities for Airflow
>>>> and its users, as additional layers providing more value (like
>>>> Constance mentioned, enabling users with no engineering or Airflow
>>>> domain knowledge to use Airflow) could be built on top of Airflow,
>>>> which remains a lower-level orchestration engine.
>>>> thanks
>>>> Mocheng
>>>>
>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> I really like Tomek's idea.
>>>>>
>>>>> If we ever go for (which is not unlikely) some "standard" declarative
>>>>> way of describing DAGs, all my security and packaging concerns are
>>>>> gone - and submitting such a declarative DAG via the API is quite
>>>>> viable. Simply submitting Python code this way is a no-go for me :).
>>>>> Such a declarative DAG could just be stored in the DB and scheduled
>>>>> and executed using only the "declaration" from the DB - without ever
>>>>> touching the DAG "folder" and without allowing the user to submit any
>>>>> executable code this way. All the code to execute would already have
>>>>> to be in Airflow in this case.
>>>>>
>>>>> And I very much agree that this case can also be solved with Git. I
>>>>> think we are generally undervaluing the role Git plays in DAG
>>>>> distribution for Airflow.
>>>>>
>>>>> I think when a user feels the need (and I very much understand the
>>>>> need, Constance) to submit the DAG via an API, rather than adding the
>>>>> option of submitting DAG code via the "Airflow REST API", we should
>>>>> simply answer this:
>>>>>
>>>>> *Use Git and git sync. "git push" then becomes the standard "API" you
>>>>> wanted for pushing the code.*
>>>>>
>>>>> This has all the flexibility you need; it has integration with pull
>>>>> requests and CI workflows, it keeps history, etc. etc. When we tell
>>>>> people "use Git" - we get ALL of that and more for free. Standing on
>>>>> the shoulders of giants. If we start thinking about integrating code
>>>>> push via our own API - we basically start down the road of rewriting
>>>>> Git, as eventually we will have to support those cases. This makes
>>>>> absolutely no sense to me.
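To make the "declarative DAG" idea a bit more concrete, here is a hedged sketch with an invented schema (not an existing Airflow format): the client submits plain data naming tasks and dependencies, and the server validates operator names against a whitelist it controls, so no executable code ever crosses the API.

```python
import json
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Invented whitelist for illustration; only operator *names* are transmitted,
# and the server alone decides what code each name maps to.
ALLOWED_OPERATORS = {"bash", "sql", "spark"}

def validate_declarative_dag(payload: str) -> list:
    """Parse a declarative DAG submission, reject unknown operators,
    and return task ids in a valid execution order (cycles raise)."""
    spec = json.loads(payload)
    graph = {}
    for task in spec["tasks"]:
        if task["operator"] not in ALLOWED_OPERATORS:
            raise ValueError("operator not allowed: " + task["operator"])
        graph[task["id"]] = set(task.get("upstream", []))
    return list(TopologicalSorter(graph).static_order())

# A hypothetical submission: two tasks, one dependency, zero code.
example = json.dumps({
    "dag_id": "demo",
    "tasks": [
        {"id": "extract", "operator": "sql"},
        {"id": "train", "operator": "spark", "upstream": ["extract"]},
    ],
})
```

Here `validate_declarative_dag(example)` returns `["extract", "train"]`; a payload naming an operator outside the whitelist is rejected before anything is persisted.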
>>>>> I am even starting to think that we should make "git sync" a separate
>>>>> (and much more prominent) option that is pretty much the "main
>>>>> recommendation" for Airflow, rather than "yet another option among
>>>>> shared folders and baked-in DAGs".
>>>>>
>>>>> I recently wrote up my thoughts on this in the post "Shared Volumes
>>>>> in Airflow - the good, the bad and the ugly":
>>>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>>>> which has much more detail on why I think so.
>>>>>
>>>>> J.
>>>>>
>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>>>> <consta...@astronomer.io.invalid> wrote:
>>>>>
>>>>>> I understand the security concerns, and generally agree, but as a
>>>>>> regular user I always wished we could upload DAG files via an API.
>>>>>> It opens the door to having an upload button, which would be nice.
>>>>>> It would make Airflow a lot more accessible to non-engineering types.
>>>>>>
>>>>>> I love the idea of implementing a manual review option: in
>>>>>> conjunction with some sort of hook (similar to Airflow cluster
>>>>>> policies), it would be a good middle ground. An administrator could
>>>>>> use that hook to run checks against DAGs or run security scanners,
>>>>>> and decide whether or not to impose a review requirement.
>>>>>>
>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek
>>>>>> <turbas...@apache.org> wrote:
>>>>>>
>>>>>>> In general I second what XD said. CI/CD feels better than sending
>>>>>>> DAG files over an API, and the security issues arising from
>>>>>>> accepting "any Python file" are probably quite big.
>>>>>>>
>>>>>>> However, I think this proposal can be tightly related to
>>>>>>> "declarative DAGs". Instead of sending a DAG file, the user would
>>>>>>> send the DAG definition (operators, inputs, relations) in a
>>>>>>> predefined format that is not code.
>>>>>>> This of course has some limitations, like the inability to define
>>>>>>> custom macros or callbacks on the fly, but it may be a good
>>>>>>> compromise.
>>>>>>>
>>>>>>> Another thought - if we implement something like "DAG via API",
>>>>>>> then we should consider adding an option to review DAGs (an
>>>>>>> approval queue, etc.) to regain the security safeguards that are
>>>>>>> otherwise provided by, for example, deploying DAGs from git (where
>>>>>>> we have code review, security scanners, etc.).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tomek
>>>>>>>
>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xdd...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Mocheng,
>>>>>>>>
>>>>>>>> Please allow me to share a question first: in your proposal, the
>>>>>>>> API is still accepting an Airflow DAG as the payload (just
>>>>>>>> binarized or compressed), right?
>>>>>>>>
>>>>>>>> If that's the case, I may not be fully convinced: the objectives
>>>>>>>> in your proposal are about automation & programmatically
>>>>>>>> submitting DAGs. These can already be achieved efficiently through
>>>>>>>> CI/CD practices plus a centralized place to manage your DAGs (e.g.
>>>>>>>> a Git repo hosting the DAG files).
>>>>>>>>
>>>>>>>> As you are already aware, allowing this via an API adds an
>>>>>>>> additional security concern, and I doubt that it "breaks even".
>>>>>>>>
>>>>>>>> Kindly let me know if I have missed anything or misunderstood your
>>>>>>>> proposal. Thanks.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> XD
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> (This is not a contribution)
>>>>>>>>
>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> I have an enhancement proposal for the REST API service. It is
>>>>>>>>> based on the observation that Airflow users want to be able to
>>>>>>>>> access Airflow more easily as a platform service.
>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>> 1. Users like data scientists want to iterate over data quickly,
>>>>>>>>> with interactive feedback in minutes, e.g. managing data
>>>>>>>>> pipelines inside a Jupyter Notebook while executing them in a
>>>>>>>>> remote Airflow cluster.
>>>>>>>>> 2. Services targeting specific audiences can generate DAGs from
>>>>>>>>> inputs like user commands or external triggers, and they want to
>>>>>>>>> be able to submit DAGs programmatically without manual
>>>>>>>>> intervention.
>>>>>>>>>
>>>>>>>>> I believe such use cases would improve Airflow's usability and
>>>>>>>>> gain it more popularity with users. The existing DAG repo brings
>>>>>>>>> considerable overhead for such scenarios: a shared repo requires
>>>>>>>>> offline processes and can be slow to roll out.
>>>>>>>>>
>>>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>>>> transmitted online. Here are the key points:
>>>>>>>>> 1. A DAG is packaged individually so that it is distributable
>>>>>>>>> over the network. For example, a DAG may be a serialized binary
>>>>>>>>> or a zip file.
>>>>>>>>> 2. The Airflow REST API is the ideal place to talk to the
>>>>>>>>> external world. The API would provide a generic interface to
>>>>>>>>> accept DAG artifacts, and should be extensible to support
>>>>>>>>> different artifact formats if needed.
>>>>>>>>> 3. DAG persistence needs to be implemented, since these DAGs are
>>>>>>>>> not part of the DAG repository.
>>>>>>>>> 4. DAGs submitted via the API behave the same as those defined in
>>>>>>>>> the repo, i.e. users write DAGs in the same syntax, and their
>>>>>>>>> scheduling, execution, and web server UI should behave the same
>>>>>>>>> way.
>>>>>>>>>
>>>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>>>> Airflow may pose high security risks.
>>>>>>>>> Here are a few proposals to prevent a security breach:
>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already
>>>>>>>>> supports pluggable authentication modules, where strong
>>>>>>>>> authentication such as Kerberos can be used.
>>>>>>>>> 2. Execute DAG code as the API identity, i.e. a DAG created
>>>>>>>>> through the API service will have run_as_user set to the API
>>>>>>>>> identity.
>>>>>>>>> 3. To enforce data access control on DAGs, the API identity
>>>>>>>>> should also be used to access the data warehouse.
>>>>>>>>>
>>>>>>>>> We shared a demo based on a prototype implementation at the
>>>>>>>>> summit, and some details are described in this ppt
>>>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>;
>>>>>>>>> we would love to get feedback and comments from the community
>>>>>>>>> about this initiative.
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>> Mocheng
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Constance Martineau
>>>>>> Product Manager
>>>>>>
>>>>>> Email: consta...@astronomer.io
>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>
>>>>>> <https://www.astronomer.io/>
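As a concrete sketch of how a DAG artifact in this proposal could be packaged and integrity-protected: the artifact is a zip accompanied by a MAC, so the receiving API can tie it to a submitter identity and detect modification in transit. This is a hypothetical illustration, not a proposed Airflow interface; it assumes a shared secret per submitter, and a real design would more likely use asymmetric signatures or Kerberos as discussed in the thread.

```python
import hashlib
import hmac
import io
import zipfile

def package_dag(dag_name: str, dag_source: bytes, key: bytes):
    """Zip a single DAG file into an in-memory artifact and compute an
    HMAC-SHA256 over the artifact bytes with the submitter's key."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(dag_name + ".py", dag_source)
    artifact = buf.getvalue()
    signature = hmac.new(key, artifact, hashlib.sha256).hexdigest()
    return artifact, signature

def verify_dag(artifact: bytes, signature: str, key: bytes) -> bool:
    """Server-side check: recompute the MAC and compare in constant time,
    rejecting artifacts that were tampered with or signed with another key."""
    expected = hmac.new(key, artifact, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Verification happening before persistence is what lets point 2 above (run_as_user bound to the API identity) be trusted: a tampered artifact or a mismatched key fails `verify_dag` and is never stored.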