potiuk commented on issue #24740: URL: https://github.com/apache/airflow/issues/24740#issuecomment-1170853057
> 1. It's yet another point of failure, another credential at risk. I'd rather restrict source access to my CI servers, which need such privilege anyway.

This falls into the fallacy of thinking that things "magically work" in distributed systems. They don't. You already have those credentials in ANY scenario here; they just live elsewhere. Airflow is a distributed system, and uploading DAG files via "REST" does not make the credentials needed to distribute the files disappear. They still need to exist, just elsewhere.

* With GitSync, S3, or a shared filesystem, you have explicit or implicit credentials that you need to give to every component involved in the syncing process. They might be explicit (provided in the GitSync config), an S3 env variable, filesystem configuration, or a metadata server in the cloud; there are plenty of options. But this is essentially one computer talking to another over a network. It HAS TO have credentials; there is no way around it.
* What's worse, with your proposed solution, the REST API will at most upload the files to a single machine (say, the webserver). That machine then has to have credentials (implicit or explicit) to actually WRITE those DAG files to all the other components. Not only does this create a single point of failure and a far more "centralized" system that has to deal with all the problems of distributing files to potentially multiple physical machines, it DOES NOT remove the problem of having a credential on that machine. Worse still, it is far more dangerous, because now the Airflow webserver (or whichever component handles the API) has to have "WRITE" access to DAGs. This is SUPER dangerous. In the current setup there is no need whatsoever for write access to Git/S3/shared filesystems by any Airflow component; they only need READ access. And the management of "who" is able to write DAGs where is completely delegated out to the deployment.
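That read-only model is what a sync sidecar configuration expresses directly. As a rough sketch only (flag names as in the kubernetes git-sync project's v4 release; the repo URL is hypothetical, and you should verify the flags against your installed version):

```shell
# Read-only sync sidecar sketch. The credential behind this URL can be a
# read-only deploy key: no Airflow component ever needs write access to
# the DAG storage -- it only pulls.
git-sync \
  --repo=https://github.com/example-org/dags.git \
  --ref=main \
  --root=/opt/airflow/dags \
  --period=60s
```

This is a deployment-config fragment, not something to run standalone; the point is that the only credential in play is a pull-only one.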
Airflow has zero knowledge of and impact on that, which allows you, for example, to implement something the Composer team just did: per-folder write access to sub-folders of the DAGs folder. With a REST API, Airflow essentially takes on all those write-access capabilities and their entire management. I don't think this is something we are going to accept; it is a major change in the whole security model Airflow is built on.

> 2. There's no easy way of giving fine-grained access to the contents of a Git repo. Hence, using `git sync` means giving Airflow access to entire repos. That's especially worrisome for folks using monorepos.

Nope, this is a completely wrong assumption. Look at this talk from JAGEX at the Airflow Summit: https://youtu.be/uA-8Lj1RNgw. They use submodules and implemented a system where they have "ONE REPO PER DAG(!)", easily managed through a central repo with submodules, and they are extremely happy with it. Look how happy and smiling Anum was when he explained how it solved all of their problems with managing DAGs at scale across multiple teams. Learn from the best and stand on the shoulders of giants, I would say.

> 1. It would prevent another component of my infrastructure needing access to source.

Instead, it forces the Airflow component that takes the files in to have write access. This just moves the problem elsewhere; it does not solve it. You have to manage your own deployment anyway; this does not free you from that responsibility. By doing it, you are asking your security people to learn a new component (the Airflow API server), one with write access to the most crucial part of your system (the DAGs folder), in order to audit, control, and manage its security. You give your security people more work, because they have one more system with write access to manage, and one they are not familiar with (unlike standard Git, filesystems, or S3, where they have all the tools and practices they are used to). This is asking for trouble.
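The one-repo-per-DAG pattern from that talk can be sketched with plain git commands (the repo and file names below are made up for illustration): each team owns its own repo, a central repo aggregates them as submodules, and per-repo access control then gives you per-DAG write permissions for free.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Team A's DAG lives in its own repo -- write access is granted per repo.
git init -q --bare dag-a.git
git clone -q "$tmp/dag-a.git" work-a
( cd work-a \
  && echo "# team A's DAG" > my_dag.py \
  && git add my_dag.py \
  && git -c user.email=a@example.com -c user.name=a commit -qm "add dag" \
  && git push -q origin HEAD )

# The central repo only aggregates the per-DAG repos as submodules;
# the syncing side checks this repo out read-only, recursing into submodules.
# (protocol.file.allow=always is only needed for this local-path demo.)
git init -q central && cd central
git -c protocol.file.allow=always submodule add -q "$tmp/dag-a.git" dags/team-a
git -c user.email=c@example.com -c user.name=c commit -qm "track team A's DAG repo"
```

After this, `central/dags/team-a/my_dag.py` exists in the aggregate checkout, while only team A's credentials can push to `dag-a.git`.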
Have you actually asked your security people whether they would like to learn security management around a new component they are not familiar with? The fact is that you assume the security problem will magically disappear if you move it to another component, but it won't. It will just move elsewhere.

> 2. It would make DAG deployment friendlier to CI/CD scenarios, which in many organizations are under greater security and operational scrutiny.

Why? I do not understand this. Can you please explain why that is so? How is "cp X Y" or "git push" more complex than "find and zip all the files, make them an attachment, and run a complex curl command to push the file (and deal with all the potential errors it might cause)"? A REST API is particularly bad at pushing multiple file changes; it is not designed to accept multiple files. You literally have to compress them into a single blob and upload it, or (worse) build a complex multipart-encoded message and handle it somehow on the server side. Moreover, if you have a really big number of files and only some of them change, you either have to upload all of them or design an algorithm and API for partial or incremental uploads. All of these features you would need to implement on top of the REST API. On the other hand, Git, S3, and remote filesystems have fantastic, non-REST APIs that are designed to distribute multiple files; they were created with precisely this purpose in mind. Heck, Git even has sophisticated algorithms to send only incremental changes to literally millions of source files; it was designed to handle exactly (and only) that case. And DAGs are, well... a bunch of source files. Which API is best for synchronizing multiple source files between multiple machines? The answer is simple: Git.
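To make the packaging overhead concrete, here is a minimal sketch of what a REST-based deploy would force on every CI job. The endpoint URL and the `bundle` form field are hypothetical (no such Airflow endpoint exists, which is the point), so the upload step is left commented out; only the packaging step actually runs:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p dags/team_a
echo "print('etl dag')" > dags/team_a/etl.py
echo "print('report dag')" > dags/report.py

# Step 1: every deploy re-packages the WHOLE folder, changed or not.
tar -czf dags.tar.gz -C dags .

# Step 2: a multipart upload to a hypothetical endpoint, plus the retry and
# error handling the CI job now owns (commented out -- the endpoint is made up):
# curl -fsS -X POST -H "Authorization: Bearer $TOKEN" \
#      -F "bundle=@dags.tar.gz" https://airflow.example.com/api/v1/dags/upload

# The git equivalent needs no packaging and transfers only the delta:
# git add -A && git commit -m "update dags" && git push
```

The tarball grows with the whole DAGs folder even when one file changed, while git's pack protocol sends only the changed objects.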
It's the only one of the mentioned APIs that does everything, and it additionally lets you fine-grain access (via multiple repos and submodules), track history, and run diffs, and it comes with an established code-review process for those files, and much, much more.

I was tempted to simply close this issue, but I will convert it into a discussion instead. I might be biased, and maybe there are other reasons I have not thought about when thinking "a REST API is about the worst solution you can imagine for uploading a bunch of DAG files to Airflow". Happy to be challenged on that.
