potiuk commented on issue #24740:
URL: https://github.com/apache/airflow/issues/24740#issuecomment-1170853057

   
   > 1. It's yet another point of failure, another credential at risk. I'd 
rather restrict source access to my CI servers, which need such privilege 
anyway.
   
   This is falling into the fallacy of thinking that things "magically work" in distributed systems. They don't. You already have those credentials in ANY scenario here; they just live elsewhere. Airflow is a distributed system, and uploading DAG files via "REST" does not make the credentials needed to distribute the files disappear. They still need to exist - just elsewhere.
   
   * With GitSync, S3, or a shared filesystem, you have either explicit or implicit credentials that you need to give to every component involved in the syncing process. They might be explicit (provided in the GitSync config), an S3 environment variable, a filesystem configuration, or a metadata server in the cloud - plenty of options. But this is essentially one computer talking to another over a network. It HAS TO have credentials - there is no way around it.
   
   * What's worse, with your proposed solution, the REST API will at most upload the files to a single machine (say, the webserver). That machine then has to have credentials (implicit or explicit) to actually WRITE those DAG files to all the other components. Not only does that make it a single point of failure and a far more "centralized" system that has to deal with all the problems of distributing the files to potentially multiple physical machines, it DOES NOT remove the problem of having a credential on that machine. What's more, it is far more dangerous, because now the Airflow webserver (or whatever component handles the API) has to have WRITE access to the DAGs. This is SUPER dangerous. In the current setup there is no need whatsoever for write access to Git/S3/shared filesystems by any Airflow component; they only need READ access. And the management of who is able to write the DAGs where is completely delegated out to the deployment. Airflow has zero knowledge of, and zero impact on, that (which allows you, for example, to implement something the Composer team just did: per-folder write access to sub-folders of the DAGs directory). With the REST API, you essentially bring all those write-access capabilities, and the whole management of them, into Airflow. This is not something we are going to accept, I think; it is a major change in the whole security model Airflow is built on.
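   To make the read-only point concrete, this is roughly what the GitSync side of a Kubernetes deployment looks like - a sketch loosely based on the official Airflow Helm chart's values (repo and secret names here are made up). The only credential any Airflow component holds is a read-only deploy key:

```yaml
# values.yaml fragment (illustrative, not a complete chart config):
dags:
  gitSync:
    enabled: true
    repo: ssh://git@github.com/example/airflow-dags.git  # made-up repo
    branch: main
    subPath: "dags"
    # Secret holding a READ-ONLY deploy key; no Airflow component
    # ever needs (or gets) write access to the repo.
    sshKeySecret: airflow-ssh-git-secret
```

   Who gets to *write* to that repo is decided entirely outside Airflow, in whatever access-control system already governs your Git hosting.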
   
   > 2. There's no easy way of giving fine-grained access to the contents of a Git repo. Hence, using `git sync` means giving Airflow access to entire repos. That's especially worrisome for folks using monorepos.
   
   Nope. This is a completely wrong assumption. Look at this talk from JAGEX from the Airflow Summit. They use submodules and implemented a system where they have "ONE REPO PER DAG(!)", which they easily manage using a central repo with submodules, and they are extremely happy with it: https://youtu.be/uA-8Lj1RNgw. Look how happy and smiling Anum was when he explained how it solved all their problems with managing DAGs at scale across multiple teams. Learn from the best and stand on the shoulders of giants, I would say.
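   For a flavor of what that looks like: the central repo's `.gitmodules` simply pins one repo per DAG (repo names below are made up for illustration, not JAGEX's actual setup), and each team gets write access only to its own repo:

```ini
# .gitmodules in the central "all DAGs" repo: one entry per DAG/team.
[submodule "dags/sales"]
	path = dags/sales
	url = git@github.com:example/dag-sales.git
[submodule "dags/billing"]
	path = dags/billing
	url = git@github.com:example/dag-billing.git
```

   The central repo only pins revisions; `git submodule update --remote dags/sales` is how a new revision from the sales team gets picked up, and that step can itself be gated by review.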
    
   > 1. It would prevent another component of my infrastructure needing access 
to source.
   
   Instead it forces the Airflow component that takes the file in to have write access. This just moves the problem elsewhere; it does not solve it. You have to manage your own deployment anyway - it does not free you from that responsibility. By doing this, you are asking your security people to learn a new component (the Airflow API server), one that has write access to the most crucial part of your system (the DAGs folder), in order to audit, control, and manage its security. You give your security people more work, because they have one more system with write access to manage, and one they are not familiar with (unlike standard Git, filesystems, or S3, where they already have all the tools and practices they know). This is asking for trouble. Have you actually asked your security people whether they would like to learn security management around a new component they are not familiar with? The fact is that you assume the security problem will magically disappear if you move it to another component - but it won't. It will just move elsewhere.
   
   > 2. It would make DAG deployment friendlier to CI/CD scenarios, which in many organizations are under greater security and operational scrutiny.
   
   Why? I do not understand. Can you please explain why this is so?
   
   How is "cp X Y" or "git push" more complex than "find and zip all the files, make them an attachment, and run a complex curl command to push the file (and deal with all potential errors it might cause)"? A REST API is particularly bad at pushing multiple file changes. It is not designed to accept multiple files: you literally have to compress them into a single blob and upload that, or (worse) build a complex multipart-encoded message and handle it somehow on the server side. Moreover, if you had a really big number of files and only some of them changed, you would either have to upload all of them or figure out an algorithm and an API for partial or incremental uploads. All those features you would need to implement on top of the REST API.
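   Just to show the client-side machinery such an upload endpoint would demand - a sketch, where the endpoint URL and field name are entirely hypothetical (no such API exists in Airflow):

```python
# Sketch of what a client would need to do to push a DAGs folder
# through a hypothetical "POST /dags/upload" REST endpoint.
import io
import pathlib
import zipfile


def bundle_dags(dag_dir: str) -> bytes:
    """Collect every .py file under dag_dir into one zip blob, because a
    REST endpoint cannot take a directory tree of files directly."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(pathlib.Path(dag_dir).rglob("*.py")):
            zf.write(path, path.relative_to(dag_dir))
    return buf.getvalue()


# The blob would then go out as one opaque multipart upload, e.g.:
#   requests.post("https://airflow.example.com/api/v1/dags/upload",
#                 files={"bundle": ("dags.zip", bundle_dags("dags/"))})
# Auth, retries, error reporting, partial-failure recovery: all yours
# to build, versus a single "git push" or "aws s3 sync".
```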
   
   On the other hand, git, S3, and remote filesystems have fantastic, non-REST APIs that are designed to distribute multiple files. They were created with precisely this purpose in mind. Heck, Git even has sophisticated algorithms to send only the incremental changes across literally millions of source files. It was designed to handle exactly that (and only that) case. And DAGs are, well... a bunch of source files. Which API is best for synchronizing multiple source files between multiple machines? The answer is simple: Git. It is the only one of the mentioned APIs that does everything - and it additionally lets you fine-grain access (via multiple repos and submodules), track history, run diffs, has a built-in code-review process for those files, and much, much more.
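   As a sketch of just one of those features: the incremental-sync bookkeeping that git gives you for free is roughly this, which a REST upload API would have to reinvent server- and client-side (function names here are illustrative):

```python
# A poor man's version of the "only send what changed" logic git
# performs natively via its content-addressed object store.
import hashlib
import pathlib


def manifest(dag_dir: str) -> dict[str, str]:
    """Map each DAG file to a content hash - a crude git tree."""
    return {
        str(p.relative_to(dag_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in pathlib.Path(dag_dir).rglob("*.py")
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Files added or modified since the last sync - i.e. what must be
    re-uploaded. (Deletions would need yet more protocol on top.)"""
    return {name for name, digest in new.items() if old.get(name) != digest}
```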
   
   I was tempted to simply close this issue, but I will convert it into a discussion instead. I might be biased, and maybe there are other reasons I have not thought about when thinking "a REST API is about the worst solution you can imagine for uploading a bunch of DAG files to Airflow". Happy to be challenged on that.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to