potiuk commented on issue #27476: URL: https://github.com/apache/airflow/issues/27476#issuecomment-1305868800
> What do you mean "running git-sync for all the components"?

When persistence is disabled, git-sync runs as a sidecar container for the workers and the scheduler, all of them. It also runs as an init container first, to make sure that you have a fully synced DAG folder before the component starts. (When you use persistence, this depends on how distributed your setup is, and your remote filesystem can introduce significant latency when distributing files. There is no guarantee about what your worker sees locally, for example. It could see no DAGs at all if you hit a race condition where git-sync has not yet completed the update in the first of your components.)

> Again, git-sync runs inside scheduler pod. How do you provide DAGs to webserver pod and worker pods?

No. Without persistence, git-sync runs as a sidecar container wherever it is needed. Just note that your statement is not accurate: you do not need DAGs in the webserver in Airflow 2 (and it does not have the persistent volume mapped to it either). Only the workers, the scheduler, the triggerer and the DAG file processor (if it is run standalone) need them. In the latter case (when the DAG file processors run standalone), the scheduler does not need the DAG folder either.

> Using git-sync in all of them, doesn't solve the "network issue".

It does. It generates far less network traffic (likely several orders of magnitude less, in fact), and you do not pay for it. Also, the problem with shared/persistent volumes is that schedulers continuously read and re-read the files when scheduling. Continuously. All the time. No matter whether they changed. This generates a lot of traffic if you have a lot of files mounted on a persistent (i.e. networked) volume: even if the files have not changed, just accessing them takes round trips to a server, and with a lot of files this is very costly (in terms of money, but also in terms of latency). A side effect is that picking up new files is severely delayed, even in not-that-extreme cases, and the stability of the connections suffers.
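
A rough back-of-the-envelope sketch of that difference, in Python. All the numbers are illustrative assumptions rather than measurements: 1000 DAG files, each re-read every 30 seconds, a git-sync poll every 60 seconds, and 20 DAG changes per day. The function names are hypothetical, and each file re-read is modeled as a single round trip, which if anything understates the cost on a real networked filesystem.

```python
SECONDS_PER_DAY = 86_400


def networked_volume_round_trips_per_day(num_files: int, reread_interval_s: int) -> int:
    """Round trips when every re-read of every file goes to a file server.

    Assumption: one round trip per file per re-read (a lower bound).
    """
    rereads_per_day = SECONDS_PER_DAY // reread_interval_s
    return num_files * rereads_per_day


def git_sync_requests_per_day(poll_interval_s: int, dag_changes_per_day: int) -> int:
    """Network requests when files are read from local disk.

    Assumption: one "anything changed?" request per poll, plus one
    download per change. Scanning the files causes no network traffic.
    """
    polls_per_day = SECONDS_PER_DAY // poll_interval_s
    return polls_per_day + dag_changes_per_day


# Illustrative inputs only (assumptions, not measured values).
networked = networked_volume_round_trips_per_day(num_files=1000, reread_interval_s=30)
git_sync = git_sync_requests_per_day(poll_interval_s=60, dag_changes_per_day=20)

print(f"networked volume: {networked:,} round trips per day")
print(f"git-sync:         {git_sync:,} requests per day")
print(f"ratio:            about {networked / git_sync:,.0f}x")
```

With these assumed inputs the gap comes out at roughly three orders of magnitude, and it grows with the number of files, because the git-sync side does not depend on how many files the scheduler scans.
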
Many of our users paid a lot of money for EFS IOPS to overcome that latency. Git-sync has none of these problems. The only traffic happens when we a) make a single HTTP call every minute or so to check whether anything changed, and b) download files when they have changed. All the scanning by the scheduler is done locally. This generates orders of magnitude less traffic, simply because the scheduler reads the files thousands of times more often than the authors change the DAGs. Git-sync simply fits the Airflow use case far better. This is all explained in detail in my article.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
