potiuk commented on issue #27476:
URL: https://github.com/apache/airflow/issues/27476#issuecomment-1305868800

   > What do you mean "running git-sync for all the components"?
   
   When persistence is disabled, git-sync runs as a sidecar container for the 
workers and the scheduler - all of them. It also runs as an init container 
first, to make sure the DAG folder is fully synced before the component starts. 
(When you use persistence, this depends on how distributed your setup is, and 
your remote filesystem can introduce significant latency when distributing 
files - there is no guarantee about what your worker sees locally, for example. 
It could see no DAGs at all if there is a race condition where git-sync has not 
yet completed the update in your first component.)
   
   > Again, git-sync runs inside scheduler pod. How do you provide DAGs to 
webserver pod and worker pods?
   No. Without persistence, git-sync runs as a sidecar container wherever it is 
needed. 
   Note that your statement is not accurate: you do not need DAGs in the 
webserver in Airflow 2, and it has no persistent volume mapped to it either. 
Only the workers, scheduler, triggerer and DAG file processor (if it is run 
standalone) need the DAG folder. In the latter case (when the DAG file 
processors run standalone), the scheduler does not need the DAG folder either.
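   
   The rule above can be written down as a small sketch. The function name 
and the component labels are made up for illustration; they are not Airflow or 
Helm chart identifiers.

```python
def components_needing_dags(standalone_dag_processor: bool) -> set[str]:
    """Return the Airflow 2 components that need the DAG folder locally.

    Hypothetical helper: it only encodes the rule described in this comment.
    The webserver is never included.
    """
    needed = {"worker", "triggerer"}
    if standalone_dag_processor:
        # The standalone DAG file processor parses the files,
        # so the scheduler no longer needs the folder.
        needed.add("dag-processor")
    else:
        # The scheduler parses the DAG files itself.
        needed.add("scheduler")
    return needed
```

   Without persistence, these are the components that would get a git-sync 
sidecar.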
   
   > Using git-sync in all of them, doesn't solve the "network issue".
   
   It does - it involves far less network traffic (likely several orders of 
magnitude less, in fact), and you do not pay for it. Also, the problem with 
shared/persistent volumes is that the scheduler continuously reads and re-reads 
the files when scheduling. Continuously. All the time. No matter whether they 
changed. This generates a lot of traffic if you have a lot of files and they 
are mounted on a persistent (i.e. networked) volume - even if the files have 
not changed, just accessing them takes round trips to a server, and when you 
have a lot of files this is very costly (in terms of money but also in terms of 
latency). A side effect is that picking up new files is severely delayed, even 
in not particularly extreme cases, and the stability of the connections 
suffers. Many of our users paid a lot of money for EFS IOPS to overcome the 
latency.
   Git-sync has none of these problems. The only traffic happens when we a) 
make a single HTTP call to check whether anything changed, every minute or so, 
and b) download files when they have changed. All the scanning by the scheduler 
is done locally. This generates orders of magnitude less traffic - simply 
because the scheduler reads the files thousands of times more frequently than 
the authors change the DAGs. Git-sync simply fits the Airflow use case far 
better. This is all explained in detail in my article.
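   
   To get a feel for the difference, here is a back-of-envelope sketch. Every 
number in it (file count, scan interval, poll interval, change rate) is an 
illustrative assumption, not a measurement, and the model is deliberately 
crude: it counts one network operation per file access on a networked volume, 
and one per poll or downloaded file for git-sync.

```python
# Back-of-envelope comparison; all inputs below are illustrative assumptions.

SECONDS_PER_DAY = 24 * 60 * 60


def networked_volume_ops_per_day(num_files: int, scan_interval_s: float) -> int:
    """Remote file accesses per day if every scan touches every file over the network."""
    scans_per_day = SECONDS_PER_DAY / scan_interval_s
    return int(num_files * scans_per_day)


def git_sync_ops_per_day(
    poll_interval_s: float, changes_per_day: int, files_per_change: int
) -> int:
    """Network operations per day: one poll per interval plus changed-file downloads."""
    polls_per_day = SECONDS_PER_DAY / poll_interval_s
    return int(polls_per_day + changes_per_day * files_per_change)


# Assumed setup: 1000 DAG files re-read every 30 seconds on a networked volume,
# versus git-sync polling once a minute with 10 changes a day of 5 files each.
volume_ops = networked_volume_ops_per_day(num_files=1000, scan_interval_s=30)
sync_ops = git_sync_ops_per_day(
    poll_interval_s=60, changes_per_day=10, files_per_change=5
)

print(f"networked volume: {volume_ops} ops/day")
print(f"git-sync:         {sync_ops} ops/day")
print(f"ratio:            {volume_ops / sync_ops:.0f}x")
```

   With these assumed inputs, the networked volume comes out roughly three 
orders of magnitude busier. The exact ratio depends entirely on the numbers you 
plug in.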
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
