potiuk commented on issue #27476:
URL: https://github.com/apache/airflow/issues/27476#issuecomment-1314042656

   > We default to 60s (which is a mistake imo, PR incoming!). You can't go 
beyond "0" without some risk of task failures from being out of sync
   
   There is no magic here, again. Shared volumes do not give you any sync 
guarantees either, because there are no atomicity guarantees. Actually, it is 
FAR worse (and our users have suffered from it).
   
   With shared volumes (as far as I know), this risk is far greater in most 
cases. You can have partially synchronized directories and partially 
synchronized files - with Amazon EFS and low IOPS, consistency of your local 
copy of the DAG folder is simply impossible, because NFS synchronizes and 
flushes each file separately. With any sizeable shared volume the problem is 
FAR worse than a delay - you can locally have a new DAG importing an old 
version of a library, or a new library imported by an old DAG, and absolutely 
no control over that. You can see arbitrary snapshots of arbitrary versions of 
the files that were copied to the shared volume at the other end. This is a 
major source of instability for users with EFS and low IOPS, and it likely 
causes occasional untraceable errors when race conditions happen even if your 
IOPS are high.
   
   I think the effect of that is way worse than "slight" sync delays - 
especially if you are not aware of it.
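   To make that failure mode concrete, here is a tiny, hypothetical 
illustration (file names made up) of what a non-atomic, file-by-file sync lets 
a reader observe in between copies:

```shell
# Hypothetical illustration of a file-by-file sync (the NFS/EFS failure
# mode described above): a reader can observe a mixed snapshot.
mkdir -p remote local
printf 'VERSION = 2\n' > remote/lib.py
printf 'import lib  # expects VERSION = 2\n' > remote/dag.py

# A file-by-file sync happens to copy dag.py first...
cp remote/dag.py local/dag.py
# ...and at this instant a scheduler reading "local/" sees the NEW dag.py
# next to an OLD (or missing) lib.py - exactly the inconsistent snapshot.
ls local/
# Only after the second copy completes is the snapshot consistent again.
cp remote/lib.py local/lib.py
```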
   
   In case of GitSync, because of the atomic replacement (the symbolic link for 
the whole DAG folder is swapped in one step), we at least have a guarantee of 
consistency.
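   For anyone not familiar with it, the swap works roughly like this - a 
sketch, not git-sync's actual code, with made-up directory names (`mv -T` is a 
GNU coreutils flag; the rename is what makes the swap atomic):

```shell
# Sketch of the atomic-swap pattern GitSync uses. Readers always open
# paths through the "current" symlink; a single rename(2) flips them
# from the old checkout to the new one atomically.
mkdir -p repo/rev-old repo/rev-new
echo 'old dag' > repo/rev-old/dag.py
ln -s rev-old repo/current            # readers resolve repo/current/dag.py

echo 'new dag' > repo/rev-new/dag.py  # full new copy built alongside the old
ln -s rev-new repo/current.tmp        # prepare the new link off to the side
mv -T repo/current.tmp repo/current   # atomic rename: no mixed state visible
cat repo/current/dag.py               # prints: new dag
```

   At no point does a reader see a half-copied folder: the path resolves to 
the complete old checkout right up until the rename, and to the complete new 
one right after.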
   
   And it gets FAR worse when you combine GitSync + shared volumes. Far, far, 
far worse - precisely because GitSync does this atomic replacement. The way 
GitSync works, there are points in time where it has exactly TWO FULL COPIES 
of the DAG folder: one old and one new. When GitSync retrieves a new commit, a 
copy of the FULL DAG FOLDER is created, and when the pull completes, the 
symbolic link is repointed to that new copy. That "git sync" process is 
extremely heavy on shared folders. Assume NFS:
   
   1) the new files are written (and NFS starts syncing them while git sync is 
still working)
   2) once the pull is finished (NFS is still syncing), git sync swaps the 
symbolic link to the new directory
   3) now NFS effectively has to delete all the files reachable via the old 
link and replace them with the files from the new link - they are effectively 
different files, with no relation whatsoever, so NFS has to synchronize all of 
them again
   4) all of this while the files are continuously being read at multiple ends.
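   A quick way to see why the swap looks like a wholesale delete + recreate to 
an NFS client (hypothetical layout; GNU `stat`/`ln` flags assumed): after the 
swap, the very same path resolves to a completely different inode, so as far 
as the filesystem is concerned nothing is "the same file" anymore:

```shell
# Two full checkouts with identical content - but every file is a new inode.
mkdir -p dags/rev-a dags/rev-b
echo 'dag body' > dags/rev-a/my_dag.py
cp dags/rev-a/my_dag.py dags/rev-b/my_dag.py
ln -s rev-a dags/current

stat -c '%i' dags/current/my_dag.py   # inode reached through the old link
ln -sfn rev-b dags/current            # the GitSync-style swap
stat -c '%i' dags/current/my_dag.py   # same path, different inode: to NFS
                                      # this is a delete + a brand-new file
```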
   
   The effect is that GitSync, even with a single line of change, causes an 
avalanche of changes in an NFS-based system. Basically, the whole DAG folder 
is deleted and recreated from scratch on the remote ends with every single 
commit.
   
   Of course, various versions of NFS and shared volumes have some 
optimisations, but none of them is prepared for the scenario where suddenly 
the whole, huge DAG folder is replaced by another one (which is what git sync 
does with every single incoming change). 
   
   > Generally, yes, but I'm not ready to say "never". 
   
   Well, considering the way the Airflow Scheduler accesses the files, how the 
GitSync atomic swap + a shared volume cause an avalanche of traffic and 
communication, and that shared volumes do not provide ANY consistency 
guarantee - I am quite ready to say "never" for "GitSync + Shared Volumes".
   
   > Don't get me wrong though, I'm not going to die on this hill. I'm more of 
a "bake it into the image" + KubernetesExecutor guy anyways man_shrugging.
   
   I am also not dying on that hill. I just very strongly advocate for it.
   

