potiuk commented on pull request #21145:
URL: https://github.com/apache/airflow/pull/21145#issuecomment-1038178444


   > @potiuk Could you explain more about this part? I could understand it fix 
the permission issue, but I don't understand what problems it is causing and 
why it has to be fixed this way. Also do we need to implement the same by 
running the command in subprocess module python? 
   
   
https://github.com/apache/airflow/blob/cca2f942ea9363303a041f00e0f180924cb840a9/scripts/ci/libraries/_permissions.sh#L19-L53
   
   The problem to solve is very subtle. It is a result of a few combined 
issues. 
   
   1) Git does not store "group write" permissions for files you put in the 
repository. This is a historical decision also coming from the fact that 
"group" permissions only make sense for POSIX compliant systems (So not for 
Windows)
   
   2) When the files are checked out, by default git uses "umask" of the 
current system to determine whether the group permissions to write are set or 
not 
https://git-scm.com/docs/git-config#Documentation/git-config.txt-coresharedRepository
  (that's the default behaviour when 'false' is set.
   
   3) Unfortunately POSIX does not define what should be the default umask, It 
usually is either `0022` or `0002` - depending on the system. This basically 
mean that your file when you check it out with git will be "group read" when 
umask is `0022` but will be "group read-write" when umask is `0002'. More about 
umask and difference between those two setttings can be found here: 
https://www.cyberciti.biz/tips/understanding-linux-unix-umask-value-usage.html
   
   So when you check-out airflow project on `0022` system you will get this:
   ```
   ls -la scripts/ci 
   total 132
   drwxr-xr-x 19 jarek jarek  4096 Feb  5 20:49 .
   drwxr-xr-x  8 jarek jarek  4096 Feb  5 20:49 ..
   drwxr-xr-x  2 jarek jarek  4096 Feb  5 20:49 build_airflow
   ```
   
   But when you check-out airflow projecet on `0002` umask system, you will get 
this: (Note "w" in the group permission part):
   ```
   ls -la scripts/ci 
   total 132
   drwxrwxr-x 19 jarek jarek  4096 Feb  5 20:49 .
   drwxrwxr-x  8 jarek jarek  4096 Feb  5 20:49 ..
   drwxrwxr-x  2 jarek jarek  4096 Feb  5 20:49 build_airflow
   ```
   
   I think pretty much all Linux distributions use `0022` umask for root. But 
when it comes to "regular" users, it is different - Ubuntu  up until recently 
used "0002" for regular users, and Debian/Mint "0022". The result is - we 
cannot rely on the "w" bit being set for files when users checks it out - it 
might or might not be set and it's beyond of our control when `git checkout` is 
executed.
   
   On it's own - this is not a problem, but it is a HUGE problem when it comes 
to caching Docker builds. The problem is, that when you build Docker image, 
docker actually uses the permissions on the file to determine whether a file 
changed or not. In the case above. those two "build_airflow" files will have 
identical content but Docker finds them "different"  (becaus of the "w" 
permission). 
   
   In this case whenever in Dockerfile we will use:
   
   ```
   COPY scripts/ci/build_airlfow /opt/airflow/scripts/ci/build
   ```
   
   Docker "thinks" that the file has changed and will simply invalidate the 
cache. So when we use remote cache (as we do) this means that docker will 
always rebuild the docker image from that step, because it will not realise 
that this is in fact the same file.  This is a huge problem in our case, 
because in order to speed up local rebuilds of Breeze image. we use remote 
cache and we have to make sure that the file that was used to build airflow 
will be properly seen as "unchanged" by anyone who builds the image locally - 
in order to heavily optimize the build time (pulling cached layers is usually 
much faster than rebuilding them - especially that rebuild often involves 
pulling `pip` packages and `apt` packages.
   
   That's why we fix the problem just before building. We basically find all 
files and directories which are present in git repository and that have "w" bit 
set and remove the bit. After this operation none of the files that are going 
to be used for docker build will have the "w" bit set for them - and the cache 
is invalidated.
   
   And yeah. We have to do equivalent of this in Python. Submodule is one way 
but we could also work out similar approach using Pyhon.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to