potiuk commented on pull request #21145: URL: https://github.com/apache/airflow/pull/21145#issuecomment-1038178444
> @potiuk Could you explain more about this part? I could understand it fix the permission issue, but I don't understand what problems it is causing and why it has to be fixed this way. Also do we need to implement the same by running the command in subprocess module python? https://github.com/apache/airflow/blob/cca2f942ea9363303a041f00e0f180924cb840a9/scripts/ci/libraries/_permissions.sh#L19-L53 The problem to solve is very subtle. It is a result of a few combined issues. 1) Git does not store "group write" permissions for files you put in the repository. This is a historical decision also coming from the fact that "group" permissions only make sense for POSIX compliant systems (So not for Windows) 2) When the files are checked out, by default git uses "umask" of the current system to determine whether the group permissions to write are set or not https://git-scm.com/docs/git-config#Documentation/git-config.txt-coresharedRepository (that's the default behaviour when 'false' is set. 3) Unfortunately POSIX does not define what should be the default umask, It usually is either `0022` or `0002` - depending on the system. This basically mean that your file when you check it out with git will be "group read" when umask is `0022` but will be "group read-write" when umask is `0002'. More about umask and difference between those two setttings can be found here: https://www.cyberciti.biz/tips/understanding-linux-unix-umask-value-usage.html So when you check-out airflow project on `0022` system you will get this: ``` ls -la scripts/ci total 132 drwxr-xr-x 19 jarek jarek 4096 Feb 5 20:49 . drwxr-xr-x 8 jarek jarek 4096 Feb 5 20:49 .. drwxr-xr-x 2 jarek jarek 4096 Feb 5 20:49 build_airflow ``` But when you check-out airflow projecet on `0002` umask system, you will get this: (Note "w" in the group permission part): ``` ls -la scripts/ci total 132 drwxrwxr-x 19 jarek jarek 4096 Feb 5 20:49 . drwxrwxr-x 8 jarek jarek 4096 Feb 5 20:49 .. drwxrwxr-x 2 jarek jarek 4096 Feb 5 20:49 build_airflow ``` I think pretty much all Linux distributions use `0022` umask for root. But when it comes to "regular" users, it is different - Ubuntu up until recently used "0002" for regular users, and Debian/Mint "0022". The result is - we cannot rely on the "w" bit being set for files when users checks it out - it might or might not be set and it's beyond of our control when `git checkout` is executed. On it's own - this is not a problem, but it is a HUGE problem when it comes to caching Docker builds. The problem is, that when you build Docker image, docker actually uses the permissions on the file to determine whether a file changed or not. In the case above. those two "build_airflow" files will have identical content but Docker finds them "different" (becaus of the "w" permission). In this case whenever in Dockerfile we will use: ``` COPY scripts/ci/build_airlfow /opt/airflow/scripts/ci/build ``` Docker "thinks" that the file has changed and will simply invalidate the cache. So when we use remote cache (as we do) this means that docker will always rebuild the docker image from that step, because it will not realise that this is in fact the same file. This is a huge problem in our case, because in order to speed up local rebuilds of Breeze image. we use remote cache and we have to make sure that the file that was used to build airflow will be properly seen as "unchanged" by anyone who builds the image locally - in order to heavily optimize the build time (pulling cached layers is usually much faster than rebuilding them - especially that rebuild often involves pulling `pip` packages and `apt` packages. That's why we fix the problem just before building. We basically find all files and directories which are present in git repository and that have "w" bit set and remove the bit. After this operation none of the files that are going to be used for docker build will have the "w" bit set for them - and the cache is invalidated. And yeah. We have to do equivalent of this in Python. Submodule is one way but we could also work out similar approach using Pyhon. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
