potiuk edited a comment on pull request #21145:
URL: https://github.com/apache/airflow/pull/21145#issuecomment-1024884489


   > @potiuk Could you share your views about getting host user id and group 
id? Do we have to find its equivalent in windows to make this work? Also, can 
you explain to me more about why we need to find the host user id and group id 
and how it's used? There are a few comments about it in the code, but I 
couldn't fully understand it.
   
   This is only really needed on Linux. On Windows/MacOS they are not needed 
and can be empty (I believe - to be checked).
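   
   A minimal Python sketch of collecting those ids could look like this. The 
function name `host_ids_env` and the empty-string fallback for non-Linux 
systems are assumptions for illustration, not the actual code from this PR:

```python
import os
import platform


def host_ids_env(system=None):
    """Return HOST_USER_ID / HOST_GROUP_ID values to pass to the container.

    Only Linux gets real ids. Elsewhere the values are left empty, which is
    an assumption that still needs to be checked on Windows and MacOS.
    """
    system = system or platform.system()
    if system == "Linux" and hasattr(os, "getuid"):
        # os.getuid() / os.getgid() are only available on Unix-like systems.
        return {
            "HOST_USER_ID": str(os.getuid()),
            "HOST_GROUP_ID": str(os.getgid()),
        }
    return {"HOST_USER_ID": "", "HOST_GROUP_ID": ""}
```

   The returned dictionary could then be merged into the environment passed to 
the `docker` command, so no Windows equivalent of the user id is needed.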
   
   The reason is that on Linux, files that are mounted from the host into the 
container are mounted using the native filesystem. This basically means that any 
file created inside the container keeps, on the host as well, the user id and 
group id that were used in the container.
   
   For example, if we have user id 50001 and group id 501 in the container, any 
file we create in the container will keep the same user id and group id on the 
host. But those user/group ids might not exist on the host: if we create a 
user 50001 in the container, the id stays the same on the host after we 
exit from the container. This is very problematic on Linux because when we map 
the "logs" directory and some logs (and directories) are created there, they 
might be owned by a non-existing user after we exit. And we want to be able to 
see the logs outside of the container, because that's where we usually have 
the IDE and that's where we read and analyse them.
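   
   To see this on the host, a small sketch like the one below can report who 
owns a file. `describe_owner` is a hypothetical helper name; it relies on the 
Unix-only `pwd` module, so it is meant for Linux hosts:

```python
import os
import pwd  # Unix-only standard library module


def describe_owner(path):
    """Return (uid, gid, user_name) for path.

    user_name is None when the uid has no entry in the local user database,
    which is what happens for ids that only exist inside the container.
    """
    st = os.lstat(path)
    try:
        user_name = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # No such user on this machine.
        user_name = None
    return st.st_uid, st.st_gid, user_name
```

   Running it on a log file created by a container user that does not exist 
on the host would be expected to return `None` as the user name.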
   
   Then, the problem is that if you want to delete such folders and files, you 
need to use `sudo` on the host, because your regular user has no access to them. 
This is a big problem, especially if files are created inside your source 
directory (which is also mounted to the container): for example, it will 
prevent you from switching branches easily, because git will not be able to 
remove some files and will refuse to switch branches.
   
   There is also the "reverse" problem: if you create files on the host without 
"all" permissions, mount them inside the container, and the container runs 
as a "different" user, the user in the container cannot access those files 
(unless you run as root inside the container - root inside the container is 
equivalent to root on the host and can access and update all files). 
   
   This can be mitigated by "user remapping" - 
https://docs.docker.com/engine/security/userns-remap/ - but this can only be 
configured at the "docker daemon" level, and this is something we should not 
require an average user to do. Another problem with user remapping is that 
it is a "global" setting: it will remap your user for all containers, and in 
many cases this is not what you really want. 
   
   So in order to avoid that, we do a few things:
   
   a) We use the `root` user in the container - all the files are created and 
everything runs as the root user. This is not recommended for production but it 
is great for CI, because you can freely create and read any mounted files (no 
matter what user owns them), you can also run pip/apt etc. without sudo, and it 
is generally much more convenient for many development tasks. The side effect 
is that all files created in the container have the root user/group set.
   
   b) We pass HOST_USER_ID and HOST_GROUP_ID to the container, so that we know 
who the user on the host is. Depending on the Linux distro, and even depending 
on your configuration (how many users you have created and in which sequence), 
the UID can be different.
   
   c) When the user enters the container, we set a "trap": 
`add_trap "in_container_fix_ownership" EXIT HUP INT TERM`. This trap runs the 
"fix_ownership" script that looks for all created files in the directories 
where we expect files to be created:
   
   ```
               "/files"
               "/root/.aws"
               "/root/.azure"
               "/root/.config/gcloud"
               "/root/.docker"
               "/opt/airflow/logs"
               "/opt/airflow/docs"
               "/opt/airflow/dags"
               "${AIRFLOW_SOURCES}"
   ```
   
   Whenever we exit or terminate the container, this script is executed. It 
finds all files owned by "root" in those directories and changes their 
ownership to HOST_USER_ID/HOST_GROUP_ID. This way, when you exit the container 
on Linux, the files are owned by the host user and can be easily deleted - 
either manually or when you change branches.
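   
   The logic of that step can be sketched in Python as below. This is not the 
actual "fix_ownership" script, only an illustration of the idea; the function 
name, its parameters, and the injectable `chown` argument (there so the logic 
can be tried without root) are assumptions:

```python
import os


def fix_ownership(directories, host_uid, host_gid, owner_uid=0, chown=os.lchown):
    """Hand every entry owned by owner_uid (root by default) to the host user.

    Returns the list of paths whose ownership was changed.
    """
    changed = []
    for top in directories:
        if not os.path.isdir(top):
            continue  # this directory is not present in the current run
        # Collect the directory itself plus everything below it.
        candidates = [top]
        for root, dirs, files in os.walk(top):
            candidates.extend(os.path.join(root, name) for name in dirs + files)
        for path in candidates:
            # lstat/lchown act on symlinks themselves, not on their targets.
            if os.lstat(path).st_uid == owner_uid:
                chown(path, host_uid, host_gid)
                changed.append(path)
    return changed
```

   Inside the container it would be called with the ids parsed from 
HOST_USER_ID and HOST_GROUP_ID, and skipped entirely when those are empty.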
   
   On MacOS and Windows this is not needed. Both MacOS and Windows use 
"user-space" filesystems to mount files. These filesystems are far slower than 
the native filesystem (many times slower, actually), which impacts the speed of 
running Airflow in a docker container on MacOS and Windows. However, they 
automatically remap the user - all the files created inside the containers are 
automatically remapped to have the "host" user ownership, so there is no need 
to fix the ownership in those cases. 
   
   
   I hope it is clearer now. I will create an ADR out of that comment :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

