> On Jan. 14, 2019, 4:31 p.m., Qian Zhang wrote:
> > src/linux/seccomp/seccomp.cpp
> > Lines 137-139 (patched)
> > <https://reviews.apache.org/r/68018/diff/14/?file=2117423#file2117423line137>
> >
> >     Will this affect the task run by Mesos? E.g., a task may want to run a 
> > program which has `set-user-ID` bit.
> 
> Andrei Budnik wrote:
>     Yes, `no_new_privs` flag affects the task that wants to run a program 
> which has `set-user-ID` bit.
>     E.g., launching a `ping -c 3 8.8.8.8` fails with seccomp. You'll see a 
> message in executor logs:
>     ```
>     I0114 07:19:21.887670 13264 executor.cpp:706] Forked command at 13276
>     ping: socket: Operation not permitted
>     I0114 07:19:22.055352 13263 executor.cpp:1007] Command exited with status 
> 2 (pid: 13276)
>     ```
>     
>     Also, see my previous comment 
> https://reviews.apache.org/r/68018/#comment297000
> 
> Qian Zhang wrote:
>     In your previous comment, you mentioned that Docker daemon launches its 
> containers with the `SCMP_FLTATR_CTL_NNP` flag set by default. Does that mean 
> any container launched by Docker daemon cannot run a program which has the 
> set-user-ID bit?
>     
>     This seems unfortunate since it might break some use cases or 
> applications that we already supported. And can you please elaborate a bit 
> about `"Disabling SCMP_FLTATR_CTL_NNP flag for a root means that Seccomp 
> filter can be reverted anytime"`? How will the Seccomp filter be reverted? Do 
> you mean the task launched by Mesos can call libseccomp API to revert the 
> filter itself?
>     
>     If we have to live with this limitation (i.e., cannot run a program which 
> has the set-user-ID bit), then we need to highlight it in the document.
> 
> Gilbert Song wrote:
>     Seems like we asked the same question.
>     
>     Andrei, let's align on this thread. Thanks :)
> 
> Andrei Budnik wrote:
>     > Does that mean any container launched by Docker daemon cannot run a 
> program which has the set-user-ID bit?
>     
>     Docker daemon cannot be used to run arbitrary programs (as opposed to the 
> Mesos containerizer). So, when one launches a Docker container, Docker daemon 
> launches a container process with `NNP` bit set, which means that a container 
> process (and its descendants) can't gain more privileges **outside** its 
> container. Mesos containerizer has exactly the same behaviour:
>     
>     1) Run system-provided `/bin/ping` (*outside* its container) as a 
> non-privileged user:
>     ```
>     $ ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=mesos --command="ping -c 3 8.8.8.8"
>     ...
>     Received status update TASK_FAILED for task 'a'
>       message: 'Command exited with status 2'
>       source: SOURCE_EXECUTOR
>     ```
>     
>     2) Run system-provided `/bin/ping` (*outside* its container) as a 
> privileged user:
>     ```
>     sudo ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=mesos --command="ping -c 3 8.8.8.8"
>     ...
>     Received status update TASK_FINISHED for task 'a'
>       message: 'Command exited with status 0'
>       source: SOURCE_EXECUTOR
>     ```
>     
>     3) Run container image provided `ping` (*inside* its image/container) as 
> a non-privileged user:
>     ```
>     $ ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=mesos --docker_image="fedora:latest" --command="yum -y 
> install iputils;ping -c 3 8.8.8.8"
>     ...
>     Received status update TASK_FINISHED for task 'a'
>       message: 'Command exited with status 0'
>       source: SOURCE_EXECUTOR
>     
>     $ cat /path/to/container/stdout
>     ...
>     PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
>     64 bytes from 8.8.8.8: icmp_seq=1 ttl=122 time=13.9 ms
>     ```
>     
>     > This seems unfortunate since it might break some use cases or 
> applications that we already supported.
>     
>     It's very unlikely that the agent launches tasks whose binary has the 
> `setuid`/`setgid` bit set. Because... what's the point?
>     I doubt that any of the following programs are launched as a Mesos container:
>     ```
>     $ sudo find /bin/ -perm -u=s -type f 2>/dev/null
>     /bin/newgrp
>     /bin/pkexec
>     /bin/mount
>     /bin/umount
>     /bin/newuidmap
>     /bin/newgidmap
>     /bin/sudo
>     /bin/crontab
>     /bin/su
>     /bin/gpasswd
>     /bin/chage
>     /bin/passwd
>     /bin/staprun
>     /bin/fusermount
>     /bin/fusermount-glusterfs
>     /bin/chfn
>     /bin/chsh
>     /bin/at
>     ```
>     
>     > And can you please elaborate a bit about "Disabling SCMP_FLTATR_CTL_NNP 
> flag for a root means that Seccomp filter can be reverted anytime"? How will 
> the Seccomp filter be reverted? Do you mean the task launched by Mesos can 
> call libseccomp API to revert the filter itself?
>     
>     Yes, without `NNP` (`no_new_privs`) bit set, a privileged task might call 
> `seccomp` Linux syscall to install an empty Seccomp filter.
> 
> Qian Zhang wrote:
>     > Run system-provided /bin/ping (outside its container) as a 
> non-privileged user:
>     
>     As you mentioned in the above comment, this task will fail, but that's 
> **after** your seccomp patches are applied. Before your seccomp patches are 
> applied (e.g., I am using the latest code in Mesos master branch), it will 
> succeed:
>     ```
>     $ ./src/mesos-execute --master=192.168.56.5:5050 --name=test 
> --command="ping -c 3 8.8.8.8" --checkpoint  
>     I0116 10:15:02.699398 14271 scheduler.cpp:189] Version: 1.8.0
>     I0116 10:15:02.977327 14287 scheduler.cpp:355] Using default 'basic' HTTP 
> authenticatee
>     I0116 10:15:02.979837 14285 scheduler.cpp:538] New master detected at 
> master@192.168.56.5:5050
>     Subscribed with ID ea9488e1-a171-423f-8eb5-4d70187349fb-0001
>     Submitted task 'test' to agent '12866186-dc2b-48a9-88ad-f9d951cf8c7f-S0'
>     Received status update TASK_STARTING for task 'test'
>       source: SOURCE_EXECUTOR
>     Received status update TASK_RUNNING for task 'test'
>       source: SOURCE_EXECUTOR
>     Received status update TASK_FINISHED for task 'test'
>       message: 'Command exited with status 0'
>       source: SOURCE_EXECUTOR
>     ```
>     To me, this breaks an existing feature, i.e., some previously supported 
> use cases or applications will fail after your seccomp patches are applied.
>     
>     > when one launches a Docker container, Docker daemon launches a 
> container process with NNP bit set, which means that a container process (and 
> its descendants) can't gain more privileges outside its container.
>     
>     This does not match what I found with Docker. I created a Docker image with 
> ping installed and a non-root user added:
>     ```
>     FROM ubuntu:18.04
>     
>     RUN apt-get update && apt-get install -y iputils-ping
>     RUN adduser --disabled-password --gecos "" stack
>     ```
>     
>     And then I created a Docker container from that image with the non-root 
> user, and I found ping worked.
>     ```
>     docker run --rm -it --user stack ubuntu:stack sh   
>     $ id 
>     uid=1000(stack) gid=1000(stack) groups=1000(stack)
>     $ ls -la /bin/ping 
>     -rwsr-xr-x. 1 root root 64424 Mar  9  2017 /bin/ping
>     $ ping 8.8.8.8 
>     PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
>     64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=3.25 ms
>     64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=3.20 ms
>     64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=3.48 ms
>     ^C
>     --- 8.8.8.8 ping statistics ---
>     3 packets transmitted, 3 received, 0% packet loss, time 2002ms
>     rtt min/avg/max/mdev = 3.200/3.312/3.481/0.121 ms
>     ```
>     So Docker daemon actually can create a container to run a program which 
> has the set-user-ID bit. I am a bit confused about the impact of the 
> `SCMP_FLTATR_CTL_NNP` flag which, as you mentioned, is set by Docker daemon 
> for its containers.
> 
> Andrei Budnik wrote:
>     The example you have provided with Docker daemon is identical to the 3rd 
> case from my previous comment:
>     ```
>     $ ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=mesos --docker_image="fedora:latest" --command="yum -y 
> install iputils;ping -c 3 8.8.8.8"
>     ```
>     We behave in this case exactly as Docker.
>     
>     At the same time, the first two cases are not supported by Docker, but 
> are supported by Mesos containerizer. Hence the difference in behaviour.
>     
>     Anyway, we need to set `NNP` bit both for a non-privileged user 
> (otherwise, we have no permissions to install Seccomp filter - more details 
> in seccomp man page) and for privileged user (otherwise, it does not make 
> sense to install a Seccomp filter as it can be easily reverted later).
> 
> Andrei Budnik wrote:
>     I will highlight this nuance in Seccomp documentation.
> 
> Andrei Budnik wrote:
>     Added a note in the Seccomp doc: 
> https://reviews.apache.org/r/69493/diff/4-5/

> The example you have provided with Docker daemon is identical to the 3rd case 
> from my previous comment:

I think they are different: the `ping` binary I installed with the `ubuntu` 
image has the set-user-ID bit, but the `ping` binary you installed with the 
`fedora` image has **no** set-user-ID bit. So my example proves Docker daemon 
actually can create a container with a non-root user to run a program which has 
the set-user-ID bit. Can you please try your 3rd case with the `ubuntu` image? 
If it fails, then I think that's not acceptable, since the same use case can be 
supported by Docker but not by us.

> Added a note in the Seccomp doc: https://reviews.apache.org/r/69493/diff/4-5/

I see you added the statement below in the doc:
```
So, when a framework wants to launch an OS-provided `ping` task as a 
non-privileged user, the task will fail.
```
My concern is: when a framework wants to launch an image-provided (e.g., ubuntu 
image) `ping` task as a non-privileged user, will the task fail too? And why do 
we need to care about OS-provided versus image-provided? I think the point 
should be whether the binary that the task will execute (no matter whether it 
is OS-provided or image-provided) has the set-user-ID bit or not, right?
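
To make that criterion concrete, here is a minimal sketch (Python, not part of 
the patch) that tests for the set-user-ID bit on a file, independent of where 
the file comes from. The helper names `mode_has_setuid` and `path_has_setuid` 
are made up for illustration, and `/bin/ping` is only an example path that may 
not exist on every host or image.

```python
import os
import stat


def mode_has_setuid(mode: int) -> bool:
    """Return True if the set-user-ID bit is present in an st_mode value."""
    return bool(mode & stat.S_ISUID)


def path_has_setuid(path: str) -> bool:
    """Return True if the file at `path` has the set-user-ID bit."""
    return mode_has_setuid(os.stat(path).st_mode)


if __name__ == "__main__":
    # Example path only; adjust for the host or the container image rootfs.
    candidate = "/bin/ping"
    if os.path.exists(candidate):
        print(candidate, path_has_setuid(candidate))
```

The `-rwsr-xr-x` mode shown in the `ls -la /bin/ping` output above is the kind 
of mode for which `mode_has_setuid` returns true.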

> So, when one launches a Docker container, Docker daemon launches a container 
> process with NNP bit set

This does not match what I found with Docker:
```
$ docker run --rm --user operator alpine sleep 1000
$ ps -ef | grep sleep 
stack    25409 23826  0 10:49 pts/0    00:00:00 docker run --rm --user operator 
alpine sleep 1000
11       25478 25455  0 10:49 ?        00:00:00 sleep 1000
$ cat /proc/25478/status | grep NoNewPrivs
NoNewPrivs:     0
```
So as you see, the NNP bit is **not** set for the container process. I think it 
will only be set when one specifies `--security-opt="no-new-privileges:true"` 
when launching a Docker container.
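
The same check can be scripted. Below is a small Python sketch that extracts 
the `NoNewPrivs` field from the text of `/proc/<pid>/status`. The function 
names `parse_no_new_privs` and `read_no_new_privs` are hypothetical, and the 
sketch assumes a Linux kernel that exposes that field, as the one above does.

```python
from typing import Optional


def parse_no_new_privs(status_text: str) -> Optional[int]:
    """Return the NoNewPrivs value from /proc/<pid>/status text.

    Returns None when the field is absent.
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "NoNewPrivs":
            return int(value.strip())
    return None


def read_no_new_privs(pid: int) -> Optional[int]:
    """Read and parse /proc/<pid>/status for the given PID (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_no_new_privs(f.read())
```

For the `sleep` process above, `read_no_new_privs(25478)` would correspond to 
the `NoNewPrivs:     0` line printed by `grep`.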


> we need to set NNP bit both for a non-privileged user (otherwise, we have no 
> permissions to install Seccomp filter - more details in seccomp man page)

Can you please elaborate a bit on why a Seccomp filter cannot be installed for 
a non-privileged user if the NNP bit is not set? That does not seem to be true 
for Docker: Docker daemon can install the Seccomp filter defined in the default 
Seccomp profile without the NNP bit set. I can create a Docker container 
successfully with a command like `"docker run --rm -it --user operator 
--security-opt seccomp=/home/stack/workspace/mesos/build/default.json 
--security-opt="no-new-privileges:false" alpine sh"`.


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68018/#review211946
-----------------------------------------------------------


On Nov. 8, 2018, 11:24 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68018/
> -----------------------------------------------------------
> 
> (Updated Nov. 8, 2018, 11:24 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, James Peach, and Qian Zhang.
> 
> 
> Bugs: MESOS-9034
>     https://issues.apache.org/jira/browse/MESOS-9034
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `SeccompFilter` class is a wrapper for `libseccomp` API. Its main
> purpose is to provide a translation of the `ContainerSeccompProfile`
> message into calls of `libseccomp` API.
> 
> 
> Diffs
> -----
> 
>   src/CMakeLists.txt a574d449dc26b820cbef7ff0b5e94b42b6fe86cf 
>   src/Makefile.am cd785255fcdf1302a8f9fa358039e5d1f200e132 
>   src/linux/seccomp/seccomp.hpp PRE-CREATION 
>   src/linux/seccomp/seccomp.cpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/68018/diff/16/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>
