Michal Hocko <mho...@kernel.org> writes:
> On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
>> During exec dumpable is cleared if the file that is being executed is
>> not readable by the user executing the file. A bug in
>> ptrace_may_access allows reading the file if the executable happens to
>> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
>> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
>> This problem is fixed with only necessary userspace breakage by adding
>> a user namespace owner to mm_struct, captured at the time of exec,
>> so it is clear in which user namespace CAP_SYS_PTRACE must be present
>> in to be able to safely give read permission to the executable.
>> The function ptrace_may_access is modified to verify that the ptracer
>> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
>> This ensures that if the task changes it's cred into a subordinate
>> user namespace it does not become ptraceable.
> I haven't studied your patch too deeply but one thing that immediately
> raised a red flag was that mm might be shared between processes (aka
> thread groups). What prevents those two to sit in different user
> I am primarily asking because this generated a lot of headache for the
> memcg handling as those processes might sit in different cgroups while
> there is only one correct memcg for them which can disagree with the
> cgroup associated with one of the processes.
That is a legitimate concern, but I do not see any of those kinds of
Part of the memcg pain comes from the fact that control groups are
process centric, and part of the pain comes from the fact that it is
possible to change control groups. What I am doing is making the mm
owned by a user namespace (at creation time), and I am not allowing
changes to that ownership. The credentials of the tasks that use that mm
may be in the same user namespace or descendent user namespaces.
The core goal is to enforce the unreadability of an mm when an
non-readable file is executed. This is a time of mm creation property.
The enforcement of which fits very well with the security/permission
checking role of the user namespace.
Could this use of mm->user_ns be extended for some kind of
accounting/limiting in the future? Possibly. I can imagine a limit on
the total number of page table entries a group of processes are allowed
to have as being a sane kind of limit in this setting much like
RLIMIT_AS is sane on a single mm level. Pages don't belong to mm's so I
can't imagine anything like the memcg being built on this kind of