Re: [git pull] mount API series

2018-10-31 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> I am going to stop there.  I believe there are more issues in the code.
> I am relieved that I am not seeing the loss of some of the security
> hooks that I thought I saw last time I looked at the code.

Bah.  Now I see the missing security hook.

There are a set of security hooks that allow security modules to parse
mount options.

On a good day they look like:

security_mnt_opts opts;
char *secdata;

secdata = alloc_secdata();
security_sb_copy_data("a,mount,options,string", secdata);

security_init_mnt_opts(&opts);
security_parse_opts_str(secdata, &opts);
security_set_mnt_opts(sb, &opts, 0, NULL);
security_free_mnt_opts(&opts);

In practice, however, things are not that explicit, with
security_sb_kern_mount performing all of the mnt_opts work.

After the rewrite in the patchset, the function sb_kern_mount no longer
exists, and its replacement sb_get_tree out of necessity does not call
parse_opts_str, because the mount options can no longer be passed as a
string.

The legacy compatibility code also does not call sb_parse_opts_str.

The result is that, using the existing APIs, all of the security module
mount option string parsing (except for btrfs and nfs) no longer works.
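
For reference, a rough sketch of what the legacy compatibility path
would have to keep doing so that the string-based LSM parsing still
runs (using the hook names from the example above; "data" stands for
the option string from mount(2), and this is not code from the series):

struct security_mnt_opts opts;
int rc;

security_init_mnt_opts(&opts);
if (data) {
        /* the mount option string still has to reach the LSM */
        rc = security_parse_opts_str(data, &opts);
        if (rc)
                goto out;
}
rc = security_set_mnt_opts(sb, &opts, 0, NULL);
out:
security_free_mnt_opts(&opts);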


The changes are not structured in a way that makes any of this easy to
find, which is why I have been saying I wouldn't do it that way.  This
pattern also repeats throughout the patches: replacing code with
something brand new instead of evolving what is there.  That makes it
easy for this kind of thing to slip through.

Eric


Re: [git pull] mount API series

2018-10-31 Thread Eric W. Biederman
Al Viro  writes:

>   mount API series from David Howells.  Last cycle's objections
> had been of the "I'd do it differently" variety and with no such
> differently done variants having ever materialized over several
> cycles...

Absolutely not.

My objection fundamentally is that I can find real problems when I look
at the code.  Further, the changes have not been incremental changes
that evolve the code from one state to another, but complete
replacements of code that make code review very difficult and bisection
completely inapplicable.

I also object that this series completely fails to fix the worst bug I
have ever seen in the mount API, with no real interest shown in working
to fix it.

A couple of bugs that I can see quickly, several of which I have
previously reported:

- There is an easily triggered NULL pointer dereference with open_tree
  and mount propagation.

- Bisection will not work with the cpuset filesystem patch.  At least
  cpuset looks like it may be mountable now.

- The setting of fc->user_ns on proc remains broken.  In particular, if you
  create a child user namespace and attempt to mount proc, it will succeed
  instead of failing.

- The mqueue filesystem has the same issue as proc: fc->user_ns is set incorrectly.

I suspect I didn't report it well, but I reported both the proc and the
mqueue filesystem issues weeks before the merge window opened.
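
To illustrate the proc point, here is a minimal sketch of the kind of
check I would expect, assuming the fs_context fields from the series
(this is not the series' code, and the function name is made up for the
sketch):

static int proc_fs_context_init(struct fs_context *fc)
{
        struct pid_namespace *pid_ns = task_active_pid_ns(current);

        /* pin the context to the pid namespace's owner, not the mounter */
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(pid_ns->user_ns);
        return 0;
}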


I am going to stop there.  I believe there are more issues in the code.
I am relieved that I am not seeing the loss of some of the security
hooks that I thought I saw last time I looked at the code.


Given both that I have reported bugs that remain unfixed and the
non-evolutionary nature that makes this patchset hard to review, I have
no confidence in this set of patches.

I think the scope has been too large for anyone to properly review the
code, and it should be broken into smaller pieces.  That there were
significant open_tree and move_mount issues being found just before the
merge window opened seems a testament to that (i.e. the first 3 or so
patches in the patchset, which are fundamental to the rest).


My apologies that we have not been able to communicate better and that I
have not been able to clearly point out all of the bugs.  My apologies
that I have not yet been able to give more constructive feedback.

Still at the end of the day there remain regressions of existing
functionality in that tree.

Eric


Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Eric W. Biederman
Christian Brauner  writes:

> On Tue, Oct 30, 2018 at 12:12 PM Daniel Colascione  wrote:
>>
>> On Tue, Oct 30, 2018 at 11:04 AM, Christian Brauner
>>  wrote:
>> > On Tue, Oct 30, 2018 at 11:48 AM Daniel Colascione  
>> > wrote:
>> >>
>> >> Why not?
>> >>
>> >> Does your proposed API allow for a race-free pkill, with arbitrary
>> >> selection criteria? This capability is a good litmus test for fixing
>> >> the long-standing Unix process API issues.
>> >
>> > You'd have a handle on the process with an fd so yes, it would be.
>>
>> Thanks. That's good to hear.
>>
>> Any idea on the timetable for this proposal? I'm open to lots of
>> alternative technical approaches, but I don't want this capability to
>> languish for a long time.
>
> Latest end of year likely sooner depending on the feedback I'm getting
> during LPC.

Frankly, if you want a race-free fork variant, probably the easiest
thing to do is to return an open copy of the proc directory entry,
wrapped in a bind mount so that you can't see beyond that directory in
proc.
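
As a purely hypothetical usage sketch, assuming such a fork variant
handed back a descriptor for the child's /proc/<pid> directory
("procfd" below is just a placeholder name for that descriptor):

/* procfd names this child, not a pid number, so pid reuse cannot
 * redirect later operations to some other process */
int statusfd = openat(procfd, "status", O_RDONLY | O_CLOEXEC);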

My only concern would be if a vfsmount per process would be too heavy for such 
a use.

Eric





Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Eric W. Biederman
Aleksa Sarai  writes:

> On 2018-10-29, Daniel Colascione  wrote:
>> Add a simple proc-based kill interface. To use /proc/pid/kill, just
>> write the signal number in base-10 ASCII to the kill file of the
>> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>> 
>> Semantically, /proc/pid/kill works like kill(2), except that the
>> process ID comes from the proc filesystem context instead of from an
>> explicit system call parameter. This way, it's possible to avoid races
>> between inspecting some aspect of a process and that process's PID
>> being reused for some other process.
>
> (Aside from any UX concerns other folks might have.)
>
> I think it would be a good idea to (at least temporarily) restrict this
> so that only processes that are in the same PID namespace as the /proc
> being resolved through may use this interface. Otherwise you might have
> cases where partial container breakouts can start sending signals to
> PIDs they wouldn't normally be able to address.

No.  That is the container manager's job.  If you have the wrong proc
mounted in your container or otherwise allow access to it, that is the
fault of the application that set up the container.

The pid namespace limits visibility.  If something becomes visible and
you have permissions over it, it is perfectly reasonable for you to
exercise those permissions.

Eric


Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Eric W. Biederman
Daniel Colascione  writes:

> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".
>
> #!/bin/bash
> set -euo pipefail
> pat=$1
> for proc_status in /proc/*/status; do (
> cd $(dirname $proc_status)
> readarray proc_argv -d'' < cmdline
> if ((${#proc_argv[@]} > 0)) &&
>[[ ${proc_argv[0]} = *$pat* ]];
> then
> echo 15 > kill
> fi
> ) || true; done
>

In general this looks good.

Unfortunately the permission checks are subject to a serious
problem.  Even if I don't have permission to kill a process, I quite
likely will be allowed to open the file.

Then I just need to find a setuid or setcap executable that will write
a number to stdout or stderr.  Then I have killed something I should not
have the privileges to kill.

At a bare minimum you need to perform the permission check using the
credentials of the opener of the file, which means refactoring
kill_pid so that you can perform the permission check for killing the
application during open.  Given that process credentials can change
completely during exec, you also need to rule out a change in process
credentials making it so that the original opener of the file would not
be able to kill the process as it is now.
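
A rough sketch of the shape of that refactoring (the permission helper
below is an assumption for illustration, not an existing kernel
function):

static int proc_pid_kill_open(struct inode *inode, struct file *file)
{
        struct task_struct *task = get_proc_task(inode);
        int rc = -ESRCH;

        if (task) {
                /* would the opener's credentials allow a kill(pid, 0)? */
                rc = kill_permission_check(task, file->f_cred);
                put_task_struct(task);
        }
        return rc;
}

The write handler would then additionally have to verify that the
target's credentials have not changed (for example across a setuid
exec) since the open.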

But overall this looks quite reasonable.

Eric

> Signed-off-by: Daniel Colascione 
> ---
>  fs/proc/base.c | 39 +++
>  1 file changed, 39 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..923d62b21e67 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -205,6 +205,44 @@ static int proc_root_link(struct dentry *dentry, struct 
> path *path)
>   return result;
>  }
>  
> +static ssize_t proc_pid_kill_write(struct file *file,
> +const char __user *buf,
> +size_t count, loff_t *ppos)
> +{
> + ssize_t res;
> + int sig;
> + char buffer[4];
> +
> + res = -EINVAL;
> + if (*ppos != 0)
> + goto out;
> +
> + res = -EINVAL;
> + if (count > sizeof(buffer) - 1)
> + goto out;
> +
> + res = -EFAULT;
> + if (copy_from_user(buffer, buf, count))
> + goto out;
> +
> + buffer[count] = '\0';
> + res = kstrtoint(strstrip(buffer), 10, &sig);
> + if (res)
> + goto out;
> +
> + res = kill_pid(proc_pid(file_inode(file)), sig, 0);
> + if (res)
> + goto out;
> + res = count;
> +out:
> + return res;
> +
> +}
> +
> +static const struct file_operations proc_pid_kill_ops = {
> + .write  = proc_pid_kill_write,
> +};
> +
>  static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
> size_t count, loff_t *ppos)
>  {
> @@ -2935,6 +2973,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
>   ONE("syscall",S_IRUSR, proc_pid_syscall),
>  #endif
> + REG("kill",   S_IRUGO | S_IWUGO, proc_pid_kill_ops),
>   REG("cmdline",S_IRUGO, proc_pid_cmdline_ops),
>   ONE("stat",   S_IRUGO, proc_tgid_stat),
>   ONE("statm",  S_IRUGO, proc_pid_statm),


Re: [PATCH v2] proc: use ns_capable instead of capable for timerslack_ns

2018-10-30 Thread Eric W. Biederman
Benjamin Gordon  writes:

> Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
> in its effective capability set, but the current check looks in the root
> namespace instead of the process' user namespace.  Since a process is
> allowed to do other activities controlled by CAP_SYS_NICE inside a
> namespace, it should also be able to adjust timerslack_ns.

Acked-by: "Eric W. Biederman" 

I don't see any fundamental problems with how the process's user
namespace is being accessed.  You can race with setns,
and that may result in a descendant user namespace of the current
user namespace being set.  But if you have permissions in the parent
user namespace you will have permissions over a child user namespace.
So the race can't affect the outcome of the ns_capable test.

That, and while __task_cred(p) may change, it is guaranteed there is a
valid one until __put_task_struct, which only happens when a process has
a zero refcount; the success of get_proc_task before these checks
already ensures that is not the case here.

So from my perspective this looks like a reasonable change.

I don't know how this looks to people who understand the timer bits
and what timerslack does.  I suspect it is reasonable, as there is no
permission check for changing yourself.

Eric

> Signed-off-by: Benjamin Gordon 
> Cc: John Stultz 
> Cc: "Eric W. Biederman" 
> Cc: Kees Cook 
> Cc: "Serge E. Hallyn" 
> Cc: Thomas Gleixner 
> Cc: Arjan van de Ven 
> Cc: Oren Laadan 
> Cc: Ruchi Kandoi 
> Cc: Rom Lemarchand 
> Cc: Todd Kjos 
> Cc: Colin Cross 
> Cc: Nick Kralevich 
> Cc: Dmitry Shmidt 
> Cc: Elliott Hughes 
> Cc: Android Kernel Team 
> Cc: Andrew Morton 
> ---
>
> Changes from v1:
>   - Use the namespace of the target process instead of the file opener.
> Didn't carry over John Stultz' Acked-by since the changes aren't
> cosmetic.
>
>  fs/proc/base.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index c78d8da09b52c..bdc093ba81dd3 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2385,10 +2385,13 @@ static ssize_t timerslack_ns_write(struct file *file, 
> const char __user *buf,
>   return -ESRCH;
>  
>   if (p != current) {
> - if (!capable(CAP_SYS_NICE)) {
> + rcu_read_lock();
> + if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
> + rcu_read_unlock();
>   count = -EPERM;
>   goto out;
>   }
> + rcu_read_unlock();
>  
>   err = security_task_setscheduler(p);
>   if (err) {
> @@ -2421,11 +2424,14 @@ static int timerslack_ns_show(struct seq_file *m, 
> void *v)
>   return -ESRCH;
>  
>   if (p != current) {
> -
> - if (!capable(CAP_SYS_NICE)) {
> + rcu_read_lock();
> + if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
> + rcu_read_unlock();
>   err = -EPERM;
>   goto out;
>   }
> + rcu_read_unlock();
> +
>   err = security_task_getscheduler(p);
>   if (err)
>   goto out;


Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Eric W. Biederman
Thomas Gleixner  writes:

> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
>> When a container is migrated to another host, we have to restore its
>> monotonic and boottime clocks, but we still expect that the container
>> will continue using the host real-time clock.
>> 
>> Before stating this series, I was thinking about this, I decided that
>> these cases can be solved independently. Probably, the full isolation of
>> the time sub-system will have much higher overhead than just offsets for
>> a few clocks. And the idea that isolation of the real-time clock should
>> be optional gives us another hint that offsets for monotonic and
>> boot-time clocks can be implemented independently.
>> 
>> Eric and Tomas, what do you think about this? If you agree that these
>> two cases can be implemented separately, what should we do with this
>> series to make it ready to be merged?
>> 
>> I know that we need to:
>> 
>> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
>> * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
>
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
>
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to disentangle after the
> fact.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

I will be, and there are going to be both container and CRIU
mini-conferences, so there should be at least some of us around.

Eric


Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Eric W. Biederman
Jann Horn  writes:

> On Wed, Oct 24, 2018 at 3:30 PM Eric W. Biederman  
> wrote:
>> Enke Chen  writes:
>> > For simplicity and consistency, this patch provides an implementation
>> > for signal-based fault notification prior to the coredump of a child
>> > process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> > be used by an application to express its interest and to specify the
>> > signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> > signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>> >
>> > Changes to prctl(2):
>> >
>> >PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>> >   Set the child pre-coredump signal of the calling process to
>> >   arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>> >   This is the signal that the calling process will get prior to
>> >   the coredump of a child process. This value is cleared across
>> >   execve(2), or for the child of a fork(2).
>> >
>> >   When SIGCHLD is specified, the signal code will be set to
>> >   CLD_PREDUMP in such an SIGCHLD signal.
> [...]
>> Ugh.  Your test case is even using signalfd.  So you don't even want
>> this signal to be delivered as a signal.
>
> Just to make sure everyone's on the same page: You're suggesting that
> it might make sense to deliver the pre-dump notification via a new
> type of file instead (along the lines of signalfd, timerfd, eventfd
> and so on)?

My real complaint was that the API was not being tested in the way it
is expected to be used, which makes a test pretty much useless: some
aspect of userspace could regress and the test would not notice, because
it is testing something different.



I do think that a file descriptor based API might be a good alternative
to a signal based API.  The proc connector and signals are not the only
API solution.

The common solution to this problem is that distributions default the
rlimit core file size to 0.

Eric


Re: [PATCH] proc: use ns_capable instead of capable for timerslack_ns

2018-10-25 Thread Eric W. Biederman
bmgor...@google.com writes:

> From: Benjamin Gordon 
>
> Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
> in its effective capability set, but the current check looks in the root
> namespace instead of the process' user namespace.  Since a process is
> allowed to do other activities controlled by CAP_SYS_NICE inside a
> namespace, it should also be able to adjust timerslack_ns.

The goal seems legitimate.  However the permission checks look wrong.

In particular the choice of user namespace should be
"p->cred->user_ns".  This will limit this to tasks that have
CAP_SYS_NICE in the same namespace as the task that is being modified.

Testing file->f_cred->user_ns tests whoever opened the file, and
that could be anyone.
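
In other words, something along these lines (a sketch; essentially what
v2 of this patch ends up doing):

        rcu_read_lock();
        if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
                rcu_read_unlock();
                return -EPERM;
        }
        rcu_read_unlock();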

Eric


> Signed-off-by: Benjamin Gordon 
> Cc: John Stultz 
> Cc: Kees Cook 
> Cc: "Serge E. Hallyn" 
> Cc: Thomas Gleixner 
> Cc: Arjan van de Ven 
> Cc: Oren Laadan 
> Cc: Ruchi Kandoi 
> Cc: Rom Lemarchand 
> Cc: Todd Kjos 
> Cc: Colin Cross 
> Cc: Nick Kralevich 
> Cc: Dmitry Shmidt 
> Cc: Elliott Hughes 
> Cc: Android Kernel Team 
> Cc: Andrew Morton 
> ---
>  fs/proc/base.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..4b50937dff80 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2356,7 +2356,7 @@ static ssize_t timerslack_ns_write(struct file *file, 
> const char __user *buf,
>   return -ESRCH;
>  
>   if (p != current) {
> - if (!capable(CAP_SYS_NICE)) {
> + if (!ns_capable(file->f_cred->user_ns, CAP_SYS_NICE)) {
>   count = -EPERM;
>   goto out;
>   }
> @@ -2393,7 +2393,7 @@ static int timerslack_ns_show(struct seq_file *m, void 
> *v)
>  
>   if (p != current) {
>  
> - if (!capable(CAP_SYS_NICE)) {
> + if (!ns_capable(seq_user_ns(m), CAP_SYS_NICE)) {
>   err = -EPERM;
>   goto out;
>   }



Re: [PATCH] ARM: mm: Facilitate debugging CONFIG_KUSER_HELPERS disabled

2018-10-25 Thread Eric W. Biederman
Florian Fainelli  writes:

> Some software such as perf makes unconditional use of the special
> [vectors] page which is only provided when CONFIG_KUSER_HELPERS is
> enabled in the kernel.
>
> Facilitate the debugging of such situations by printing a debug message
> to the kernel log showing the task name and the faulting address.

Can't someone trigger this segv deliberately and spam the kernel log?
Doesn't this need something like printk_ratelimit, or something to ensure
this message only gets printed once?
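
For example, using the existing ratelimited printk helper rather than
inventing a new mechanism:

        printk_ratelimited(KERN_DEBUG "%s: CONFIG_KUSER_HELPERS disabled at 0x%08lx\n",
                           tsk->comm, addr);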

Eric

> Suggested-by: Russell King 
> Signed-off-by: Florian Fainelli 
> ---
>  arch/arm/mm/fault.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index f4ea4c62c613..f17471fbc1c4 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -173,6 +173,11 @@ __do_user_fault(struct task_struct *tsk, unsigned long 
> addr,
>   show_regs(regs);
>   }
>  #endif
> +#ifndef CONFIG_KUSER_HELPERS
> + if ((sig == SIGSEGV) && ((addr & PAGE_MASK) == 0x))
> + printk(KERN_DEBUG "%s: CONFIG_KUSER_HELPERS disabled at 0x%08lx\n",
> +tsk->comm, addr);
> +#endif
>  
>   tsk->thread.address = addr;
>   tsk->thread.error_code = fsr;



Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Eric W. Biederman
Enke Chen  writes:

> Hi, Eric:
>
> Thanks for your comments. Please see my replies inline.
>
> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>> Enke Chen  writes:
>> 
>>> For simplicity and consistency, this patch provides an implementation
>>> for signal-based fault notification prior to the coredump of a child
>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>> be used by an application to express its interest and to specify the
>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Changes to prctl(2):
>>>
>>>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>   Set the child pre-coredump signal of the calling process to
>>>   arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>>>   This is the signal that the calling process will get prior to
>>>   the coredump of a child process. This value is cleared across
>>>   execve(2), or for the child of a fork(2).
>>>
>>>   When SIGCHLD is specified, the signal code will be set to
>>>   CLD_PREDUMP in such an SIGCHLD signal.
>> 
>> Your signal handling is still not right.  Please read and comprehend
>> siginfo_layout.
>> 
>> You have not filled in all of the required fields for the SIGCHLD case.
>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>> very wrong.  This is not a user generated signal.
>> 
>> Let me say this slowly.  The pair si_signo si_code determines the union
>> member of struct siginfo.  That needs to be handled consistently. You
>> aren't.  I just finished fixing this up in the entire kernel and now you
>> are trying to add a usage that is worse than most of the bugs I have
>> fixed.  I really don't appreciate having to deal with new bugs.
>> 
>
> My apologies. I will investigate and make them consistent.
>
>> 
>> 
>> Further siginfo can be dropped.  Multiple signals with the same signal
>> number can be consolidated.  What is your plan for dealing with that?
>
> The primary application for the early notification involves a process
> manager which is responsible for re-spawning processes or initiating
> the control-plane fail-over. There are two models:
>
> One model is to have 1:1 relationship between a process manager and
> application process. There can only be one predump-signal (say, SIGUSR1)
> from the child to the parent, and will unlikely be dropped or consolidated.
>
> Another model is to have 1:N where there is only one process manager with
> multiple application processes. One of the RT signal can be used to help
> make it more reliable.

That suggests you want one of the negative si_codes, and to use the _rt
siginfo member like sigqueue.

>> Other code paths pair with wait to get the information out.  There
>> is no equivalent of wait in your code.
>
> I was not aware of that before.  Let me investigate.
>
>> 
>> Signals can be delayed by quite a bit, scheduling delays etc.  They can
>> not provide any meaningful kind of real time notification.
>> 
>
> The timing requirement is about 50-100 msecs for BFD.  Not sure if that
> qualifies as "real time".  This mechanism has worked well in deployment
> over the years.

It would help if those numbers were put into the patch description so
people can tell if the mechanism is quick enough.

>> So between delays and loss of information signals appear to be a very
>> poor fit for this usecase.
>> 
>> I am concerned about code that does not fit the usecase well because
>> such code winds up as code that no one cares about that must be
>> maintained indefinitely, because somewhere out there there is one use
>> that would break if the interface was removed.  This does not feel like
>> an interface people will want to use and maintain in proper working
>> order forever.
>> 
>> Ugh.  Your test case is even using signalfd.  So you don't even want
>> this signal to be delivered as a signal.
>
> I actually tested sigaction()/waitpid() as well. If there is a preference,
> I can check in the sigaction()/waitpid() version instead.
>
>> 
>> You add an interface that takes a pointer and you don't add a compat
>> interface.  See Oleg's point of just returning the signal number in the
>> return code.
>
> This is what Oleg said "but I won't insist, this is subjective and cosmetic".
>
> It is no big deal either way. It just seems less work if we do not keep
> add


Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-24 Thread Eric W. Biederman
Enke Chen  writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Changes to prctl(2):
>
>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>   Set the child pre-coredump signal of the calling process to
>   arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>   This is the signal that the calling process will get prior to
>   the coredump of a child process. This value is cleared across
>   execve(2), or for the child of a fork(2).
>
>   When SIGCHLD is specified, the signal code will be set to
>   CLD_PREDUMP in such an SIGCHLD signal.

Your signal handling is still not right.  Please read and comprehend
siginfo_layout.

You have not filled in all of the required fields for the SIGCHLD case.
For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
very wrong.  This is not a user generated signal.

Let me say this slowly.  The pair si_signo si_code determines the union
member of struct siginfo.  That needs to be handled consistently. You
aren't.  I just finished fixing this up in the entire kernel and now you
are trying to add a usage that is worse than most of the bugs I have
fixed.  I really don't appreciate having to deal with new bugs.
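
For illustration, a rough sketch of what a consistent
SIGCHLD/CLD_PREDUMP siginfo would have to carry (the _sigchld layout
that the si_signo/si_code pair selects; this is not the patch's code):

        kernel_siginfo_t info;

        clear_siginfo(&info);
        info.si_signo  = SIGCHLD;
        info.si_code   = CLD_PREDUMP;
        info.si_pid    = task_tgid_nr_ns(tsk, task_active_pid_ns(tsk->parent));
        info.si_uid    = from_kuid_munged(task_cred_xxx(tsk->parent, user_ns),
                                          task_uid(tsk));
        info.si_status = siginfo->si_signo; /* the signal causing the coredump */
        /* si_utime/si_stime belong to this layout too; see do_notify_parent() */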



Further siginfo can be dropped.  Multiple signals with the same signal
number can be consolidated.  What is your plan for dealing with that?
Other code paths pair with wait to get the information out.  There
is no equivalent of wait in your code.

Signals can be delayed by quite a bit, scheduling delays etc.  They can
not provide any meaningful kind of real time notification.

So between delays and loss of information signals appear to be a very
poor fit for this usecase.

I am concerned about code that does not fit the usecase well because
such code winds up as code that no one cares about that must be
maintained indefinitely, because somewhere out there there is one use
that would break if the interface was removed.  This does not feel like
an interface people will want to use and maintain in proper working
order forever.

Ugh.  Your test case is even using signalfd.  So you don't even want
this signal to be delivered as a signal.

You add an interface that takes a pointer and you don't add a compat
interface.  See Oleg's point of just returning the signal number in the
return code.

Now I am wondering how well prctl works from a 32bit process on a 64bit
kernel.  At first glance it looks like it probably does not work.

Consistency with PDEATHSIG is not a good argument for anything.
PDEATHSIG at the present time is unusable in the real world by most
applications that want something like it.

So far I see an interface that even you don't want to use as designed,
that is implemented incorrectly.

The concern is real and deserves to be addressed.  I don't think signals
are the right way to handle it, and certainly not this patch as it
stands.

Eric


> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e42e17e..f11e31f 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   struct cred *cred;
>   int retval = 0;
>   int ispipe;
> + bool notify;
>   struct files_struct *displaced;
>   /* require nonrelative corefile path and be extra careful */
>   bool need_suid_safe = false;
> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   if (retval < 0)
>   goto fail_creds;
>  
> + /*
> +  * Send the pre-coredump signal to the parent if requested.
> +  */
> + read_lock(&tasklist_lock);
> + notify = do_notify_parent_predump(current);
> + read_unlock(&tasklist_lock);
> + if (notify)
> + cond_resched();
> +
>   old_cred = override_creds(cred);
>  
>   ispipe = format_corename(, );
> diff --git a/fs/exec.c b/fs/exec.c
> index fc281b7..7714da7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
>   /* we have changed execution domain */
>   

[GIT PULL] siginfo updates for 4.20-rc1

2018-10-22 Thread Eric W. Biederman


Linus,

Please pull the siginfo-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
siginfo-linus

   HEAD: a36700589b85443e28170be59fa11c8a104130a5 signal: Guard against 
negative signal numbers in copy_siginfo_from_user32

I have been slowly sorting out siginfo and this is the culmination of that work.

The primary result is in several ways the signal infrastructure has been made
less error prone.  The code has been updated so that manually specifying
SEND_SIG_FORCED is never necessary.  The conversion to the new siginfo sending
functions is now complete, which makes it difficult to send a signal without
filling in the proper siginfo fields.

At the tail end of the patchset comes the optimization of decreasing the size of
struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit.  The
fundamental observation that enables this is by definition none of the known
ways to use struct siginfo uses the extra bytes.

This comes at the cost of a small user space observable difference.  For the
rare case of siginfo being injected into the kernel only what can be copied
into kernel_siginfo is delivered to the destination, the rest of the bytes are
set to 0.  For cases where the signal and the si_code are known this is safe,
because we know those bytes are not used.  For cases where the signal and si_code
combination is unknown the bits that won't fit into struct kernel_siginfo are
tested to verify they are zero, and the send fails if they are not.
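
Roughly, that check amounts to something like the helper below (illustrative
only, not the exact code in the tree):

	static int siginfo_tail_is_zero(const struct siginfo __user *from)
	{
		/*
		 * Userspace siginfo is larger than kernel_siginfo.  For an
		 * unrecognized signo/si_code combination, every byte that does
		 * not fit in kernel_siginfo must be zero or the send is refused.
		 */
		char tail[sizeof(struct siginfo) - sizeof(struct kernel_siginfo)];
		const char __user *p =
			(const char __user *)from + sizeof(struct kernel_siginfo);

		if (copy_from_user(tail, p, sizeof(tail)))
			return -EFAULT;
		return memchr_inv(tail, 0, sizeof(tail)) ? -E2BIG : 0;
	}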

I made an extensive search through userspace code and I could not find anything
that would break because of the above change.  If it turns out I did break
something it will take just the revert of a single change to restore
kernel_siginfo to the same size as userspace siginfo.

Testing did reveal dependencies on preferring the signo passed to sigqueueinfo
over si->signo, so I bit the bullet and added the complexity necessary to handle
that case.

Testing also revealed bad things can happen if a negative signal number is
passed into the system calls.  Something no sane application will do but
something a malicious program or a fuzzer might do.  So I have fixed the code
that performs the bounds checks to ensure negative signal numbers are handled.


There are minor conflicts between this tree and several other trees.
- The x86 tree
- The y2038 tree
- The arm64 tree
- The x86 tip tree

I think only the resolution of the x86 tip tree is at all difficult.  None of
the conflicts are fundamental.  They are all from changes to other parts of the
code that are just close enough to have context conflicts.  The x86 tip tree
conflict actually involves a conflict from removing an unnecessary pkey parameter
on the siginfo side and some small refactoring on the x86 side.

Eric W. Biederman (80):
  signal: Always ignore SIGKILL and SIGSTOP sent to the global init
  signal: Properly deliver SIGILL from uprobes
  signal: Properly deliver SIGSEGV from x86 uprobes
  signal: Always deliver the kernel's SIGKILL and SIGSTOP to a pid 
namespace init
  signal: send_sig_all no longer needs SEND_SIG_FORCED
  signal: Remove the siginfo paramater from kernel_dqueue_signal
  signal: Don't send siginfo to kthreads.
  signal: Never allocate siginfo for SIGKILL or SIGSTOP
  signal: Use SEND_SIG_PRIV not SEND_SIG_FORCED with SIGKILL and SIGSTOP
  signal: Remove SEND_SIG_FORCED
  signal/GenWQE: Fix sending of SIGKILL
  tty_io: Use group_send_sig_info in __do_SACK to note it is a session 
being killed
  signal: Use group_send_sig_info to kill all processes in a pid namespace
  signal: Remove specific_send_sig_info
  signal: Pair exports with their functions
  signal: Simplify tracehook_report_syscall_exit
  signal/x86: Inline fill_sigtrap_info in it's only caller send_sigtrap
  signal/x86: Move MCE error reporting out of force_sig_info_fault
  signal/x86: Use send_sig_mceerr as apropriate
  signal/x86: In trace_mpx_bounds_register_exception add __user annotations
  signal/x86: Move mpx siginfo generation into do_bounds
  signal/x86/traps: Factor out show_signal
  signal/x86/traps: Move more code into do_trap_no_signal so it can be 
reused
  signal/x86/traps: Use force_sig_bnderr
  signal/x86/traps: Use force_sig instead of open coding it.
  signal/x86/traps: Simplify trap generation
  signal/x86: Remove pkey parameter from bad_area_nosemaphore
  signal/x86: Remove the pkey parameter from do_sigbus
  signal/x86: Remove pkey parameter from mm_fault_error
  signal/x86: Don't compute pkey in __do_page_fault
  signal/x86: Pass pkey not vma into __bad_area
  signal/x86: Call force_sig_pkuerr from __bad_area_nosemaphore
  signal/x86: Replace force_sig_info_fault with force_sig_fault
  signal/x86: Pass pkey by value
  signal/x86: Use force_sig_fault where appropriate
  signal/powerpc: 

Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.

2018-10-22 Thread Eric W. Biederman
"Paul E. McKenney"  writes:

> On Sat, Oct 20, 2018 at 04:18:37PM -0400, Alan Stern wrote:
>> On Sat, 20 Oct 2018, Paul E. McKenney wrote:
>> 
>> > The second (informal) litmus test has a more interesting Linux-kernel
>> > counterpart:
>> > 
>> >void t1_interrupt(void)
>> >{
>> >r0 = READ_ONCE(y);
>> >smp_store_release(, 1);
>> >}
>> > 
>> >void t1(void)
>> >{
>> >smp_store_release(, 1);
>> >}
>> > 
>> >void t2(void)
>> >{
>> >r1 = smp_load_acquire();
>> >r2 = smp_load_acquire();
>> >}
>> > 
>> > On store-reordering architectures that implement smp_store_release()
>> > as a memory-barrier instruction followed by a store, the interrupt could
>> > arrive betweentimes in t1(), so that there would be no ordering between
>> > t1_interrupt()'s store to x and t1()'s store to y.  This could (again,
>> > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1.
>> 
>> This is disconcerting only if we assume that t1_interrupt() has to be
>> executed by the same CPU as t1().  If the interrupt could be fielded by
>> a different CPU then the paranoid outcome is perfectly understandable,
>> even in an SC context.
>> 
>> So the question really should be limited to situations where a handler 
>> is forced to execute in the context of a particular thread.  While 
>> POSIX does allow such restrictions for user programs, I'm not aware of 
>> any similar mechanism in the kernel.

> Good point, and I was in fact assuming that t1() and t1_interrupt()
> were executing on the same CPU.
>
> This sort of thing happens naturally in the kernel when both t1()
> and t1_interrupt() are accessing per-CPU variables.

Interrupts have a cpumask of the cpus they may be delivered on.

I believe networking does in fact have places where percpu actions
happen as well as interrupts pinned to a single cpu.  And yes I agree
percpu variables mean that you do not need to pin an interrupt to a
single cpu to cause this to happen.

Eric


Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

2018-10-21 Thread Eric W. Biederman
David Howells  writes:

> From: Al Viro 
>
> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
> attached by move_mount(2).
>
> If by the time of final fput() of OPEN_TREE_CLONE-opened file its tree is
> not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
> to handle detached source.
>
> That gives us equivalents of mount --bind and mount --rbind.

In light of recent conversations about double umount_tree.

Do we want to simply limit ourselves to attaching file->f_path of
a FMODE_NEED_UMOUNT file descriptor and clearing FMODE_NEED_UMOUNT
when it is attached?

Either that, or perhaps move the logic for when to perform the
umount_tree into mntput?

Otherwise I believe we start running into surprising situations:

This works:
ofd = open_tree(...);
dup_fd = openat(ofd, "", O_PATH);
move_mount(dup_fd, ...);
close(ofd);
close(dup_fd);

This should fail:
ofd = open_tree(...);
dup_fd = openat(ofd, "", O_PATH);
close(ofd);
move_mount(dup_fd, ...);
close(dup_fd);

Allowing any file descriptor that points to mnt_ns == NULL without
MNT_UMOUNT set seems like it is adding flexibility for no reason.
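
Concretely, the sort of check I have in mind in do_move_mount() is sketched
below.  It assumes the source struct file is available there as old_file,
which it is not today, so treat this purely as an illustration of the rule,
not as a patch:

	/* A detached tree (mnt_ns == NULL) is only acceptable as a move source
	 * when it came from an OPEN_TREE_CLONE descriptor that still owns the
	 * tree, i.e. FMODE_NEED_UMOUNT is still set on it. */
	if (!old->mnt_ns && !(old_file->f_mode & FMODE_NEED_UMOUNT))
		goto out1;
	/* Otherwise the thing moved must be in our mount namespace. */
	if (old->mnt_ns && !check_mnt(old))
		goto out1;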


> Signed-off-by: Al Viro 
> Signed-off-by: David Howells 
> ---
>
>  fs/namespace.c |   26 --
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index dd38141b1723..caf5c55ef555 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
>  {
>   namespace_lock();
>   lock_mount_hash();
> - mntget(mnt);
> - umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + if (!real_mount(mnt)->mnt_ns) {
> + mntget(mnt);
> + umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + }

^^ This change should be unnecessary.

>   unlock_mount_hash();
>   namespace_unlock();
>  }
> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   struct mount *old;
>   struct mountpoint *mp;
>   int err;
> + bool attached;
>  
>   mp = lock_mount(new_path);
>   err = PTR_ERR(mp);
> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path *old_path, 
> struct path *new_path)
>   p = real_mount(new_path->mnt);
>  
>   err = -EINVAL;
> - if (!check_mnt(p) || !check_mnt(old))
> + /* The mountpoint must be in our namespace. */
> + if (!check_mnt(p))
> + goto out1;
> + /* The thing moved should be either ours or completely unattached. */
> + if (old->mnt_ns && !check_mnt(old))
>   goto out1;

^^^

!old->mnt_ns  should only be allowed when it comes from a file
 descriptor with FMODE_NEED_UMOUNT.


> - if (!mnt_has_parent(old))
> + attached = mnt_has_parent(old);
> + /*
> +  * We need to allow open_tree(OPEN_TREE_CLONE) followed by
> +  * move_mount(), but mustn't allow "/" to be moved.
> +  */
> + if (old->mnt_ns && !attached)
>   goto out1;
>  
>   if (old->mnt.mnt_flags & MNT_LOCKED)
> @@ -2421,7 +2433,7 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   /*
>* Don't move a mount residing in a shared parent.
>*/
> - if (IS_MNT_SHARED(old->mnt_parent))
> + if (attached && IS_MNT_SHARED(old->mnt_parent))
>   goto out1;
>   /*
>* Don't move a mount tree containing unbindable mounts to a destination
> @@ -2435,7 +2447,7 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   goto out1;
>  
>   err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
> -_path);
> +attached ? _path : NULL);
>   if (err)
>   goto out1;

^^^
Somewhere around here we should have code that clears FMODE_NEED_UMOUNT.


> @@ -3121,6 +3133,8 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char 
> __user *, dir_name,
>  
>  /*
>   * Move a mount from one place to another.
> + * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
> + * used to copy a mount subtree.
>   *
>   * Note the flags value is a combination of MOVE_MOUNT_* flags.
>   */


Re: [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #12]

2018-10-21 Thread Eric W. Biederman
David Howells  writes:

> From: Al Viro 
>
> open_tree(dfd, pathname, flags)
>
> Returns an O_PATH-opened file descriptor or an error.
> dfd and pathname specify the location to open, in usual
> fashion (see e.g. fstatat(2)).  flags should be an OR of
> some of the following:
>   * AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
> same meanings as usual
>   * OPEN_TREE_CLOEXEC - make the resulting descriptor
> close-on-exec
>   * OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
> instead of opening the location in question, create a detached
> mount tree matching the subtree rooted at location specified by
> dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
> without it - only the part within in the mount containing the
> location in question.  In other words, the same as mount --rbind
> or mount --bind would've taken.  The detached tree will be
> dissolved on the final close of obtained file.  Creation of such
> detached trees requires the same capabilities as doing mount --bind.


What happens when mounts propagate to such a detached mount tree?

It looks to me like the test in propagate_one for setting
CL_UNPRIVILEGED will trigger a NULL pointer dereference:

AKA:
/* Notice when we are propagating across user namespaces */
if (m->mnt_ns->user_ns != user_ns)
type |= CL_UNPRIVILEGED;

Since we don't know which mount namespace, if any, this subtree is going
into, the test should become:

if (!m->mnt_ns || (m->mnt_ns->user_ns != user_ns))
type |= CL_UNPRIVILEGED;

If the tree is never attached anywhere it won't hurt as we don't allow
mounts or umounts on the detached subtree.  We don't have enough
information about the namespace we copied from to know whether it would
cause CL_UNPRIVILEGED to be set on any given propagation.  Similarly we
don't have any information at all about the mount namespace this tree
may possibly be copied to.

Eric


> Signed-off-by: Al Viro 
> Signed-off-by: David Howells 
> cc: linux-...@vger.kernel.org
> ---
>
>  arch/x86/entry/syscalls/syscall_32.tbl |1 
>  arch/x86/entry/syscalls/syscall_64.tbl |1 
>  fs/file_table.c|9 +-
>  fs/internal.h  |1 
>  fs/namespace.c |  132 
> +++-
>  include/linux/fs.h |7 +-
>  include/linux/syscalls.h   |1 
>  include/uapi/linux/fcntl.h |2 
>  include/uapi/linux/mount.h |   10 ++
>  9 files changed, 137 insertions(+), 27 deletions(-)
>  create mode 100644 include/uapi/linux/mount.h
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..ea1b413afd47 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384  i386arch_prctl  sys_arch_prctl  
> __ia32_compat_sys_arch_prctl
>  385  i386io_pgetevents   sys_io_pgetevents   
> __ia32_compat_sys_io_pgetevents
>  386  i386rseqsys_rseq
> __ia32_sys_rseq
> +387  i386open_tree   sys_open_tree   
> __ia32_sys_open_tree
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..0545bed581dc 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332  common  statx   __x64_sys_statx
>  333  common  io_pgetevents   __x64_sys_io_pgetevents
>  334  common  rseq__x64_sys_rseq
> +335  common  open_tree   __x64_sys_open_tree
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/file_table.c b/fs/file_table.c
> index e49af4caf15d..e03c8d121c6c 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -255,6 +255,7 @@ static void __fput(struct file *file)
>   struct dentry *dentry = file->f_path.dentry;
>   struct vfsmount *mnt = file->f_path.mnt;
>   struct inode *inode = file->f_inode;
> + fmode_t mode = file->f_mode;
>  
>   if (unlikely(!(file->f_mode & FMODE_OPENED)))
>   goto out;
> @@ -277,18 +278,20 @@ static void __fput(struct file *file)
>   if (file->f_op->release)
>   file->f_op->release(inode, file);
>   if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
> -  !(file->f_mode & FMODE_PATH))) {
> +  !(mode & FMODE_PATH))) {
>   cdev_put(inode->i_cdev);
>   }
>   fops_put(file->f_op);
>   put_pid(file->f_owner.pid);
> - if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
> + if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
>   i_readcount_dec(inode);
> - if (file->f_mode & FMODE_WRITER) {
> + 

Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

2018-10-17 Thread Eric W. Biederman
David Howells  writes:

> Eric W. Biederman  wrote:
>
>> David's check will work for bind mounts as well.  It just won't work
>> for files or subdirectories of some mountpoint.
>
> Bind mounts have to be done with open_tree() and move_mount().  You can't now
> do fsmount() on something fspick()'d.

But a bind mount will have mnt_root set to the dentry that was
bound.

Therefore fspick, as you are proposing to modify it, will work for the
root of bind mounts as well as for the root of regular mounts.  My
apologies for not being clear.

Eric



Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

2018-10-17 Thread Eric W. Biederman
Alan Jenkins  writes:

> On 17/10/2018 14:20, David Howells wrote:
>> David Howells  wrote:
>>
>>> I should probably check that the picked point is actually a mountpoint.
>> The root of the mount object at the path specified, that is, perhaps with
>> something like the attached.
>>
>> David
>
>
> I agree.  I'm happy to see this is using the same check as do_remount().
>
>
> * change filesystem flags. dir should be a physical root of filesystem.
> * If you've mounted a non-root directory somewhere and want to do remount
> * on it - tough luck.
> */

David's check will work for bind mounts as well.  It just won't work
for files or subdirectories of some mountpoint.

Eric

>> ---
>> diff --git a/fs/fsopen.c b/fs/fsopen.c
>> index f673e93ac456..a17a233c 100644
>> --- a/fs/fsopen.c
>> +++ b/fs/fsopen.c
>> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, 
>> path, unsigned int, flags
>>  if (ret < 0)
>>  goto err;
>>   +  ret = -EINVAL;
>> +if (target.mnt->mnt_root != target.dentry)
>> +goto err_path;
>> +
>>  fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>>  0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>>  if (IS_ERR(fc)) {
>>


Re: [PATCH v3 1/2] sysctl: handle overflow in proc_get_long

2018-10-16 Thread Eric W. Biederman
Christian Brauner  writes:

> proc_get_long() is a funny function. It uses simple_strtoul() and for a
> good reason. proc_get_long() wants to always succeed the parse and return
> the maybe incorrect value and the trailing characters to check against a
> pre-defined list of acceptable trailing values.
> However, simple_strtoul() explicitly ignores overflows which can cause
> funny things like the following to happen:
>
> echo 18446744073709551616 > /proc/sys/fs/file-max
> cat /proc/sys/fs/file-max
> 0
>
> (Which will cause your system to silently die behind your back.)
>
> On the other hand kstrtoul() does do overflow detection but does not return
> the trailing characters, and also fails the parse when anything other than
> '\n' is a trailing character whereas proc_get_long() wants to be more
> lenient.
>
> Now, before adding another kstrtoul() function let's simply add a static
> parse strtoul_lenient() which:
> - fails on overflow with -ERANGE
> - returns the trailing characters to the caller
>
> The reason why we should fail on ERANGE is that we already do a partial
> fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So we
> already reject values such as 184467440737095516160 (21 chars) but accept
> values such as 18446744073709551616 (20 chars) but both are overflows. So
> we should just always reject 64bit overflows and not special-case this
> based on the number of chars.
>
> Acked-by: Kees Cook 
> Signed-off-by: Christian Brauner 
> Signed-off-by: Christian Brauner 
> ---
> v2->v3:
> - (Kees) s/#include <../lib/kstrtox.h>/#include "../lib/kstrtox.h"/g
> - (Kees) document strtoul_lenient()
>
> v1->v2:
> - s/sysctl_cap_erange/sysctl_lenient/g
> - consistenly fail on overflow
>
> v0->v1:
> - s/sysctl_strtoul_lenient/strtoul_cap_erange/g
> - (Al) remove bool overflow return argument from strtoul_cap_erange
> - (Al) return ULONG_MAX on ERANGE from strtoul_cap_erange
> - (Dominik) fix spelling in commit message
> ---
>  kernel/sysctl.c | 40 +++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cc02050fd0c4..102aa7a65687 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -68,6 +68,8 @@
>  #include 
>  #include 
>  
> +#include "../lib/kstrtox.h"
> +
>  #include 
>  #include 
>  
> @@ -2065,6 +2067,41 @@ static void proc_skip_char(char **buf, size_t *size, 
> const char v)
>   }
>  }
>  
> +/**
> + * strtoul_lenient - parse an ASCII formatted integer from a buffer and only
> + *   fail on overflow
> + *
> + * @cp: kernel buffer containing the string to parse
> + * @endp: pointer to store the trailing characters
> + * @base: the base to use
> + * @res: where the parsed integer will be stored
> + *
> + * In case of success 0 is returned and @res will contain the parsed integer,
> + * @endp will hold any trailing characters.
> + * This function will fail the parse on overflow. If there wasn't an overflow
> + * the function will defer the decision what characters count as invalid to 
> the
> + * caller.
> + */
> +static int strtoul_lenient(const char *cp, char **endp, unsigned int base,
> +unsigned long *res)
> +{
> + unsigned long long result;
> + unsigned int rv;
> +
> + cp = _parse_integer_fixup_radix(cp, );
> + rv = _parse_integer(cp, base, );
> + if ((rv & KSTRTOX_OVERFLOW) || (result != (unsigned long)result))
> + return -ERANGE;
> +
> + cp += rv;
> +
> + if (endp)
> + *endp = (char *)cp;
> +
> + *res = (unsigned long)result;
> + return 0;
> +}
> +
>  #define TMPBUFLEN 22
>  /**
>   * proc_get_long - reads an ASCII formatted integer from a user buffer
> @@ -2108,7 +2145,8 @@ static int proc_get_long(char **buf, size_t *size,
>   if (!isdigit(*p))
>   return -EINVAL;
>  
> - *val = simple_strtoul(p, , 0);
> + if (strtoul_lenient(p, , 0, val))
> + return -EINVAL;

Is it deliberate that on an error strtoul_lenient returns -ERANGE but
then proc_get_long returns -EINVAL?  That feels wrong.  The write system
call does not permit -ERANGE or -EINVAL for the contents of the data
so both options appear equally bad from a standards point of view.

I am just wondering what the thinking is here.

>   len = p - tmp;


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Enke Chen  writes:

> Hi, Eric:
>
> On 10/15/18 4:28 PM, Eric W. Biederman wrote:

>> With that said I think the best solution would be to figure out how to
>> allow the coredump to run in parallel with the usual exit signal, and
>> exit code reaping of the process> 
>> That would solve the problem for everyone, and would not introduce any
>> new complicated APIs.
>
> That would certainly help. But given the huge deployment of Linux, I don't
> think it would be feasible to change this fundamental behavior (signal post
> coredump).

Of course it will be feasible to change.  Make it a sysctl and keep the
current default and no one will even notice.  Waiting for something that
is happening asynchronously is not difficult, so having the wait
optional should not be a problem.

Right now the default in most distributions is to disable core dumps
entirely.   Which means that you are going to have to find a very
specific situation in which people and applications care about core
dumps happening to break an existing setup.

Then all you have to do to get the non-blocking behavior is to just do:
echo 1 > /proc/sys/kernel_core_async
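
All the plumbing such a knob needs is on the order of the sketch below (the
name and placement are illustrative only, not an accepted interface):

	static int core_dump_async;		/* 0 = today's synchronous behaviour */
	static int zero;
	static int one = 1;

	static struct ctl_table core_async_table[] = {
		{
			.procname	= "core_async",
			.data		= &core_dump_async,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec_minmax,
			.extra1		= &zero,
			.extra2		= &one,
		},
		{ }
	};
	/* registered under kernel/ with register_sysctl("kernel", core_async_table) */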

Then everything else works without modifications and everyone is happy.
Maybe I am wearing rose colored glasses but that looks like all that is
needed and it should be much easier to work with and maintain than
having to modify every manager process to listen for unreliable signals,
and then take action on those unreliable signals.

Eric


Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 10/15, Enke Chen wrote:
>>
>> > I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
>
> you are going to change the rules anyway.

I will just add that CLD_XXX is only valid with SIGCHLD as they are
signal specific si_codes.  In conjunction with another signal like
SIGUSR it will have another meaning.  I would really appreciate it
if new code does not further complicate siginfo_layout.

>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
>
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.

We do best effort queueing but we don't guarantee anything.  So yes
this makes signals a very lousy interface for sending this kind of
information.
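
A quick userspace illustration of the coalescing (nothing kernel specific):
two SIGUSR1s raised while the signal is blocked collapse into a single
delivery, which is exactly the information loss being discussed.

	#include <signal.h>
	#include <stdio.h>

	static volatile sig_atomic_t hits;

	static void handler(int sig)
	{
		(void)sig;
		hits++;
	}

	int main(void)
	{
		sigset_t set;

		signal(SIGUSR1, handler);
		sigemptyset(&set);
		sigaddset(&set, SIGUSR1);

		sigprocmask(SIG_BLOCK, &set, NULL);	/* hold delivery */
		raise(SIGUSR1);
		raise(SIGUSR1);				/* coalesced with the first */
		sigprocmask(SIG_UNBLOCK, &set, NULL);	/* now deliver */

		printf("handler ran %d time(s)\n", (int)hits);	/* prints 1 */
		return 0;
	}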

>> >>   if (sig_kernel_coredump(signr)) {
>> >> + /*
>> >> +  * Notify the parent prior to the coredump if the
>> >> +  * parent is interested in such a notificaiton.
>> >> +  */
>> >> + int p_sig = current->real_parent->predump_signal;
>> >> +
>> >> + if (valid_predump_signal(p_sig)) {
>> >> + read_lock(&tasklist_lock);
>> >> + do_notify_parent_predump(current);
>> >> + read_unlock(&tasklist_lock);
>> >> + cond_resched();
>> >
>> > perhaps this should be called by do_coredump() after coredump_wait() kills
>> > all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
>
> Why?
>
> Once again, other threads are still alive. So if the parent restarts the 
> service
> after it recieves -predump_signal, the new process can "race" with the old 
> thread.

Yes.  It isn't until do_coredump calls coredump_wait that all of the
threads are killed.

Eric



Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Eric W. Biederman
Enke Chen  writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.

You talk about time-sensitive and then you talk about bash scripts.
I don't think your definition of time-sensitive and my definition match.

With that said I think the best solution would be to figure out how to
allow the coredump to run in parallel with the usual exit signal, and
exit code reaping of the process.

That would solve the problem for everyone, and would not introduce any
new complicated APIs.

Short of that, having the prctl in the process that receives the signals,
as you are doing, is the right way to go.
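
For reference, with the prctl proposed here the parent would opt in with
something like this (a sketch; PR_SET_PREDUMP_SIG exists only in this
patch, and the constant below is a stand-in for whatever value the
patch's prctl.h assigns):

#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_PREDUMP_SIG
#define PR_SET_PREDUMP_SIG 55		/* placeholder value */
#endif

int main(void)
{
	/* Ask for SIGUSR2 before any child's coredump starts; per the
	 * patch only SIGCHLD, SIGUSR1 and SIGUSR2 are accepted. */
	if (prctl(PR_SET_PREDUMP_SIG, SIGUSR2, 0, 0, 0) < 0)
		perror("prctl(PR_SET_PREDUMP_SIG)");
	return 0;
}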

You are however calling do_notify_parent_predump from the wrong
function, and frankly with the wrong locking.  There are multiple paths
to the do_coredump function so you really want this notification from
do_coredump.

But I still think it would be better to solve the root cause problem and
change the coredump logic to be able to run in parallel with the normal
exit notification and zombie reaping logic.  Then the problem you are
trying to solve goes away and everyone's code gets simpler.

Eric

> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process EventPOSIX SignalConnector-based
> --
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_STOPPED
>
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>  SIGCHLD / CLD_CONTINUED
>
> pre_coredump/N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/   do_notify_parent()  proc_exit_connector()
> do_exit()SIGCHLD / exit_signal
> --
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen 
> ---
>  arch/x86/kernel/signal_compat.c|  2 +-
>  include/linux/sched.h  |  4 ++
>  include/linux/signal.h |  5 +++
>  include/uapi/asm-generic/siginfo.h |  3 +-
>  include/uapi/linux/prctl.h |  4 ++
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 51 +
>  kernel/sys.c   | 77 
> ++
>  8 files changed, 145 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
>   BUILD_BUG_ON(NSIGSEGV != 7);
>   BUILD_BUG_ON(NSIGBUS  != 5);
>   BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
>   BUILD_BUG_ON(NSIGSYS  != 1);
>  
>   /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   int exit_signal;
>   /* The signal sent when the parent dies: */
>   int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
>   /* JOBCTL_*, siglock protected: */
>   unsigned long   jobctl;
>  
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
>   return sig 

Re: [RFC] Allow user namespace inside chroot

2018-10-15 Thread Eric W. Biederman
Jann Horn  writes:

> On Mon, Oct 15, 2018 at 7:10 PM  wrote:
>> @@ -1281,7 +1285,7 @@ static int userns_install(struct nsproxy *nsproxy, 
>> struct ns_common *ns)
>> return -ENOMEM;
>>
>> put_user_ns(cred->user_ns);
>> -   set_cred_user_ns(cred, get_user_ns(user_ns));
>> +   set_cred_user_ns(cred, get_user_ns(user_ns), 0);
>
> This looks bogus. With this, I think your restriction can be bypassed
> if process A forks a child B, B creates a new user namespace, then A
> enters the user namespace with setns() and has full capabilities. Am I
> missing something?

Nope.  I feel silly for missing that angle.

Even without the full capabilities the userns_install angle will place
you at the root of the mount namespace outside of the chroot.

At which point I have visions of the special cases multiplying like
bunnies to make this work.  Without a very strong case I don't like this at all.

Eric





Re: [RFC] Allow user namespace inside chroot

2018-10-15 Thread Eric W. Biederman


Have you considered using pivot_root to drop all of the pieces of the
filesystem you don't want to be visible?  That should be a much better
solution overall.

It is just a matter of:
mount --bind /path/you/would/chroot/to /path/you/would/chroot/to
pivot_root /path/you/would/chroot/to /put/old
umount -l /put/old

You might need to do something like mount --make-rprivate / before calling
pivot_root to stop mount propagation to the parent.  But I can't
imagine that being a practical problem.
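
Spelled out end to end it looks something like this (a sketch only; the
paths are placeholders, and put_old must be a directory underneath the
new root, as pivot_root requires):

unshare -m sh -e <<'EOF'
NEW_ROOT=/path/you/would/chroot/to
mount --make-rprivate /                 # stop propagation back to the parent
mount --bind "$NEW_ROOT" "$NEW_ROOT"    # pivot_root wants a mount point
cd "$NEW_ROOT"
mkdir -p put_old
pivot_root . put_old                    # the old root shows up at /put_old
umount -l /put_old                      # and is detached again
EOF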


Also note that being in a chroot tends to indicate one of two things,
being in an old build system, or being in some kind of chroot jail.
Because of the jails created with chroot we want to be very careful
with enabling user namespaces in that context.

There have been some very clever people figuring out how to get out of
chroot jails by passing file descriptors between processes and using
things like pivot root.

Even if your analysis is semantically perfect there is the issue of
increasing the attack surface of preexisting chroot jails.  I believe
that would make the kernel more vulnerable overall, and for only
a very small simplification of implementation details.

So unless I am missing something I don't see the use case for this that
would not be better served by just properly setting up your mount
namespace, and the attack surface increase of chroot jails makes me
very reluctant to see a change like this.

Eric

nagarathnam.muthus...@oracle.com writes:

> From: Nagarathnam Muthusamy 
>
> Following commit disables the creation of user namespace inside
> the chroot environment.
>
> userns: Don't allow creation if the user is chrooted
>
> commit 3151527ee007b73a0ebd296010f1c0454a919c7d
>
> Consider a system in which a non-root user creates a combination
> of user, pid and mount namespaces and confines a process to it.
> The system will have multiple levels of nested namespaces.
> The root namespace in the system will have lots of directories
> which should not be exposed to the child confined to the set of
> namespaces.
>
> Without chroot, we will have to hide all unwanted directories
> individually using bind mounts and mount namespace. Chroot enables
> us to expose a handpicked list of directories which the child
> can see but if we use chroot we wont be able to create nested
> namespaces.
>
> Allowing a process to create user namespace within a chroot
> environment will enable it to chroot, which in turn can be used
> to escape the jail.
>
> This patch drops the chroot privilege when user namespace is
> created within the chroot environment so the process cannot
> use it to escape the chroot jail. The process can still modify
> the view of the file system using mount namespace but for those
> modifications to be useful, it needs to run a setuid program with
> that intented uid directly mapped into the user namespace as it is
> which is not possible for an unprivileged process.
>
> If there were any other corner cases which were considered while
> deciding to disable the creation of user namespace as a whole
> within the chroot environment please let me know.
>
> Signed-off-by: Nagarathnam Muthusamy
> ---
>  kernel/user_namespace.c | 22 +-
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index e5222b5..83d2a70 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -44,7 +44,7 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>   return dec_ucount(ucounts, UCOUNT_USER_NAMESPACES);
>  }
>  
> -static void set_cred_user_ns(struct cred *cred, struct user_namespace 
> *user_ns)
> +static void set_cred_user_ns(struct cred *cred, struct user_namespace 
> *user_ns, int is_chrooted)
>  {
>   /* Start with the same capabilities as init but useless for doing
>* anything as the capabilities are bound to the new user namespace.
> @@ -55,6 +55,11 @@ static void set_cred_user_ns(struct cred *cred, struct 
> user_namespace *user_ns)
>   cred->cap_effective = CAP_FULL_SET;
>   cred->cap_ambient = CAP_EMPTY_SET;
>   cred->cap_bset = CAP_FULL_SET;
> + if (is_chrooted) {
> + cap_lower(cred->cap_permitted, CAP_SYS_CHROOT);
> + cap_lower(cred->cap_effective, CAP_SYS_CHROOT);
> + cap_lower(cred->cap_bset, CAP_SYS_CHROOT);
> + }
>  #ifdef CONFIG_KEYS
>   key_put(cred->request_key_auth);
>   cred->request_key_auth = NULL;
> @@ -78,6 +83,7 @@ int create_user_ns(struct cred *new)
>   kgid_t group = new->egid;
>   struct ucounts *ucounts;
>   int ret, i;
> + int is_chrooted = 0;
>  
>   ret = -ENOSPC;
>   if (parent_ns->level > 32)
> @@ -88,14 +94,12 @@ int create_user_ns(struct cred *new)
>   goto fail;
>  
>   /*
> -  * Verify that we can not violate the policy of which files
> -  * may be accessed that is specified by the root directory,
> -  * by verifing that the root directory is at the root of the
> -  * mount 

Re: linux-next: manual merge of the userns tree with the tip tree

2018-10-14 Thread Eric W. Biederman
Stephen Rothwell  writes:

> Hi all,
>
> On Mon, 15 Oct 2018 15:11:59 +1100 Stephen Rothwell  
> wrote:
>>
>> Today's linux-next merge of the userns tree got a conflict in:
>> 
>>   arch/x86/mm/fault.c
>> 
>> between commit:
>> 
>>   164477c2331b ("x86/mm: Clarify hardware vs. software "error_code"")
>> (and others from that series)
>> 
>> from the tip tree and commits:
>> 
>>   768fd9c69bb5 ("signal/x86: Remove pkey parameter from 
>> bad_area_nosemaphore")
>>   25c102d803ea ("signal/x86: Remove pkey parameter from mm_fault_error")
>> 
>> from the userns tree.
>> 
>> I fixed it up (I think - see below) and can carry the fix as
>> necessary. This is now fixed as far as linux-next is concerned, but any
>> non trivial conflicts should be mentioned to your upstream maintainer
>> when your tree is submitted for merging.  You may also want to consider
>> cooperating with the maintainer of the conflicting tree to minimise any
>> particularly complex conflicts.
>> 
>> -- 
>> Cheers,
>> Stephen Rothwell
>> 
>> diff --cc arch/x86/mm/fault.c
>> index c2e3e5127ebc,8d77700a7883..
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>>  +/* Handle faults in the user portion of the address space */
>>  +static inline
>>  +void do_user_addr_fault(struct pt_regs *regs,
>>  +   unsigned long hw_error_code,
>>  +   unsigned long address)
>>  +{
>>  +   unsigned long sw_error_code;
>>  +   struct vm_area_struct *vma;
>>  +   struct task_struct *tsk;
>>  +   struct mm_struct *mm;
>>  +   vm_fault_t fault, major = 0;
>>  +   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>>  +   u32 pkey;
>
> I missed removing the above line.

Yes.  At first glance with the above change it looks like you got it.

Eric



Re: [tip:x86/mm] x86/mm: Break out user address space handling

2018-10-14 Thread Eric W. Biederman
tip-bot for Dave Hansen  writes:

> Commit-ID:  aa37c51b9421d66f7931c5fdcb9ce80c450974be
> Gitweb: 
> https://git.kernel.org/tip/aa37c51b9421d66f7931c5fdcb9ce80c450974be
> Author: Dave Hansen 
> AuthorDate: Fri, 28 Sep 2018 09:02:23 -0700
> Committer:  Peter Zijlstra 
> CommitDate: Tue, 9 Oct 2018 16:51:15 +0200
>
> x86/mm: Break out user address space handling
>
> The last patch broke out kernel address space handing into its own
> helper.  Now, do the same for user address space handling.
>
> Cc: x...@kernel.org
> Cc: Jann Horn 
> Cc: Sean Christopherson 
> Cc: Thomas Gleixner 
> Cc: Andy Lutomirski 
> Signed-off-by: Dave Hansen 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: http://lkml.kernel.org/r/20180928160223.9c4f6...@viggo.jf.intel.com
> ---
>  arch/x86/mm/fault.c | 47 ---
>  1 file changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index c7e32f453852..0d1f5d39fc63 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -966,6 +966,7 @@ bad_area_access_error(struct pt_regs *regs, unsigned long 
> error_code,
>   __bad_area(regs, error_code, address, vma, SEGV_ACCERR);
>  }
>  
> +/* Handle faults in the kernel portion of the address space */
   ^^
I believe you mean the __user__ portion of the address space.
Given that the call chain is:

do_user_addr_fault
   handle_mm_fault
  do_sigbus  

>  static void
>  do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long 
> address,
> u32 *pkey, unsigned int fault)
> @@ -1254,14 +1255,11 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned 
> long hw_error_code,
>  }
>  NOKPROBE_SYMBOL(do_kern_addr_fault);
>  
> -/*
> - * This routine handles page faults.  It determines the address,
> - * and the problem, and then passes it off to one of the appropriate
> - * routines.
> - */
> -static noinline void
> -__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
> - unsigned long address)
> +/* Handle faults in the user portion of the address space */
> +static inline
> +void do_user_addr_fault(struct pt_regs *regs,
> + unsigned long hw_error_code,
> + unsigned long address)
>  {
>   unsigned long sw_error_code;
>   struct vm_area_struct *vma;
> @@ -1274,17 +1272,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> hw_error_code,
>   tsk = current;
>   mm = tsk->mm;
>  
> - prefetchw(&mm->mmap_sem);
> -
> - if (unlikely(kmmio_fault(regs, address)))
> - return;
> -
> - /* Was the fault on kernel-controlled part of the address space? */
> - if (unlikely(fault_in_kernel_space(address))) {
> - do_kern_addr_fault(regs, hw_error_code, address);
> - return;
> - }
> -
>   /* kprobes don't want to hook the spurious faults: */
>   if (unlikely(kprobes_fault(regs)))
>   return;
> @@ -1488,6 +1475,28 @@ good_area:
>  
>   check_v8086_mode(regs, address, tsk);
>  }
> +NOKPROBE_SYMBOL(do_user_addr_fault);
> +
> +/*
> + * This routine handles page faults.  It determines the address,
> + * and the problem, and then passes it off to one of the appropriate
> + * routines.
> + */
> +static noinline void
> +__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
> + unsigned long address)
> +{
> + prefetchw(&current->mm->mmap_sem);
> +
> + if (unlikely(kmmio_fault(regs, address)))
> + return;
> +
> + /* Was the fault on kernel-controlled part of the address space? */
> + if (unlikely(fault_in_kernel_space(address)))
> + do_kern_addr_fault(regs, hw_error_code, address);
> + else
> + do_user_addr_fault(regs, hw_error_code, address);
> +}
>  NOKPROBE_SYMBOL(__do_page_fault);
>  
>  static nokprobe_inline void

Eric


Re: [LKP] 601d5abfea [ 13.686356] BUG: unable to handle kernel paging request at 34ca027e

2018-10-12 Thread Eric W. Biederman
kernel test robot  writes:

> Greetings,
>
> 0day kernel testing robot got the below dmesg and the first bad commit is
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
> siginfo-next
>

This is an odd one.  This is the first time I recall the lkp test robot
emailing me and telling me that I have fixed an issue.  I don't mind; I
am just a little surprised.

Recently it was discovered that on my branch passing a large negative signal
number to rt_sigqueueinfo could cause an oops.  The underlying issues
were fixed by:

b2a2ab527d6d ("signal: Guard against negative signal numbers in 
copy_siginfo_from_user")
a36700589b85 ("signal: Guard against negative signal numbers in 
copy_siginfo_from_user32")

It appears that this issue was found in linux-next before these fixes
were applied.  And then the top of my siginfo-next branch where these
fixes exist was tested.

I have no problem with that sequence of events; it is just a little
surprising.

If I have not read this test report correctly please let me know.

Eric

> commit 601d5abfeaf244b86bb68c1e05c6e0d57be2f6b0
> Author: Eric W. Biederman 
> AuthorDate: Fri Oct 5 09:02:48 2018 +0200
> Commit: Eric W. Biederman 
> CommitDate: Mon Oct 8 09:35:26 2018 +0200
>
> signal: In sigqueueinfo prefer sig not si_signo
> 
> Andrei Vagin  reported:
> 
> > Accoding to the man page, the user should not set si_signo, it has to 
> be set
> > by kernel.
> >
> > $ man 2 rt_sigqueueinfo
> >
> > The uinfo argument specifies the data to accompany  the  signal.   
> This
> >argument  is  a  pointer to a structure of type siginfo_t, 
> described in
> >sigaction(2) (and defined  by  including  ).   The  
> caller
> >should set the following fields in this structure:
> >
> >si_code
> >   This  must  be  one of the SI_* codes in the Linux kernel 
> source
> >   file include/asm-generic/siginfo.h, with  the  
> restriction  that
> >   the  code  must  be  negative (i.e., cannot be SI_USER, 
> which is
> >   used by the kernel to indicate a signal  sent  by  
> kill(2))  and
> >   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used 
> by the
> >   kernel to indicate a signal sent using tgkill(2)).
> >
> >si_pid This should be set to a process ID, typically the process 
> ID  of
> >   the sender.
> >
> >si_uid This  should  be set to a user ID, typically the real 
> user ID of
> >   the sender.
> >
> >si_value
> >   This field contains the user data to accompany the 
> signal.   For
> >   more information, see the description of the last (union 
> sigval)
> >   argument of sigqueue(3).
>     >
> >Internally, the kernel sets the si_signo field to the  value  
> specified
> >in  sig,  so that the receiver of the signal can also obtain the 
> signal
> >number via that field.
> >
> > On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
> >>
> >> If there is some application that calls sigqueueinfo directly that has
> >> a problem with this added sanity check we can revisit this when we see
> >> what kind of crazy that application is doing.
> >
> >
> > I already know two "applications" ;)
> >
> > 
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> > 
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
> >
> > Disclaimer: I'm the author of both of them.
> 
> Looking at the kernel code the historical behavior has alwasy been to 
> prefer
> the signal number passed in by the kernel.
> 
> So sigh.  Implmenet __copy_siginfo_from_user and 
> __copy_siginfo_from_user32 to
> take that signal number and prefer it.  The user of ptrace will still
> use copy_siginfo_from_user and copy_siginfo_from_user32 as they do not and
> never have had a signal number there.
> 
> Luckily this change has never made it farther than linux-next.
> 
> Fixes: e75dc036c445 ("signal: Fail sigqueueinfo if si_signo != sig")
> Reported-by: Andrei Vagin 
> Tested-by: Andrei Vagin 
> Signed-off-by: "Eric W. Biederm

Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

2018-10-11 Thread Eric W. Biederman
David Howells  writes:

> Okay, this appears to fix the cycle-creation problem.
>
> It could probably be improved by comparing sequence numbers as Alan suggests,
> but I need to work out how to get at that.

It should just be a matter of replacing the test
"if (p->mnt.mnt_sb->s_type == )" with "if mnt_ns_loop(p->mnt.mnt_root)"

That would allow reusing 100% of the existing logic, and remove the need
to export file_system_type nsfs;

As your test exists below it will reject a lot more than mount namespace
file descriptors.  It will reject file descriptors for every other
namespace as well.
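
Roughly, what I mean is something like this (just a sketch on top of your
patch; mnt_ns_loop() is the existing helper in fs/namespace.c, and it only
triggers on mount namespace files):

static bool check_for_nsfs_mounts(struct mount *subtree)
{
	struct mount *p;
	bool ret = false;

	lock_mount_hash();
	for (p = subtree; p; p = next_mnt(p, subtree))
		if (mnt_ns_loop(p->mnt.mnt_root))
			goto out;
	ret = true;
out:
	unlock_mount_hash();
	return ret;
}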

Eric

> ---
> commit 069c3376f7849044117c866aeafbb1a525f84926
> Author: David Howells 
> Date:   Thu Oct 4 23:18:59 2018 +0100
>
> fixes
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 17029b30e196..47a6c80c3c51 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -172,6 +172,7 @@ extern void mnt_pin_kill(struct mount *m);
>   * fs/nsfs.c
>   */
>  extern const struct dentry_operations ns_dentry_operations;
> +extern struct file_system_type nsfs;
>  
>  /*
>   * fs/ioctl.c
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e969ded7d54b..25ecd8b3c76b 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2388,6 +2388,27 @@ static inline int tree_contains_unbindable(struct 
> mount *mnt)
>   return 0;
>  }
>  
> +/*
> + * Object if there are any nsfs mounts in the specified subtree.  These can 
> act
> + * as pins for mount namespaces that aren't checked by the mount-cycle 
> checking
> + * code, thereby allowing cycles to be made.
> + */
> +static bool check_for_nsfs_mounts(struct mount *subtree)
> +{
> + struct mount *p;
> + bool ret = false;
> +
> + lock_mount_hash();
> + for (p = subtree; p; p = next_mnt(p, subtree))
> + if (p->mnt.mnt_sb->s_type == &nsfs)
> + goto out;
> +
> + ret = true;
> +out:
> + unlock_mount_hash();
> + return ret;
> +}
> +
>  static int do_move_mount(struct path *old_path, struct path *new_path)
>  {
>   struct path parent_path = {.mnt = NULL, .dentry = NULL};
> @@ -2442,6 +2463,8 @@ static int do_move_mount(struct path *old_path, struct 
> path *new_path)
>   if (IS_MNT_SHARED(p) && tree_contains_unbindable(old))
>   goto out1;
>   err = -ELOOP;
> + if (!check_for_nsfs_mounts(old))
> + goto out1;
>   for (; mnt_has_parent(p); p = p->mnt_parent)
>   if (p == old)
>   goto out1;
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index f069eb6495b0..d3abcd5c2a23 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -269,7 +269,7 @@ static struct dentry *nsfs_mount(struct file_system_type 
> *fs_type,
>   return mount_pseudo(fs_type, "nsfs:", &nsfs_ops,
>   &ns_dentry_operations, NSFS_MAGIC);
>  }
> -static struct file_system_type nsfs = {
> +struct file_system_type nsfs = {
>   .name = "nsfs",
>   .mount = nsfs_mount,
>   .kill_sb = kill_anon_super,


Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Sean Christopherson  writes:
>
>> On Wed, Oct 10, 2018 at 05:06:52PM -0500, Eric W. Biederman wrote:
>>> ebied...@xmission.com (Eric W. Biederman) writes:
>>> 
>>> > So I am flummoxed.  I am reading through the code and I don't see
>>> > anything that could trigger this, and when I ran the supplied reproducer
>>> > it did not reproduce for me.
>>> 
>>> Even more so.  With my tool chain the line that reports the failing
>>> address is impossible.
>>> 
>>> [   73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0
>>> 
>>> With the supplied configureation my tool chain only has 0x30 bytes for
>>> all of copy_siginfo_from_user.  So I can't even begin to guess where
>>> in that function things are failing.
>>> 
>>> Any additional information that you can provide would be a real help
>>> in tracking down this strange failure.
>>
>> I don't have the exact toolchain, but I was able to get somewhat close
>> and may have found a smoking gun.  0x4d in my build is in the general
>> vicinity of "sig_sicodes[sig].limit" in known_siginfo_layout().  This
>> lines up with the register state from the log, e.g. RDI=0500104d8,
>> which is the mask generated by sig_specific_sicodes.  From what I can
>> tell, @sig is never bounds checked.  If the compiler generated an AND
>> instruction to compare against sig_specific_sicodes then that could
>> resolve true with any arbitrary value that happened to collide with
>> sig_specific_sicodes and result in an out-of-bounds access to
>> @sig_sicodes.  siginfo_layout() for example explicitly checks @sig
>> before indexing @sig_sicode, e.g. "sig < ARRAY_SIZE(sig_sicodes)".
>>
>> Maybe this?
>
> But sig is bounds checked.  Even better sig is checked to see if it
> is one of the values in the array.
>
>> From include/linux/signal.h
>
> #define SIG_SPECIFIC_SICODES_MASK (\
>   rt_sigmask(SIGILL)|  rt_sigmask(SIGFPE)| \
>   rt_sigmask(SIGSEGV)   |  rt_sigmask(SIGBUS)| \
>   rt_sigmask(SIGTRAP)   |  rt_sigmask(SIGCHLD)   | \
>   rt_sigmask(SIGPOLL)   |  rt_sigmask(SIGSYS)| \
>   SIGEMT_MASK)
>
> #define siginmask(sig, mask) \
>   ((sig) < SIGRTMIN && (rt_sigmask(sig) & (mask)))
>
> #define sig_specific_sicodes(sig) siginmask(sig, 
> SIG_SPECIFIC_SICODES_MASK)
>
>
>
> Hmm.  I wonder if something is passing in a negative signal number.
> There is not a bounds check for that.  A sufficiently large signal
> number might be the problem here.  Yes.  I can get an oops with
> a sufficiently large negative signal number.
>
> The code will later call valid_signal in check_permissions and
> that will cause the system call to fail, so the issue is just that
> the signal number is not being validated early enough.
>
> On the output path (copy_siginfo_to_user and copy_siginfo_to_user32) the
> signal number should be validated before it ever reaches userspace
> which is why I expect trinity never triggered anything.
>
> There is copy_siginfo_from_user32 and that does call siginfo_layout with
> a possibly negative signal number.  Which has the same potential issues.
>
> So I am going to go with the fix below.  That fixes things in my testing
> and by being unsigned should fix keep negative numbers from being a
> problem.

Sean, thank you very much for putting me on the right path to track this
failing test down.

Eric


Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman
Sean Christopherson  writes:

> On Wed, Oct 10, 2018 at 05:06:52PM -0500, Eric W. Biederman wrote:
>> ebied...@xmission.com (Eric W. Biederman) writes:
>> 
>> > So I am flummoxed.  I am reading through the code and I don't see
>> > anything that could trigger this, and when I ran the supplied reproducer
>> > it did not reproduce for me.
>> 
>> Even more so.  With my tool chain the line that reports the failing
>> address is impossible.
>> 
>> [   73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0
>> 
>> With the supplied configureation my tool chain only has 0x30 bytes for
>> all of copy_siginfo_from_user.  So I can't even begin to guess where
>> in that function things are failing.
>> 
>> Any additional information that you can provide would be a real help
>> in tracking down this strange failure.
>
> I don't have the exact toolchain, but I was able to get somewhat close
> and may have found a smoking gun.  0x4d in my build is in the general
> vicinity of "sig_sicodes[sig].limit" in known_siginfo_layout().  This
> lines up with the register state from the log, e.g. RDI=0500104d8,
> which is the mask generated by sig_specific_sicodes.  From what I can
> tell, @sig is never bounds checked.  If the compiler generated an AND
> instruction to compare against sig_specific_sicodes then that could
> resolve true with any arbitrary value that happened to collide with
> sig_specific_sicodes and result in an out-of-bounds access to
> @sig_sicodes.  siginfo_layout() for example explicitly checks @sig
> before indexing @sig_sicode, e.g. "sig < ARRAY_SIZE(sig_sicodes)".
>
> Maybe this?

But sig is bounds checked.  Even better sig is checked to see if it
is one of the values in the array.

From include/linux/signal.h

#define SIG_SPECIFIC_SICODES_MASK (\
rt_sigmask(SIGILL)|  rt_sigmask(SIGFPE)| \
rt_sigmask(SIGSEGV)   |  rt_sigmask(SIGBUS)| \
rt_sigmask(SIGTRAP)   |  rt_sigmask(SIGCHLD)   | \
rt_sigmask(SIGPOLL)   |  rt_sigmask(SIGSYS)| \
SIGEMT_MASK)

#define siginmask(sig, mask) \
((sig) < SIGRTMIN && (rt_sigmask(sig) & (mask)))

#define sig_specific_sicodes(sig)   siginmask(sig, 
SIG_SPECIFIC_SICODES_MASK)



Hmm.  I wonder if something is passing in a negative signal number.
There is not a bounds check for that.  A sufficiently large signal
number might be the problem here.  Yes.  I can get an oops with
a sufficiently large negative signal number.
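
For the record the reproducer does not need to be anything fancy;
something along these lines is enough (a sketch, not the exact 0day test
case):

#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	siginfo_t si;

	memset(&si, 0, sizeof(si));
	si.si_code = -1;	/* anything but SI_USER/SI_TKILL */

	/* A large negative "signal number" gets past siginmask()'s
	 * upper-bound-only check and reaches known_siginfo_layout()
	 * before valid_signal() ever runs. */
	syscall(SYS_rt_sigqueueinfo, getpid(), -0x10000000, &si);
	return 0;
}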

The code will later call valid_signal in check_permissions and
that will cause the system call to fail, so the issue is just that
the signal number is not being validated early enough.

On the output path (copy_siginfo_to_user and copy_siginfo_to_user32) the
signal number should be validated before it ever reaches userspace
which is why I expect trinity never triggered anything.

There is copy_siginfo_from_user32 and that does call siginfo_layout with
a possibly negative signal number.  Which has the same potential issues.

So I am going to go with the fix below.  That fixes things in my testing
and by being unsigned should keep negative numbers from being a
problem.

diff --git a/kernel/signal.c b/kernel/signal.c
index 2bffc5a50183..4fd431ce4f91 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2860,7 +2860,7 @@ static const struct {
[SIGSYS]  = { NSIGSYS,  SIL_SYS },
 };
 
-static bool known_siginfo_layout(int sig, int si_code)
+static bool known_siginfo_layout(unsigned sig, int si_code)
 {
if (si_code == SI_KERNEL)
return true;
@@ -2879,7 +2879,7 @@ static bool known_siginfo_layout(int sig, int si_code)
return false;
 }
 
-enum siginfo_layout siginfo_layout(int sig, int si_code)
+enum siginfo_layout siginfo_layout(unsigned sig, int si_code)
 {
enum siginfo_layout layout = SIL_KILL;
if ((si_code > SI_USER) && (si_code < SI_KERNEL)) {


Re: [Ksummit-discuss] [PATCH v2 0/3] code of conduct fixes

2018-10-10 Thread Eric W. Biederman
James Bottomley  writes:

> Resend to show accumulated tags and also to add a third patch listing
> the TAB as the reporting point as a few people seem to want.  If it
> gets the same level of support, I'll send it in with the other two.


There is also:

> Our Responsibilities
> 
> 
> Maintainers are responsible for clarifying the standards of acceptable 
> behavior
> and are expected to take appropriate and fair corrective action in response to
> any instances of unacceptable behavior.
> 
> Maintainers have the right and responsibility to remove, edit, or reject
> comments, commits, code, wiki edits, issues, and other contributions that are
> not aligned to this Code of Conduct, or to ban temporarily or permanently any
> contributor for other behaviors that they deem inappropriate, threatening,
> offensive, or harmful.

Which is very problematic.
a) In append-only logs like git we can not edit history.
   Making it a maintainer's responsibility to edit the history, to do
   the impossible, is a problem.

b) There are no responsibilities for people who are not Maintainers.
   That is another problem.

c) The entire tone of the responsibilities section is out of line with a
   community where there are no enforcement powers, only the power to
   accept or not accept a patch.  Only the power to persuade, not to
   enforce.

Overall in the discussions I have heard people talking about persuading,
educating, and not feeding trolls.  Nowhere have I heard people talking
about policing the community, which is what I understand that responsibility
section to be talking about.

Increasingly I am getting the feeling that this document does not fit the
Linux development community.  Perhaps a revert, and trying to come up
with better language from scratch, would be better.

I don't know how to rephrase that responsibility section, but if we don't
go with the revert it looks like something needs to be done there.

Eric










Re: [Ksummit-discuss] [PATCH v2 1/3] code-of-conduct: Fix the ambiguity about collecting email addresses

2018-10-10 Thread Eric W. Biederman
James Bottomley  writes:

> The current code of conduct has an ambiguity in that it considers publishing
> private information such as email addresses unacceptable behaviour.  Since
> the Linux kernel collects and publishes email addresses as part of the patch
> process, add an exception clause for email addresses ordinarily collected by
> the project to correct this ambiguity.

Acked-by: "Eric W. Biederman" 

>
> Fixes: 8a104f8b5867c682 ("Code of Conduct: Let's revamp it.")
> Acked-by: Geert Uytterhoeven 
> Acked-by: Shuah Khan 
> Acked-by: Guenter Roeck 
> Reviewed-by: Alan Cox 
> Reviewed-by: Mauro Carvalho Chehab 
> Signed-off-by: James Bottomley 
> ---
>  Documentation/process/code-of-conduct.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/process/code-of-conduct.rst 
> b/Documentation/process/code-of-conduct.rst
> index ab7c24b5478c..aa40e34e7785 100644
> --- a/Documentation/process/code-of-conduct.rst
> +++ b/Documentation/process/code-of-conduct.rst
> @@ -31,7 +31,7 @@ Examples of unacceptable behavior by participants include:
>  * Trolling, insulting/derogatory comments, and personal or political attacks
>  * Public or private harassment
>  * Publishing others’ private information, such as a physical or electronic
> -  address, without explicit permission
> +  address not ordinarily collected by the project, without explicit 
> permission
>  * Other conduct which could reasonably be considered inappropriate in a
>professional setting


Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> So I am flummoxed.  I am reading through the code and I don't see
> anything that could trigger this, and when I ran the supplied reproducer
> it did not reproduce for me.

Even more so.  With my tool chain the line that reports the failing
address is impossible.

[   73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0

With the supplied configuration my tool chain only has 0x30 bytes for
all of copy_siginfo_from_user.  So I can't even begin to guess where
in that function things are failing.

Any additional information that you can provide would be a real help
in tracking down this strange failure.

Thank you,
Eric Biederman



Re: [LKP] 4ce5f9c9e7 [ 1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 kmalloc_slab

2018-10-10 Thread Eric W. Biederman


So I am flummoxed.  I am reading through the code and I don't see
anything that could trigger this, and when I ran the supplied reproducer
it did not reproduce for me.

Plus there is the noise from the kmalloc_slab test that is goofing up
the subject line.

Is there any chance I can get a disassembly of the
copy_siginfo_from_user or post_copy_siginfo_from_user from your build?
I don't have the same tool chain.

Right now I am strongly suspecting that there is a memory stomp
somewhere, and the earlier tests just happen to stomp on something that
causes the pinpointed commit to misbehave.

Either that or it is simply that I don't have the latest and greatest
smep/smap hardware and there is an off by one I am not seeing.

I don't doubt that this test is finding something; I just haven't figured
out how to see what it is finding, and when I exercise the same code path
with my own tests everything appears to work.

Eric

kernel test robot  writes:

> Greetings,
>
> 0day kernel testing robot got the below dmesg and the first bad commit is
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
>
> commit 4ce5f9c9e7546915c559ffae594e6d73f918db00
> Author: Eric W. Biederman 
> AuthorDate: Tue Sep 25 12:59:31 2018 +0200
> Commit: Eric W. Biederman 
> CommitDate: Wed Oct 3 16:50:39 2018 +0200
>
> signal: Use a smaller struct siginfo in the kernel
> 
> We reserve 128 bytes for struct siginfo but only use about 48 bytes on
> 64bit and 32 bytes on 32bit.  Someday we might use more but it is unlikely
> to be anytime soon.
> 
> Userspace seems content with just enough bytes of siginfo to implement
> sigqueue.  Or in the case of checkpoint/restart reinjecting signals
> the kernel has sent.
> 
> Reducing the stack footprint and the work to copy siginfo around from
> 2 cachelines to 1 cachelines seems worth doing even if I don't have
> benchmarks to show a performance difference.
>     
> Suggested-by: Linus Torvalds 
> Signed-off-by: "Eric W. Biederman" 
>
> ae7795bc61  signal: Distinguish between kernel_siginfo and siginfo
> 4ce5f9c9e7  signal: Use a smaller struct siginfo in the kernel
> 570b7bdeaf  Add linux-next specific files for 20181009
> +-------------------------------------------+------------+------------+---------------+
> |                                           | ae7795bc61 | 4ce5f9c9e7 | next-20181009 |
> +-------------------------------------------+------------+------------+---------------+
> | boot_successes                            | 0          | 0          | 28            |
> | boot_failures                             | 1144       | 280        | 8             |
> | WARNING:at_mm/slab_common.c:#kmalloc_slab | 1144       | 280        |               |
> | RIP:kmalloc_slab                          | 1144       | 280        |               |
> | Mem-Info                                  | 1144       | 280        | 8             |
> | BUG:unable_to_handle_kernel               | 0          | 5          | 7             |
> | Oops:#[##]                                | 0          | 7          | 8             |
> | RIP:copy_siginfo_from_user                | 0          | 7          |               |
> | Kernel_panic-not_syncing:Fatal_exception  | 0          | 7          | 8             |
> | RIP:post_copy_siginfo_from_user           | 0          | 0          | 8             |
> +-------------------------------------------+------------+------------+---------------+
>
> [1.320405] test_overflow: ok: (s8)(0 << 7) == 0
> [1.321071] test_overflow: ok: (s16)(0 << 15) == 0
> [1.321756] test_overflow: ok: (int)(0 << 31) == 0
> [1.322442] test_overflow: ok: (s32)(0 << 31) == 0
> [1.323121] test_overflow: ok: (s64)(0 << 63) == 0
> [1.323881] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:1031 
> kmalloc_slab+0x17/0x70
> [1.324113] CPU: 0 PID: 1 Comm: swapper/0 Tainted: GT 
> 4.19.0-rc1-00077-g4ce5f9c #1
> [1.324113] RIP: 0010:kmalloc_slab+0x17/0x70
> [1.324113] Code: 00 00 00 83 3d 11 78 14 03 02 55 48 89 e5 5d 0f 97 c0 c3 
> 55 48 81 ff 00 00 40 00 48 89 e5 76 0e 31 c0 81 e6 00 02 00 00 75 4b <0f> 0b 
> eb 47 48 81 ff c0 00 00 00 77 19 48 85 ff b8 10 00 00 00 74
> [1.324113] RSP: :88000fc7fd50 EFLAGS: 00010246
> [1.324113] RAX:  RBX: 006000c0 RCX: 
> 88001fb68d47
> [1.324113] RDX: 0001 RSI:  RDI: 
> 
> [1.324113] RBP: 88000fc7fd50 R08: b128ac78 R09: 
> 0001
> [1.324113] R10: 0001 R11:  R12: 
> 88001d814800
> [1.324113] R13:

[REVIEW][PATCH 7/6] signal: In sigqueueinfo prefer sig not si_signo

2018-10-05 Thread Eric W. Biederman


Andrei Vagin  reported:

> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.
>
> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany  the  signal.   This
>argument  is  a  pointer to a structure of type siginfo_t, described in
>sigaction(2) (and defined  by  including  ).   The  caller
>should set the following fields in this structure:
>
>si_code
>   This  must  be  one of the SI_* codes in the Linux kernel source
>   file include/asm-generic/siginfo.h, with  the  restriction  that
>   the  code  must  be  negative (i.e., cannot be SI_USER, which is
>   used by the kernel to indicate a signal  sent  by  kill(2))  and
>   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used by the
>   kernel to indicate a signal sent using tgkill(2)).
>
>si_pid This should be set to a process ID, typically the process ID  of
>   the sender.
>
>si_uid This  should  be set to a user ID, typically the real user ID of
>   the sender.
>
>si_value
>   This field contains the user data to accompany the signal.   For
>   more information, see the description of the last (union sigval)
>   argument of sigqueue(3).
>
>Internally, the kernel sets the si_signo field to the  value  specified
>in  sig,  so that the receiver of the signal can also obtain the signal
>number via that field.
>
> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>>
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.

Looking at the kernel code, the historical behavior has always been to prefer
the signal number passed in to the syscall.

So sigh.  Implement __copy_siginfo_from_user and __copy_siginfo_from_user32 to
take that signal number and prefer it.  The ptrace path will still
use copy_siginfo_from_user and copy_siginfo_from_user32, as it does not and
never has had a separate signal number there.
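
To make the intended usage concrete, the rt_sigqueueinfo call site would end
up looking roughly like this (an illustrative sketch only, not the exact
wrappers from this patch):

SYSCALL_DEFINE3(rt_sigqueueinfo, pid_t, pid, int, sig,
		siginfo_t __user *, uinfo)
{
	kernel_siginfo_t info;
	/* force si_signo to the syscall's sig before validating the layout */
	int ret = __copy_siginfo_from_user(sig, &info, uinfo);
	if (unlikely(ret))
		return ret;
	return do_rt_sigqueueinfo(pid, sig, &info);
}

The ptrace PTRACE_SETSIGINFO path keeps using plain copy_siginfo_from_user,
since it has no separate signal number to prefer.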

Luckily this change has never made it farther than linux-next.

Fixes: e75dc036c445 ("signal: Fail sigqueueinfo if si_signo != sig")
Signed-off-by: "Eric W. Biederman" 
---

Andrei can you verify this fixes your programs?
Thank you very much,
Eric


 kernel/signal.c | 141 
 1 file changed, 84 insertions(+), 57 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 1c2dd117fee0..2bffc5a50183 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2925,11 +2925,10 @@ int copy_siginfo_to_user(siginfo_t __user *to, const 
kernel_siginfo_t *from)
return 0;
 }
 
-int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from)
+static int post_copy_siginfo_from_user(kernel_siginfo_t *info,
+  const siginfo_t __user *from)
 {
-   if (copy_from_user(to, from, sizeof(struct kernel_siginfo)))
-   return -EFAULT;
-   if (unlikely(!known_siginfo_layout(to->si_signo, to->si_code))) {
+   if (unlikely(!known_siginfo_layout(info->si_signo, info->si_code))) {
char __user *expansion = si_expansion(from);
char buf[SI_EXPANSION_SIZE];
int i;
@@ -2949,6 +2948,22 @@ int copy_siginfo_from_user(kernel_siginfo_t *to, const 
siginfo_t __user *from)
return 0;
 }
 
+static int __copy_siginfo_from_user(int signo, kernel_siginfo_t *to,
+   const siginfo_t __user *from)
+{
+   if (copy_from_user(to, from, sizeof(struct kernel_siginfo)))
+   return -EFAULT;
+   to->si_signo = signo;
+   return post_copy_siginfo_from_user(to, from);
+}
+
+int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from)
+{
+   if (copy_from_user(to, from, sizeof(struct kernel_siginfo)))
+   return -EFAULT;
+   return post_copy_siginfo_from_user(to, from);
+}
+
 #ifdef CONFIG_COMPAT
 int copy_siginfo_to_user32(struct compat_siginfo __user *to,
   const struct kernel_siginfo *from)
@@ -3041,88 +3056,106 @@ int __copy_siginfo_to_user32(struct compat_siginfo 
__user *to,
return 0;
 }
 
-int copy_siginfo_from_user32(struct kernel_siginfo *to,
-const struct compat_sigin

Re: [REVIEW][PATCH 2/6] signal: Fail sigqueueinfo if si_signo != sig

2018-10-05 Thread Eric W. Biederman
Andrei Vagin  writes:

> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>> The kernel needs to validate that the contents of struct siginfo make
>> sense as siginfo is copied into the kernel, so that the proper union
>> members can be put in the appropriate locations.  The field si_signo
>> is a fundamental part of that validation.  As such changing the
>> contents of si_signo after the validation make no sense and can result
>> in nonsense values in the kernel.
>
> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.

I wanted to say: look at copy_siginfo_from_user32, it uses si_signo and it
has always done so.  But before I got to fixing it, copy_siginfo_from_user32
only looked at si_code.  This is one of the areas where we deliberately
slightly changed the KABI, to start looking at the signal number instead of
magic kernel-internal si_code values.

So yes.  Looking at si_signo instead of the passed-in signo appears to
be a regression all of the way around (except for ptrace, where that
value is not present).  So I will see if I can figure out how to refactor
the code to accomplish that.

I am travelling for the next couple of days so I won't be able to get to
this for a bit.

I am thinking I will need two versions of copy_siginfo_from_user now:
one for ptrace and one for rt_sigqueueinfo.  Sigh.

> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany  the  signal.   This
>argument  is  a  pointer to a structure of type siginfo_t, described in
>sigaction(2) (and defined  by  including  ).   The  caller
>should set the following fields in this structure:
>
>si_code
>   This  must  be  one of the SI_* codes in the Linux kernel source
>   file include/asm-generic/siginfo.h, with  the  restriction  that
>   the  code  must  be  negative (i.e., cannot be SI_USER, which is
>   used by the kernel to indicate a signal  sent  by  kill(2))  and
>   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used by the
>   kernel to indicate a signal sent using tgkill(2)).
>
>si_pid This should be set to a process ID, typically the process ID  of
>   the sender.
>
>si_uid This  should  be set to a user ID, typically the real user ID of
>   the sender.
>
>si_value
>   This field contains the user data to accompany the signal.   For
>   more information, see the description of the last (union sigval)
>   argument of sigqueue(3).
>
>Internally, the kernel sets the si_signo field to the  value  specified
>in  sig,  so that the receiver of the signal can also obtain the signal
>number via that field.
>
>> 
>> As such simply fail if someone is silly enough to set si_signo out of
>> sync with the signal number passed to sigqueueinfo.
>> 
>> I don't expect a problem as glibc's sigqueue implementation sets
>> "si_signo = sig" and CRIU just returns to the kernel what the kernel
>> gave to it.
>> 
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.
>
>> 
>> Signed-off-by: "Eric W. Biederman" 
>> ---
>>  kernel/signal.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 7b49c31d3fdb..e445b0a63faa 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -3306,7 +3306,8 @@ static int do_rt_sigqueueinfo(pid_t pid, int sig, 
>> siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  /* POSIX.1b doesn't mention process groups.  */
>>  return kill_proc_info(sig, info, pid);
>> @@ -3354,7 +3355,8 @@ static int do_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, 
>> int sig, siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  return do_send_specific(tgid, pid, sig, info);
>>  }


Re: [REVIEW][PATCH 2/6] signal: Fail sigqueueinfo if si_signo != sig

2018-10-05 Thread Eric W. Biederman
Andrei Vagin  writes:

> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>> The kernel needs to validate that the contents of struct siginfo make
>> sense as siginfo is copied into the kernel, so that the proper union
>> members can be put in the appropriate locations.  The field si_signo
>> is a fundamental part of that validation.  As such changing the
>> contents of si_signo after the validation make no sense and can result
>> in nonsense values in the kernel.
>
> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.
>
> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany  the  signal.   This
>argument  is  a  pointer to a structure of type siginfo_t, described in
>sigaction(2) (and defined  by  including  ).   The  caller
>should set the following fields in this structure:
>
>si_code
>   This  must  be  one of the SI_* codes in the Linux kernel source
>   file include/asm-generic/siginfo.h, with  the  restriction  that
>   the  code  must  be  negative (i.e., cannot be SI_USER, which is
>   used by the kernel to indicate a signal  sent  by  kill(2))  and
>   cannot  (since  Linux  2.6.39) be SI_TKILL (which is used by the
>   kernel to indicate a signal sent using tgkill(2)).
>
>si_pid This should be set to a process ID, typically the process ID  of
>   the sender.
>
>si_uid This  should  be set to a user ID, typically the real user ID of
>   the sender.
>
>si_value
>   This field contains the user data to accompany the signal.   For
>   more information, see the description of the last (union sigval)
>   argument of sigqueue(3).
>
>Internally, the kernel sets the si_signo field to the  value  specified
>in  sig,  so that the receiver of the signal can also obtain the signal
>number via that field.
>
>> 
>> As such simply fail if someone is silly enough to set si_signo out of
>> sync with the signal number passed to sigqueueinfo.
>> 
>> I don't expect a problem as glibc's sigqueue implementation sets
>> "si_signo = sig" and CRIU just returns to the kernel what the kernel
>> gave to it.
>> 
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.

Fair enough.  Then this counts as a regression.  The setting in the
kernel happens in an awkward place and I will see if it can be moved
earlier.

Eric


>> Signed-off-by: "Eric W. Biederman" 
>> ---
>>  kernel/signal.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 7b49c31d3fdb..e445b0a63faa 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -3306,7 +3306,8 @@ static int do_rt_sigqueueinfo(pid_t pid, int sig, 
>> siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  /* POSIX.1b doesn't mention process groups.  */
>>  return kill_proc_info(sig, info, pid);
>> @@ -3354,7 +3355,8 @@ static int do_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, 
>> int sig, siginfo_t *info)
>>  (task_pid_vnr(current) != pid))
>>  return -EPERM;
>>  
>> -info->si_signo = sig;
>> +if (info->si_signo != sig)
>> +return -EINVAL;
>>  
>>  return do_send_specific(tgid, pid, sig, info);
>>  }


Re: Linux 4.19-rc4 released, an apology, and a maintainership note

2018-10-04 Thread Eric W. Biederman
Pavel Snajdr  writes:
>
> We started our organization (vpsFree.org) on top of OpenVZ patch set and are 
> now
> working to get vanilla up to the task of replacing the venerable 2.6.32-based
> OpenVZ 6 Linux-like thing. The new Code of Conduct is a guarantee for us, that
> we won't be laughed out of the room  and that our members won't be demotivated
> to contribute upstream - if we can all agree to be nice on each other; yet we
> still need you guys to tell us, when we're trying stupid things or going about
> things the wrong way, in some way that we will notice & can learn from.
> If I understand the context correctly, the previous "regime" could be the
> culprit, at least to some extent, for why we still don't have VM-like
> containers with vanilla.

At an implementation level namespaces and cgroups are hard.  Coming up
with a good solid design that is very maintainable and handles all of
the corner cases is difficult.  Very few people choose to do the work
of digging into the details and figuring out what is really needed.

This is not an area where you can hand hold someone.  It really takes
people who are able to work out from first principles what the code will
need to do.

Very often people will propose patches that do solve their specific case
but only do 10% or maybe 20% of what is needed for a general kernel
level solution.  For something that just works and does not cause
maintenance problems in the long run.

Someone has to deep dive and understand all of the problems and sort it
out.

That takes a person that is willing and able to stand up with all of the
rest of the kernel developers as an equal.   It requires listening to
other opinions to see where you need to change and where things are
wrong but it also requires being able to figure things out for yourself
and to come up with solid technical contributions.

I hope I have been something reasonable to work with on this front, if
not please let me know.

I know many other maintainers get hit with such a stream of bad
container ideas that they tend to stop listening.  As their priorities
are elsewhere I don't blame them.

Also don't forget that most of the time doing a good implementation
requires rewriting an entire subsystem to make it container friendly.
Think of the effort that requires, especially when you are not allowed
to cause regressions in the subsystem while rewriting it.

Further the only power a maintainer has is to accept patches, to listen
to people, and to express opinions that are worth listening to.  In the
midst of doing all of those things a maintainers time is limited.

You said a change in attitude gives you optimism that you can make things
work with the upstream kernel.  I am glad you have optimism, as overall the
kernel is a welcoming place.

At the same time there can't be guarantees that people won't be
demotivated.  If you care about the remaining kernel problems for
implementing containers, you need to realize those are difficult
problems that don't easily admit solutions.  That is why the problems
still remain.  To get a small flavor just look at how much work I had to
go through to sort out siginfo in the kernel which is just one very
small piece of the larger puzzle.  So please realize that sometimes
actually realizing the scope of the technical difficulties might be
demotivating in and of itself.

Similarly because maintainers have a limited amount of time there are no
guarantees how much we can help others.  We can try but people have to
meet maintainers at least half way in figuring out how things work
themselves, and sometimes there is just not enough time to say anything.
As the old saying goes: "You can lead a horse to water but you can't make
him drink".

So there are no guarantees that people won't be demotivated or that they
will learn what is necessary.  All that we can do is aim to keep
conversations polite and focused on the technical details of the project.
Which should keep things from getting unpleasant at the level of humans
interacting with humans.  I don't think that will give you greater
guarantees beyond that, and it feels like you are reading greater
guarantees into recent events.

I would like to see what you see as missing from the mainline
kernel.  But that is a topic for the containers list, and possibly for
the containers track at Linux Plumbers conference in Vancouver.

Eric


Re: linux-next: manual merge of the userns tree with the y2038 tree

2018-10-04 Thread Eric W. Biederman
Stephen Rothwell  writes:

> Hi Eric,
>
> Today's linux-next merge of the userns tree got a conflict in:
>
>   kernel/signal.c
>
> between commit:
>
>   49c39f8464a9 ("y2038: signal: Change rt_sigtimedwait to use 
> __kernel_timespec")
>
> from the y2038 tree and commit:
>
>   ae7795bc6187 ("signal: Distinguish between kernel_siginfo and siginfo")
>
> from the userns tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Thank you.

This is good to know about.

Eric


Re: Setting monotonic time?

2018-10-03 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Wed, 3 Oct 2018, Eric W. Biederman wrote:
>> Direct access to hardware/drivers and not through an abstraction like
>> the vfs (an abstraction over block devices) can legitimately be handled
>> by hotplug events.  I unplug one keyboard I plug in another.
>> 
>> I don't know if the input layer is more of a general abstraction
>> or more of a hardware device.  I have not dug into it but my guess
>> is abstraction from what I have heard.
>> 
>> The scary difficulty here is if after restart input is reporting times
>> in CLOCK_MONOTONIC and the applications in the namespace are talking
>> about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
>> with a fixed offset the times don't match up.
>> 
>> So what a time namespace absolutely needs to do is figure out how to deal
>> with all of the kernel interfaces reporting times and figure out how to
>> report them in the current time namespace.
>
> So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> many and which interfaces are involved aside of the obvious core timer
> ones. It's quite an amount and the problem is that you really need to do
> that at the interface level, because many of those time stamps are taken in
> contexts which are completely oblivious of name spaces. Ditto for timeouts
> and similar things which are handed in through these interfaces.

Yep.  That sounds right.

Eric


Re: [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace

2018-10-03 Thread Eric W. Biederman
Laurent Vivier  writes:

> This patch allows to have a different binftm_misc configuration
> in each container we mount binfmt_misc filesystem with mount namespace
> enabled.
>
> A container started without the CLONE_NEWNS will use the host binfmt_misc
> configuration, otherwise the container starts with an empty binfmt_misc
> interpreters list.
>
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreted without being root
> to run the binaries in this chroot.

A couple of things.
As has already been mentioned on your previous version, anything that
comes through the filesystem interface needs to look up the local
binfmt context not through current but through file->f_dentry->d_sb,
AKA the superblock of the mounted filesystem.
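
For illustration only, here is a minimal sketch of that lookup.  The
struct name, the helper name, and the use of s_fs_info are my own
assumptions, not part of the posted patch:

#include <linux/fs.h>

struct binfmt_namespace;	/* hypothetical per-mount binfmt_misc context */

static struct binfmt_namespace *file_binfmt_ns(struct file *file)
{
	/*
	 * d_sb is the superblock of the binfmt_misc instance this file
	 * was opened through; the context would be stashed in s_fs_info
	 * when that instance is mounted.
	 */
	return file->f_path.dentry->d_sb->s_fs_info;
}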

As you have this coded any time a mount namespace is unshared you get a
new binfmt context.  That has a very reasonable chance of breaking
existing userspace.

A mount of binfmt_misc from within a user namespace is not allowed
today, which is why I figure that would be a nice place to trigger
creating a new binfmt context.

It is fundamentally necessary to be able to get a pointer to the binfmt
context from current, either stored in an existing namespace or stored
in nsproxy.  Anything else will risk breaking backwards compatibility
with existing user space for no good reason.

What is fundamentally being changed is the behavior of exec.

Changing the behavior of exec needs to be carefully contained or we risk
confusing privileged applications.

I believe your last email to James Bottomley detailed a very strong use
case for this functionality.

Since the key gain over the existing kernel is unprivileged use, and
since it is the behavior of exec that is changing, you definitely need a
user namespace involved.

So I think the simplest would be to hang the binfmt context off of a
user namespace.  But I am open to other ideas.

My primary concern is that we keep the cognitive and the maintenance
burden as small as is reasonably possible so that the costs don't
outweigh the benefit.

Eric


> Signed-off-by: Laurent Vivier 
> ---
>  fs/binfmt_misc.c | 50 +---
>  fs/mount.h   |  8 
>  fs/namespace.c   |  6 ++
>  3 files changed, 40 insertions(+), 24 deletions(-)
>
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index aa4a7a23ff99..ecb14776c759 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "internal.h"
>  
> @@ -38,9 +39,6 @@ enum {
>   VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>  };
>  
> -static LIST_HEAD(entries);
> -static int enabled = 1;
> -
>  enum {Enabled, Magic};
>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>  #define MISC_FMT_OPEN_BINARY (1 << 30)
> @@ -60,10 +58,7 @@ typedef struct {
>   struct file *interp_file;
>  } Node;
>  
> -static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -91,7 +86,7 @@ static Node *check_file(struct linux_binprm *bprm)
>   struct list_head *l;
>  
>   /* Walk all the registered handlers. */
> - list_for_each(l, &entries) {
> + list_for_each(l, &binfmt_ns(entries)) {
>   Node *e = list_entry(l, Node, list);
>   char *s;
>   int j;
> @@ -135,15 +130,15 @@ static int load_misc_binary(struct linux_binprm *bprm)
>   int fd_binary = -1;
>  
>   retval = -ENOEXEC;
> - if (!enabled)
> + if (!binfmt_ns(enabled))
>   return retval;
>  
>   /* to keep locking time low, we copy the interpreter string */
> - read_lock(&entries_lock);
> + read_lock(&binfmt_ns(entries_lock));
>   fmt = check_file(bprm);
>   if (fmt)
>   dget(fmt->dentry);
> - read_unlock(&entries_lock);
> + read_unlock(&binfmt_ns(entries_lock));
>   if (!fmt)
>   return retval;
>  
> @@ -613,15 +608,15 @@ static void kill_node(Node *e)
>  {
>   struct dentry *dentry;
>  
> - write_lock(&entries_lock);
> + write_lock(&binfmt_ns(entries_lock));
>   list_del_init(&e->list);
> - write_unlock(&entries_lock);
> + write_unlock(&binfmt_ns(entries_lock));
>  
>   dentry = e->dentry;
>   drop_nlink(d_inode(dentry));
>   d_drop(dentry);
>   dput(dentry);
> - simple_release_fs(&bm_mnt, &entry_count);
> + simple_release_fs(&binfmt_ns(bm_mnt), &binfmt_ns(entry_count));
>  }
>  
>  /* / */
> @@ -716,7 +711,8 @@ static ssize_t bm_register_write(struct file *file, const 
> char __user *buffer,
>   if (!inode)
>   goto out2;
>  
> - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> + err = simple_pin_fs(&bm_fs_type, &binfmt_ns(bm_mnt),
> + &binfmt_ns(entry_count));
>   if (err) {
>   iput(inode);
>   inode = NULL;
> @@ -730,7 +726,8 @@ static ssize_t 

Re: Setting monotonic time?

2018-10-02 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Tue, 2 Oct 2018, Arnd Bergmann wrote:
>> On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner  wrote:
>> >
>> > On Mon, 1 Oct 2018, Eric W. Biederman wrote:
>> > > In the context of process migration there is a simpler subproblem that I
>> > > think it is worth exploring if we can do something about.
>> > >
>> > > For a cluster of machines all running with synchronized
>> > > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
>> > > machines.   Not having a matching CLOCK_MONOTONIC prevents successful
>> > > process migration between nodes in that cluster.
>> > >
>> > > Would it be possible to allow setting CLOCK_MONOTONIC at the very
>> > > beginning of time?  So that all of the nodes in a cluster can be in
>> > > sync?
>> > >
>> > > No change in skew just in offset for CLOCK_MONOTONIC.
>> > >
>> > > There are also dragons involved in coordinating things so that
>> > > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
>> > > know if allowing CLOCK_MONOTONIC to be set would be practical but it
>> > > seems work exploring all on it's own.
>> >
>> > It's used very early on in the kernel, so that would be a major surprise
>> > for many things including user space which has expectations on clock
>> > monotonic.
>> >
>> > It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
>> > the way you described and then in name spaces make it possible to magically
>> > map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
>> >
>> > It still wouldn't allow to have different NTP/PTP time domains, but might
>> > be a good start to address the main migration headaches.
>> 
>> If we make CLOCK_MONOTONIC settable this way in a namespace,
>> do you think that should include device drivers that report timestamps
>> in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
>> interfaces?
>
> Uurgh. That gets messy very fast.
>
>> Examples for drivers that can report timestamps are input, sound, v4l,
>> and drm. I think most of these can report stamps in either monotonic
>> or realtime base, while socket timestamps notably are always in
>> realtime.
>> 
>> We can probably get away with not setting the timebase for those
>> device drivers as long as the checkpoint/restart and migration features
>> are not expected to restore the state of an open character device
>> in that way. I don't know if that is a reasonable assumption to make
>> for the examples I listed.
>
> No idea. I'm not a container migration wizard.

Direct access to hardware/drivers and not through an abstraction like
the vfs (an abstraction over block devices) can legitimately be handled
by hotplug events.  I unplug one keyboard, I plug in another.

I don't know if the input layer is more of a general abstraction
or more of a hardware device.  I have not dug into it but my guess
is abstraction from what I have heard.

The scary difficulty here is if, after restart, input is reporting times
in CLOCK_MONOTONIC while the applications in the namespace are talking
about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue, as even
with a fixed offset the times don't match up.

So what a time namespace absolutely needs to do is figure out how to
deal with all of the kernel interfaces reporting times and figure out
how to report them in the current time namespace.

Eric








Setting monotonic time?

2018-10-01 Thread Eric W. Biederman


In the context of process migration there is a simpler subproblem that I
think it is worth exploring if we can do something about.

For a cluster of machines all running with synchronized clocks,
CLOCK_REALTIME matches but CLOCK_MONOTONIC does not match between
machines.  Not having a matching CLOCK_MONOTONIC prevents successful
process migration between nodes in that cluster.
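
(To make the mismatch concrete, a small userspace sketch; nothing here
is cluster-specific or from the original mail.  On nodes whose
CLOCK_REALTIME is synchronized, the realtime-minus-monotonic offset
exposes each node's private CLOCK_MONOTONIC epoch, so the printed value
will generally differ from node to node.)

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec rt, mono;
	long long off;

	clock_gettime(CLOCK_REALTIME, &rt);
	clock_gettime(CLOCK_MONOTONIC, &mono);

	off = (rt.tv_sec - mono.tv_sec) * 1000000000LL
	      + (rt.tv_nsec - mono.tv_nsec);

	/* compare across nodes: CLOCK_REALTIME agrees, these offsets do not */
	printf("realtime - monotonic = %lld ns\n", off);
	return 0;
}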

Would it be possible to allow setting CLOCK_MONOTONIC at the very
beginning of time?  So that all of the nodes in a cluster can be in
sync?

No change in skew just in offset for CLOCK_MONOTONIC.

There are also dragons involved in coordinating things so that
CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
know if allowing CLOCK_MONOTONIC to be set would be practical but it
seems worth exploring all on its own.

Dmitry, would setting CLOCK_MONOTONIC exactly once at boot time solve
the problem that you are looking at a time namespace to solve?

Eric


Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-01 Thread Eric W. Biederman
Thomas Gleixner  writes:

> Eric,
>
> On Fri, 28 Sep 2018, Eric W. Biederman wrote:
>> Thomas Gleixner  writes:
>> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> >> At the same time using the techniques from the nohz work and a little
>> >> smarts I expect we could get the code to scale.
>> >
>> > You'd need to invoke the update when the namespace is switched in and
>> > hasn't been updated since the last tick happened. That might be doable, but
>> > you also need to take the wraparound constraints of the underlying
>> > clocksources into account, which again can cause walking all name spaces
>> > when they are all idle long enough.
>> 
>> The wrap around constraints being how long before the time sources wrap
>> around so you have to read them once per wrap around?  I have not dug
>> deeply enough into the code to see that yet.
>
> It's done by limiting the NOHZ idle time when all CPUs are going into deep
> sleep for a long time, i.e. we make sure that at least one CPU comes back
> sufficiently _before_ the wraparound happens and invokes the update
> function.
>
> It's not so much a problem for TSC, but not every clocksource the kernel
> supports has wraparound times in the range of hundreds of years.
>
> But yes, your idea of keeping track of wraparounds might work. Tricky, but
> looks feasible on first sight, but we should be aware of the dragons.

Oh.  Yes.  Definitely.  A key enabler of any namespace implementation is
figuring out how to tame the dragons.

>> Please pardon me for thinking out load.
>> 
>> There are one or more time sources that we use to compute the time
>> and for each time source we have a conversion from ticks of the
>> time source to nanoseconds.
>> 
>> Each time source needs to be sampled at least once per wrap-around
>> and something incremented so that we don't loose time when looking
>> at that time source.
>> 
>> There are several clocks presented to userspace and they all share the
>> same length of second and are all fundamentally offsets from
>> CLOCK_MONOTONIC.
>
> Yes. That's the readout side. This one is doable. But now look at timers.
>
> If you arm the timer from a name space, then it needs to be converted to
> host time in order to sort it into the hrtimer queue and at some point arm
> the clockevent device for it. This works as long as host and name space
> time have a constant offset and the same skew.
>
> Once the name space time has a different skew this falls apart because the
> armed timer will either expire late or early.
>
> Late might be acceptable, early violates the spec. You could do an extra
> check for rescheduling it, if it's early, but that requires to store the
> name space time accessor in the hrtimer itself because not every timer
> expiry happens so that it can be checked in the name space context (think
> signal based timers). We need to add this extra magic right into
> __hrtimer_run_queues() which is called from the hard and soft interrupt. We
> really don't want to touch all relevant callbacks or syscalls. The latter
> is not sufficient anyway for signal based timer delivery.
>
> That's going to be interesting in terms of synchronization and might also
> cause substantial overhead at least for the timers which belong to name
> spaces.
>
> But that also means that anything which is early can and probably will
> cause rearming of the timer hardware possibly for a very short delta. We
> need to think about whether this can be abused to create interrupt storms.
>
> Now if you accept a bit late, which I'm not really happy about, then you
> surely won't accept very late, i.e. hours, days. But that can happen when
> settimeofday() comes into play. Right now with a single time domain, this
> is easy. When settimeofday() or adjtimex() makes time jump, we just go and
> reprogramm the hardware timers accordingly, which might also result in
> immediate expiry of timers.
>
> But this does not help for time jumps in name spaces because the timer is
> enqueued on the host time base.
>
> And no, we should not think about creating per name space hrtimer queues
> and then have to walk through all of them for finding the first expiring
> timer in order to arm the hardware. That cannot scale.
>
> Walking all hrtimer bases on all CPUs and check all queued timers whether
> they belong to the affected name space does not scale either.
>
> So we'd need to keep track of queued timers belonging to a name space and
> then just handle them. Interesting locking problem and also a scalability
> issue because this might need to be done on all online CPUs. Haven't
> though

Re: [RFC 0/2] ns: introduce binfmt_misc namespace

2018-10-01 Thread Eric W. Biederman
Laurent Vivier  writes:

> Le 01/10/2018 à 09:21, Eric W. Biederman a écrit :
>> Andy Lutomirski  writes:
>> 
>>> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier  wrote:
>>>>
>>>> This series introduces a new namespace for binfmt_misc.
>>>>
>>>
>>> This seems conceptually quite reasonable, but I'm wondering if the
>>> number of namespace types is getting out of hand given the current
>>> API.  Should we be considering whether we need a new set of namespace
>>> creation APIs that scale better to larger numbers of namespace types?
>> 
>> I would rather encourage a way to make this part of an existing
>> namespace or find a way to make a mount of binfmt_misc control this.
>> 
>> Hmm.  This looks like something that can be very straight forwardly be
>> made part of the user namespace.  If you ever mount binfmt_misc in the
>> user namespace you get the new behavior.  Otherwise you get the existing
>> behavior.
>
> Thank you. I'll do that.
>
>> A user namespace will definitely be required, as otherwise you run the
>> risk of confusing root (and suid root exectuables0 by being able to
>> change the behavior of executables.
>> 
>> What is the motivation for this?  My impression is that very few people
>> tweak binfmt_misc.
>
> I think more and more people are using an interpreter like qemu
> linux-usermode to have a cross-compilation environment: they bootstrap a
> distro filesystems (with something like debootstrap), and then use
> binfmt_misc to run the compiler inside this environment (see for
> instance [1] [2] [3] or [4] [5]). This is interesting because you have
> more than a cross-compiler with that: you have also all the libraries of
> the target system, you can select exactly which target release you want
> to build to, with the exact same compiler and libraries versions (and
> you can re-use it you want to do maintenance on your project 10 years
> later...)
>
> The problem with this is you need to be root:
> 1- to chroot
> 2- to configure binfmt_misc
>
> We already can use "unshare --map-root-user chroot" to address the point
> 1, and this series tries to address the point 2.
>
> I think it's also interesting to have a per container configuration for
> binfmt_misc when the server administrator configures it and don't want
> to share each user configuration with all the other user ones (in
> something like docker or a cloud application).

OK.  So it sounds like you are already needing a user namespace for
this.   If this is your use case then my proposed method above seems to
fit rather well.  James Bottomley was doing something similar that
connected to personality(2).  That might be worth a look to see if there
is some synergy there.

>> I also don't think this raises to the level where it makes sense to
>> create a new namespace for this.
>
> OK.
>
> Thanks,
> Laurent
>
> [1] https://wiki.debian.org/Arm64Qemu
> [2] https://wiki.debian.org/M68k/sbuildQEMU
> [3] https://wiki.debian.org/RISC-V#Manual_qemu-user_installation
> [4] https://kbeckmann.github.io/2017/05/26/QEMU-instead-of-cross-compiling/
> [5] https://wiki.gentoo.org/wiki/Crossdev_qemu-static-user-chroot


Re: [RFC 0/2] ns: introduce binfmt_misc namespace

2018-10-01 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier  wrote:
>>
>> This series introduces a new namespace for binfmt_misc.
>>
>
> This seems conceptually quite reasonable, but I'm wondering if the
> number of namespace types is getting out of hand given the current
> API.  Should we be considering whether we need a new set of namespace
> creation APIs that scale better to larger numbers of namespace types?

I would rather encourage a way to make this part of an existing
namespace or find a way to make a mount of binfmt_misc control this.

Hmm.  This looks like something that can very straightforwardly be
made part of the user namespace.  If you ever mount binfmt_misc in the
user namespace you get the new behavior.  Otherwise you get the existing
behavior.

A user namespace will definitely be required, as otherwise you run the
risk of confusing root (and suid root executables) by being able to
change the behavior of executables.

What is the motivation for this?  My impression is that very few people
tweak binfmt_misc.

I also don't think this rises to the level where it makes sense to
create a new namespace for this.

Eric


Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-28 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> Reading the code the calling sequence there is:
>> tick_sched_do_timer
>>tick_do_update_jiffies64
>>   update_wall_time
>>   timekeeping_advance
>>  timekeepging_update
>> 
>> If I read that properly under the right nohz circumstances that update
>> can be delayed indefinitely.
>> 
>> So I think we could prototype a time namespace that was per
>> timekeeping_update and just had update_wall_time iterate through
>> all of the time namespaces.
>
> Please don't go there. timekeeping_update() is already heavy and walking
> through a gazillion of namespaces will just make it horrible,
>
>> I don't think the naive version would scale to very many time
>> namespaces.
>
> :)
>
>> At the same time using the techniques from the nohz work and a little
>> smarts I expect we could get the code to scale.
>
> You'd need to invoke the update when the namespace is switched in and
> hasn't been updated since the last tick happened. That might be doable, but
> you also need to take the wraparound constraints of the underlying
> clocksources into account, which again can cause walking all name spaces
> when they are all idle long enough.

The wrap around constraints being how long before the time sources wrap
around so you have to read them once per wrap around?  I have not dug
deeply enough into the code to see that yet.

> From there it becomes hairy, because it's not only timekeeping,
> i.e. reading time, this is also affecting all timers which are armed from a
> namespace.
>
> That gets really ugly because when you do settimeofday() or adjtimex() for
> a particular namespace, then you have to search for all armed timers of
> that namespace and adjust them.
>
> The original posix timer code had the same issue because it mapped the
> clock realtime timers to the timer wheel so any setting of the clock caused
> a full walk of all armed timers, disarming, adjusting and requeing
> them. That's horrible not only performance wise, it's also a locking
> nightmare of all sorts.
>
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

Then it sounds like this will take some more digging.

Please pardon me for thinking out loud.

There are one or more time sources that we use to compute the time
and for each time source we have a conversion from ticks of the
time source to nanoseconds.

Each time source needs to be sampled at least once per wrap-around
and something incremented so that we don't lose time when looking
at that time source.

There are several clocks presented to userspace and they all share the
same length of second and are all fundamentally offsets from
CLOCK_MONOTONIC.

I see two fundamental driving cases for a time namespace.
1) Migration from one node to another node in a cluster in almost
   real time.

   The problem is that the CLOCK_MONOTONIC clocks on different nodes in
   the cluster have no relationship to each other (except a synchronized
   length of the second).  So applications that migrate can see
   CLOCK_MONOTONIC and CLOCK_BOOTTIME go backwards.

   This is the truly pressing problem and adding some kind of offset
   sounds like it would be the solution.  Possibly by allowing a boot
   time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.

2) Dealing with two separate time management domains.  Say a machine
   that needs to deal with both something inside of google, where they
   slew time to avoid leap seconds, and something in the outside world,
   where proper UTC time is kept as an offset from TAI with the
   occasional leap second.

   In the latter case it would fundamentally require having seconds of
   different length.


A pure 64bit nanosecond counter is good for 500 years.  So 64bit
variables can be used to hold time, and everything can be converted from
there.
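
(Back of the envelope: 2^64 ns is roughly 1.8e19 ns, which is about
1.8e10 seconds, or about 584 years; even a signed 63-bit nanosecond
count covers roughly 292 years.)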

This suggests that for ticks we can keep two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.

That sounds like it may be a little simplistic as it would require being
very diligent about firing a timer exactly at rollover and not losing
that, but for a handwaving argument it is probably enough to generate
a 64bit tick counter.
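
A handwaving sketch of that idea (illustrative only; the names and the
assumption that the hardware counter mask is 2^n - 1 are mine, this is
not actual timekeeping code):

#include <linux/bitops.h>
#include <linux/types.h>

struct wide_tick_counter {
	u64 rollovers;	/* times the hardware counter has wrapped */
	u64 last_raw;	/* last raw reading, within the counter's mask */
};

/* raw is the current hardware reading, already masked to mask = 2^n - 1 */
static u64 wide_tick_read(struct wide_tick_counter *c, u64 raw, u64 mask)
{
	if (raw < c->last_raw)		/* wrapped since the last sample */
		c->rollovers++;
	c->last_raw = raw;
	return (c->rollovers << fls64(mask)) | raw;
}

This only stays correct if it is sampled at least once per wrap-around,
which is exactly the diligence mentioned above.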

If the focus is on a 64bit tick counter then what update_wall_time
has to do is very limited.  Just deal with the accounting needed to cope
with
tick rollover.

Getting the actual time looks like it would be as simple as now, with
perhaps an extra addition to account for the number of times the tick
counter has rolled over.

Re: [PATCH V6 08/33] csky: Process management and Signal

2018-09-27 Thread Eric W. Biederman
Guo Ren  writes:

> --- /dev/null
> +++ b/arch/csky/abiv2/fpu.c
> +void fpu_fpe(struct pt_regs * regs)
> +{
> + int sig;
> + unsigned int fesr;
> + siginfo_t info;
> +
> + fesr = mfcr("cr<2, 2>");
> +
> + if(fesr & FPE_ILLE){
> + info.si_code = ILL_ILLOPC;
> + sig = SIGILL;
> + }
> + else if(fesr & FPE_IDC){
> + info.si_code = ILL_ILLOPN;
> + sig = SIGILL;
> + }
> + else if(fesr & FPE_FEC){
> + sig = SIGFPE;
> + if(fesr & FPE_IOC){
> + info.si_code = FPE_FLTINV;
> + }
> + else if(fesr & FPE_DZC){
> + info.si_code = FPE_FLTDIV;
> + }
> + else if(fesr & FPE_UFC){
> + info.si_code = FPE_FLTUND;
> + }
> + else if(fesr & FPE_OFC){
> + info.si_code = FPE_FLTOVF;
> + }
> + else if(fesr & FPE_IXC){
> + info.si_code = FPE_FLTRES;
> + }
> + else {
> + info.si_code = NSIGFPE;
> + }
> + }
> + else {
> + info.si_code = NSIGFPE;
> + sig = SIGFPE;
> + }
> + info.si_signo = SIGFPE;
> + info.si_errno = 0;
> + info.si_addr = (void *)regs->pc;
> + force_sig_info(sig, &info, current);
> +}


This use of sending a signal is buggy.  It results in undefined values
being copied to userspace.

Userspace should never be sent NSIGXXX as a si_code.  You can use
FPE_FLTUNK for this default case.

In new code please use force_sig_fault instead of force_sig_info.
That saves you the trouble of messing with struct siginfo.
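
Roughly like this (a sketch only, not compile-tested against the csky
tree; the FPE_* decode is taken from the patch above and FPE_FLTUNK is
the suggested default si_code):

#include <linux/sched/signal.h>	/* force_sig_fault() */

void fpu_fpe(struct pt_regs *regs)
{
	int sig = SIGFPE;
	int code = FPE_FLTUNK;
	unsigned int fesr = mfcr("cr<2, 2>");

	if (fesr & FPE_ILLE) {
		sig = SIGILL;
		code = ILL_ILLOPC;
	} else if (fesr & FPE_IDC) {
		sig = SIGILL;
		code = ILL_ILLOPN;
	} else if (fesr & FPE_FEC) {
		if (fesr & FPE_IOC)
			code = FPE_FLTINV;
		else if (fesr & FPE_DZC)
			code = FPE_FLTDIV;
		else if (fesr & FPE_UFC)
			code = FPE_FLTUND;
		else if (fesr & FPE_OFC)
			code = FPE_FLTOVF;
		else if (fesr & FPE_IXC)
			code = FPE_FLTRES;
	}

	/* force_sig_fault() fills in si_signo/si_code/si_addr itself */
	force_sig_fault(sig, code, (void __user *)regs->pc, current);
}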

Thank you very much,
Eric Biederman



Re: [PATCH v14 09/19] x86/mm: x86/sgx: Signal SEGV_SGXERR for #PFs w/ PF_SGX

2018-09-27 Thread Eric W. Biederman
Jarkko Sakkinen  writes:

> From: Sean Christopherson 
>
> Signal SIGSEGV(SEGV_SGXERR) for all faults with PF_SGX set in the
> error code.  The PF_SGX bit is set if and only if the #PF is detected
> by the Enclave Page Cache Map (EPCM), which is consulted only after
> an access walks the kernel's page tables, i.e.:
>
>   a. the access was allowed by the kernel
>   b. the kernel's tables have become less restrictive than the EPCM
>   c. the kernel cannot fixup the cause of the fault
>
> Noteably, (b) implies that either the kernel has botched the EPC
> mappings or the EPCM has been invalidated due to a power event.  In
> either case, userspace needs to be alerted so that it can take
> appropriate action, e.g. restart the enclave.  This is reinforced
> by (c) as the kernel doesn't really have any other reasonable option,
> e.g. we could kill the task or panic, but neither is warranted.
>
> Signed-off-by: Sean Christopherson 
> Cc: Dave Hansen 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/mm/fault.c | 20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 85d20516b2f3..3fb2b2838d6c 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -960,10 +960,13 @@ static noinline void
>  bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, struct vm_area_struct *vma)
>  {
> + int si_code = SEGV_ACCERR;
> +
>   if (bad_area_access_from_pkeys(error_code, vma))
> - __bad_area(regs, error_code, address, vma, SEGV_PKUERR);
> - else
> - __bad_area(regs, error_code, address, vma, SEGV_ACCERR);
> + si_code = SEGV_PKUERR;
> + else if (unlikely(error_code & X86_PF_SGX))
> + si_code = SEGV_SGXERR;
> + __bad_area(regs, error_code, address, vma, si_code);
>  }

This conflicts with a cleanup in this area I have sitting in linux-next.
It isn't in the x86 tree but you can find my siginfo tree at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git siginfo-next

In my tree bad area no longer takes a vma parameter.

If you are going to make changes to the fault handling code this cycle
please let's figure out how to build it on top of my clean ups.

Thank you,
Eric Biederman




Re: [PATCH v14 08/19] signal: x86/sgx: Add SIGSEGV siginfo code for SGX EPCM fault

2018-09-27 Thread Eric W. Biederman
Jarkko Sakkinen  writes:

> From: Sean Christopherson 
>
> The SGX Enclave Page Cache Map (EPCM) is a hardware-managed table
> that enforces accesses to an enclave's EPC page in addition to the
> software-managed kernel page tables, i.e. the effective permissions
> for an EPC page are a logical AND of the kernel's page tables and
> the corresponding EPCM entry.  The primary purpose of the EPCM is
> to prevent a malcious or compromised kernel from attacking an enclave
> by modifying the enclave's page tables.  The EPCM entires for an
> enclave are populated when the enclave is built and verified, using
> metadata provided by the enclave that is included in the measurement
> used to verify the enclave.
>
> In normal operation of a properly functioning, non-malicious kernel
> (and enclave), the EPCM permissions will never trigger a fault, i.e.
> the kernel may make the permissions for an EPC page more restrictive,
> e.g. mark it not-present to swap out the EPC page, but the kernel will
> never make its permissions less restrictive.
>
> But, there is a legitimate scenario in which the kernel's page tables
> can become less restrictive than the EPCM: on current hardware all
> enclaves are destroyed (by hardware) on a transition to S3 or lower
> sleep states, i.e. all EPCM entries are invalid (not-present) after
> the system resumes from its sleep state.
>
> Unfortunately, on CPUs that support only SGX1, EPCM violations result
> in a #GP.  The upside of the #GP is that no kernel changes are needed
> to deal with the EPCM being blasted away by hardware, e.g. userspace
> gets a SIGSEGV, assumes the EPCM was lost and restarts its enclave
> and everyone is happy.  The downside is that userspace has to assume
> the SIGSEGV was because the EPC was lost (or possibly do some leg work
> to rule out other causes).
>
> In SGX2, the oddity of delivering a #GP due to what are inherently
> paging related violations is remedied.  CPUs that support SGX2 deliver
> EPCM violations as #PFs with a new SGX error code bit set.  So, now
> that hardware provides us with a way to unequivocally determine that
> a fault was due to an EPCM violation, define a siginfo code for SIGSEGV
> so that the information can be passed onto userspace.
>
> Cc: Dave Hansen 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  include/uapi/asm-generic/siginfo.h | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/uapi/asm-generic/siginfo.h 
> b/include/uapi/asm-generic/siginfo.h
> index 80e2a7227205..fdd898e2325b 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -225,7 +225,11 @@ typedef struct siginfo {
>  #else
>  # define SEGV_PKUERR 4   /* failed protection key checks */
>  #endif
> +#ifdef __x86_64__
> +#define SEGV_SGXERR  5   /* SGX Enclave Page Cache Map fault */
> +#else
>  #define SEGV_ACCADI  5   /* ADI not enabled for mapped object */
> +#endif

Don't do this crazy ifdef thing.  si_codes are not supposed to be
per-architecture.  There are a few historical bugs, but with a 32-bit
space it is just stupid to add #ifdefs.

Just set:

	#define SEGV_SGXERR	8

and increase NSIGSEGV to match.

Anything else is just asking for trouble, especially when you want to
get SGX working on Itanium.

>  #define SEGV_ADIDERR 6   /* Disrupting MCD error */
>  #define SEGV_ADIPERR 7   /* Precise MCD exception */
>  #define NSIGSEGV 7
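
For concreteness, a sketch (assumed layout, not a patch from this thread)
of the arrangement being suggested: keep the existing codes as they are,
append the SGX code after them, and bump NSIGSEGV, instead of reusing
value 5 under an #ifdef:

	/* include/uapi/asm-generic/siginfo.h -- suggested shape */
	#define SEGV_ACCADI	5	/* ADI not enabled for mapped object */
	#define SEGV_ADIDERR	6	/* Disrupting MCD error */
	#define SEGV_ADIPERR	7	/* Precise MCD exception */
	#define SEGV_SGXERR	8	/* SGX Enclave Page Cache Map fault */
	#define NSIGSEGV	8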

Eric


