Re: [PATCH v5 0/4] man2: update mm/userfaultfd manpages to latest

2021-04-05 Thread Michael Kerrisk (man-pages)
Hi Alex,

> I applied all 4 patches (with a few minor fixes to 1/4 and 4/4 (cosmetic 
> fixes; some of them about the 80-col right margin)): 
> 

How big is your current queue of pending patches from others?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


man-pages-5.11 released

2021-03-22 Thread Michael Kerrisk (man-pages)
Gidday,

Alex Colomar and I are proud to announce:

man-pages-5.11 - man pages for Linux

This release resulted from patches, bug reports, reviews, and
comments from around 40 contributors. The release includes
around 480 commits that changed 950 (about 90% of the) pages.
With a 50k diff, this is one of the largest man-pages releases
in quite a long time.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_5.11

A short summary of the release is blogged at:
https://linux-man-pages.blogspot.com/2021/03/man-pages-511-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers of LKML is shown below.

Cheers,

Michael

 Changes in man-pages-5.11 

Released: 2021-03-21, Munich


New and rewritten pages
-----------------------

close_range.2
Stephen Kitt, Michael Kerrisk  [Christian Brauner]
New page documenting close_range(2)

process_madvise.2
Suren Baghdasaryan, Minchan Kim  [Michal Hocko, Alejandro Colomar,
Michael Kerrisk]
Document process_madvise(2)

fileno.3
Michael Kerrisk
Split fileno(3) content out of ferror(3) into new page
fileno(3) differs from the other functions in various ways.
For example, it is governed by different standards,
and can set 'errno'. Conversely, the other functions
are about examining the status of a stream, while
fileno(3) simply obtains the underlying file descriptor.
Furthermore, splitting this function out allows
for some cleaner upcoming changes in ferror(3).


Newly documented interfaces in existing pages
---------------------------------------------

epoll_wait.2
Willem de Bruijn  [Dmitry V. Levin]
Add documentation of epoll_pwait2()
Expand the epoll_wait() page with epoll_pwait2(), an epoll_wait()
variant that takes a struct timespec to enable nanosecond
resolution timeout.
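
A minimal sketch of calling epoll_pwait2() with a nanosecond timeout,
assuming a 64-bit system whose C library does not yet provide a wrapper.
The helper names ns_to_timespec() and epoll_wait_ns() are hypothetical,
and the fallback syscall number 441 is an assumption for headers that
predate the call:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef SYS_epoll_pwait2
#define SYS_epoll_pwait2 441    /* assumption: number on most architectures */
#endif

/* Hypothetical helper: split a nanosecond count into a struct timespec. */
static struct timespec ns_to_timespec(long long ns)
{
    struct timespec ts;

    assert(ns >= 0);
    ts.tv_sec = (time_t) (ns / 1000000000LL);
    ts.tv_nsec = (long) (ns % 1000000000LL);
    return ts;
}

/*
 * Hypothetical helper: wait on epfd for at most ns nanoseconds.
 * Returns as epoll_wait() does; on kernels without the system call it
 * returns -1 with errno set to ENOSYS.  Assumes that struct timespec
 * matches the kernel's 64-bit layout (true on 64-bit systems).
 */
static int epoll_wait_ns(int epfd, struct epoll_event *events,
                         int maxevents, long long ns)
{
    struct timespec ts = ns_to_timespec(ns);

    /* No signal mask is passed, so the final size argument is not used. */
    return (int) syscall(SYS_epoll_pwait2, epfd, events, maxevents,
                         &ts, NULL, (size_t) 8);
}
```

A caller that needs to run on older kernels could fall back to
epoll_wait() with a millisecond timeout when errno is ENOSYS.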

fanotify_init.2
fanotify.7
Jan Kara  [Steve Grubb]
Document FAN_AUDIT flag and FAN_ENABLE_AUDIT

madvise.2
Michael Kerrisk
Add descriptions of MADV_COLD and MADV_PAGEOUT
Taken from process_madvise(2).

openat2.2
Jens Axboe
Add RESOLVE_CACHED

prctl.2
Gabriel Krisman Bertazi
Document Syscall User Dispatch

mallinfo.3
Michael Kerrisk
Document mallinfo2() and note that mallinfo() is deprecated
Document the mallinfo2() function added in glibc 2.33.
Update example program to use mallinfo2()

system_data_types.7
Alejandro Colomar
Add off64_t to system_data_types(7)

ld.so.8
Michael Kerrisk
Document the --argv0 option added in glibc 2.33


Global changes
--------------

Various pages
Alejandro Colomar
SYNOPSIS: Use 'restrict' in prototypes
This change has been completed for *all* relevant pages
(around 135 pages in total).

Various pages
Alejandro Colomar  [Zack Weinberg]
Remove unused <sys/types.h>
The manual pages are already inconsistent in which headers need
to be included.  Right now, not all of the types used by a
function have their required header included in the SYNOPSIS.

If we were to add the headers required by all of the types used by
functions, the SYNOPSIS would grow too much.  Not only would it
grow too much, but the information there would be less precise.

Having system_data_types(7) document each type with all the
information about required includes is much more precise, and the
info is centralized so that it's much easier to maintain.

So let's document only the include required for the function
prototype, and also the ones required for the macros needed to
call the function.

<sys/types.h> only defines types, not functions or constants, so
it doesn't belong in man[23] (function) pages at all.

I don't know whether some old systems had headers that required
you to include <sys/types.h> *before* them (incomplete headers),
but if so, those implementations would be broken, and those headers
should probably provide some kind of warning.  I hope this is not
the case.

[mtk: Already in 2001, POSIX.1 removed the requirement to
include <sys/types.h> for many APIs, so this patch seems
well past due.]

_exit.2
abort.3
err.3
exit.3
pthread_exit.3
setjmp.3
Alejandro Colomar
SYNOPSIS: Use 'noreturn' in prototypes
Use standard C11 'noreturn' in these manual pages for
functions that do not return.
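
As a rough illustration of what the C11 'noreturn' specifier conveys in
a prototype, here is a small sketch.  The functions die() and
parse_positive() are made up for this example and do not come from any
manual page:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdnoreturn.h>

/* Hypothetical helper: report an error and terminate; never returns. */
static noreturn void die(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    exit(EXIT_FAILURE);
}

/* Parse a strictly positive decimal integer, or terminate the process. */
static long parse_positive(const char *s)
{
    char *end;
    long v;

    assert(s != NULL);
    v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || v <= 0)
        die("expected a positive integer");
    return v;   /* reached only when parsing succeeded */
}
```

Because die() is declared 'noreturn', the compiler knows that control
never continues past the call, so it does not warn about the path that
would otherwise fall through to the return statement with a bad value.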


Changes to 

Re: [PATCH v6] close_range.2: new page documenting close_range(2)

2021-03-21 Thread Michael Kerrisk (man-pages)
On 3/9/21 8:53 PM, Stephen Kitt wrote:
> Hi Michael,
> 
> On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
>  wrote:
>> Thanks for your patch revision. I've merged it, and have
>> done some light editing, but I still have a question:
> 
> Does this need anything more? I don’t see it in the man-pages repo.

Sorry, Stephen. It's just me being slow. I've made a few edits,
replaced the example program with another that more clearly allows
the user to see what's going on, and pushed to Git.

Thanks,

Michael



Re: [PATCH v6] close_range.2: new page documenting close_range(2)

2021-03-21 Thread Michael Kerrisk (man-pages)
Hello Stephen and Christian,

Late follow-up, I'm afraid...

On 1/29/21 11:00 AM, Christian Brauner wrote:
> On Thu, Jan 28, 2021 at 11:10:40PM +0100, Stephen Kitt wrote:
>> Hello Michael,
>>
>> On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
>>  wrote:
>>> Thanks for your patch revision. I've merged it, and have
>>> done some light editing, but I still have a question:
>>>
>>> On 1/23/21 5:11 PM, Stephen Kitt wrote:
>>>
>>> [...]
>>>
>>>> +.SH ERRORS  
>>>
>>>> +.TP
>>>> +.B EMFILE
>>>> +The per-process limit on the number of open file descriptors has been reached
>>>> +(see the description of
>>>> +.B RLIMIT_NOFILE
>>>> +in
>>>> +.BR getrlimit (2)).  
>>>
>>> I think there was already a question about this error, but
>>> I still have a doubt.
>>>
>>> A glance at the code tells me that indeed EMFILE can occur.
>>> But how can the reason be because the limit on the number
>>> of open file descriptors has been reached? I mean: no new
>>> FDs are being opened, so how can we go over the limit. I think
>>> the cause of this error is something else, but what is it?
>>
>> Here’s how I understand the code that can lead to EMFILE:
>>
>> * in __close_range(), if CLOSE_RANGE_UNSHARE is set, call unshare_fd() with
>>   CLONE_FILES to clone the fd table
>> * unshare_fd() calls dup_fd()
>> * dup_fd() allocates a new fdtable, and if the resulting fdtable ends up
>>   being too small to hold the number of fds calculated by
>>   sane_fdtable_size(), fails with EMFILE
>>
>> I suspect that, given that we’re starting with a valid fdtable, the only way
>> this can happen is if there’s a race with sysctl_nr_open being reduced.
> 
> Yes, and sysctls are racy by nature.

Got it, I think. I changed the error text here to:

   EMFILE The number of open file descriptors exceeds the limit spec‐
          ified in /proc/sys/fs/nr_open (see  proc(5)).   This  error
          can occur in situations where that limit was lowered before
          a call to close_range() where the CLOSE_RANGE_UNSHARE  flag
          is specified.
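
A minimal sketch of how a caller might surface this case.  The helper
close_range_unshared() is hypothetical, and the fallback definitions of
SYS_close_range (436) and CLOSE_RANGE_UNSHARE are assumptions for
systems whose headers predate the system call:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_close_range
#define SYS_close_range 436             /* assumption: most architectures */
#endif
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)   /* value from <linux/close_range.h> */
#endif

/*
 * Hypothetical helper: close descriptors first..last after unsharing
 * the file descriptor table.  Returns 0 on success, or a negative
 * errno value on failure.
 */
static int close_range_unshared(unsigned int first, unsigned int last)
{
    int err;

    assert(first <= last);
    if (syscall(SYS_close_range, first, last, CLOSE_RANGE_UNSHARE) == 0)
        return 0;

    err = errno;        /* save before stdio can change it */
    if (err == EMFILE)
        fprintf(stderr, "close_range: descriptor table exceeds "
                "/proc/sys/fs/nr_open; was the limit lowered?\n");
    return -err;
}
```

On kernels without close_range() the helper returns -ENOSYS, so a
caller can fall back to closing the descriptors one at a time.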

Thanks,

Michael



Re: [RFC v2] execve.2: SYNOPSIS: Document both glibc wrapper and kernel syscalls

2021-02-19 Thread Michael Kerrisk (man-pages)
Hey Alex,

On 2/18/21 4:13 PM, Alejandro Colomar wrote:
> Until now, the manual pages have (usually) documented only either
> the glibc (or another library) wrapper for a syscall, or the
> kernel syscall (this only when there's not a wrapper).
> 
> Let's document both prototypes, which many times are slightly
> different.  This will solve a problem where documenting glibc
> wrappers implied shadowing the documentation for the raw syscall.
> 
> Signed-off-by: Alejandro Colomar 

This patch also changes madvise.2, I suppose accidentally.

I'm still not sure whether I consider this change worthwhile
for cases like this where the differences between the libc
wrapper and the syscall are minor enough to probably
be irrelevant to user-space programmers. But, if we do
add something like this, I think a sentence or two
of English is desirable as well. Something like

   The kernel system call differs slightly from the glibc
   wrapper, in the addition of 'const' to two parameter
   declarations:

syscall(...)

But, before we go down this track, I'd like to get a sense 
of how many cases there are like this where there are these
small differences between the glibc wrapper and the syscall
interface. I'm not meaning you should check every system call
now.  But maybe you can let me know something like: of the first
20 system calls I checked, there were X system calls that had
such differences.
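
To make the 'const' difference concrete, here is a rough sketch that
execs a program through syscall(2) using the const-qualified parameter
types from the kernel prototype quoted in the patch.  The helper name
run_raw_execve() is made up for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork, exec 'path' in the child via the raw
 * system call, and return the child's exit status (-1 on failure).
 * The parameter types mirror the kernel prototype; passing these
 * arrays to the glibc execve() wrapper would need a cast, because the
 * wrapper declares argv and envp as 'char *const []'.
 */
static int run_raw_execve(const char *path, const char *const argv[],
                          const char *const envp[])
{
    pid_t pid;
    int status;

    assert(path != NULL && argv != NULL && envp != NULL);
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        syscall(SYS_execve, path, argv, envp);
        _exit(127);     /* reached only if the exec failed */
    }
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For most callers the two forms behave identically, which is the point
in question: the difference is visible only in the declared types.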

Thanks,

Michael

> ---
>  man2/execve.2 | 15 +--
>  man2/membarrier.2 | 14 +-
>  2 files changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/man2/execve.2 b/man2/execve.2
> index 027a0efd2..318c71c85 100644
> --- a/man2/execve.2
> +++ b/man2/execve.2
> @@ -41,8 +41,8 @@ execve \- execute program
>  .nf
>  .B #include <unistd.h>
>  .PP
> -.BI "int execve(const char *" pathname ", char *const " argv [],
> -.BI "   char *const " envp []);
> +.BI "int execve(const char *" pathname ",
> +.BI "   char *const " argv "[], char *const " envp []);
>  .fi
>  .SH DESCRIPTION
>  .BR execve ()
> @@ -772,6 +772,17 @@ Thus, this argument list was not directly usable in a further
>  .BR exec ()
>  call.
>  Since UNIX\ V7, both are NULL.
> +.SS C library/kernel differences
> +.RS 4
> +.nf
> +/* Kernel system call: */
> +.BR "#include <sys/syscall.h>" "/* For " SYS_* " constants */"
> +.B #include <unistd.h>
> +.PP
> +.BI "int syscall(SYS_execve, const char *" pathname ,
> +.BI "const char *const " argv "[], const char *const " envp []);
> +.fi
> +.RE
>  .\"
>  .\" .SH BUGS
>  .\" Some Linux versions have failed to check permissions on ELF
> diff --git a/man2/membarrier.2 b/man2/membarrier.2
> index 173195484..25d6add77 100644
> --- a/man2/membarrier.2
> +++ b/man2/membarrier.2
> @@ -28,13 +28,12 @@ membarrier \- issue memory barriers on a set of threads
>  .SH SYNOPSIS
>  .nf
>  .PP
> -.B #include <linux/membarrier.h>
> +.BR "#include <linux/membarrier.h>" "   /* For " MEMBARRIER_* " constants */"
> +.BR "#include <sys/syscall.h>" "/* For " SYS_* " constants */"
> +.B #include <unistd.h>
>  .PP
> -.BI "int membarrier(int " cmd ", unsigned int " flags ", int " cpu_id );
> +.BI "int syscall(SYS_membarrier, int " cmd ", unsigned int " flags ", int " cpu_id );
>  .fi
> -.PP
> -.IR Note :
> -There is no glibc wrapper for this system call; see NOTES.
>  .SH DESCRIPTION
>  The
>  .BR membarrier ()
> @@ -295,7 +294,7 @@ was:
>  .PP
>  .in +4n
>  .EX
> -.BI "int membarrier(int " cmd ", int " flags );
> +.BI "int syscall(SYS_membarrier, int " cmd ", int " flags );
>  .EE
>  .in
>  .SH CONFORMING TO
> @@ -322,9 +321,6 @@ Examples where
>  .BR membarrier ()
>  can be useful include implementations
>  of Read-Copy-Update libraries and garbage collectors.
> -.PP
> -Glibc does not provide a wrapper for this system call; call it using
> -.BR syscall (2).
>  .SH EXAMPLES
>  Assuming a multithreaded application where "fast_path()" is executed
>  very frequently, and where "slow_path()" is executed infrequently, the
> 




Re: [RFC] execve.2: SYNOPSIS: Document both glibc wrapper and kernel syscalls

2021-02-18 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 2/14/21 2:39 PM, Alejandro Colomar wrote:
> Until now, the manual pages have (usually) documented only either
> the glibc (or another library) wrapper for a syscall, or the raw
> syscall (this only when there's not a wrapper).
> 
> Let's document both prototypes, which many times are slightly
> different.  This will solve a problem where documenting glibc
> wrappers implied shadowing the documentation for the raw syscall.
> 
> It will also be much clearer for the reader where the syscall
> comes from (kernel? glibc? other?), by adding an explicit comment
> at the beginning of the prototypes.  This removes the need of
> scrolling down to NOTES to see that info.
> 
> Signed-off-by: Alejandro Colomar 
> ---
> 
> Hi all,
> 
> This is a prototype for doing some important changes to the SYNOPSIS
> of the man-pages.
> 
> The commit message above explains the idea quite well.  A few details
> that couldn't be shown on this commit are:
> 
> For cases where the wrapper is provided by a library other than glibc,
> I'd simply change the comment.  For example, for move_pages(2),
> it would say /* libnuma wrapper function: */.
> 
> I think this would make the small notes warning that there's no glibc
> wrapper function deprecated (but we could keep them for some time and
> decide that later).
> 
> While changing this, I'd also make sure that the headers are correct,
> and clearly differentiate which headers are needed for the raw syscall
> and for the wrapper function.
> 
> This change will probably take more than one release of the man-pages
> to complete.
> 
> Any thoughts?

My first impression is that I'm not keen on this. We'll add extra
text to all Section 2 pages, and in many (most?) cases the info
will be redundant (i.e., the wrapper and the syscall() notation
will express the same info). In other cases, I suspect the info
will be largely irrelevant to the user. To take an example: to 
whom will the difference that you document below for execve()
matter, how will it matter, and does it matter enough that we
headline the info in the pages? I'd want cogent answers to
those questions before considering a wide-ranging change.

There are indeed cases where the wrapper API differs in
significant ways from the syscall API (and these differences
are usually captured in the "C library/kernel differences"
subsections, such as for pselect()/pselect6() in select(2)).
But I imagine that that is the case in only a smallish
minority of the pages.

And indeed there are a very few syscalls that have wrappers
provided in another library. But it's a very small percentage
I think, and best documented case by case in specific pages.
The default presumption is that the wrapper is in the C library.

There are other cases where I think it may be worthwhile
considering the syscall() notation:

1. Where the system call has no wrapper. In that case, we might
   use the syscall() notation in the SYNOPSIS as both
   (a) a clear indication that there is no wrapper and
   (b) instructions to the reader about how to call the
   system call using syscall().

2. In cases where there is a "significant" difference between
   the wrapper and the system call. In this case, we might
   also place the syscall() notation in the SYNOPSIS, or
   (perhaps more likely) in the NOTES.
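
For case 1, a reader who follows the syscall() notation might end up
with something like the sketch below, using membarrier(2) as the
example.  The names membarrier_call() and membarrier_supports() are
hypothetical, and the code assumes that the kernel UAPI header
<linux/membarrier.h> is installed:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/membarrier.h>   /* For MEMBARRIER_* constants */
#include <sys/syscall.h>        /* For SYS_* constants */
#include <unistd.h>

/* Hypothetical thin wrapper for a system call that has none in glibc. */
static int membarrier_call(int cmd, unsigned int flags, int cpu_id)
{
    return (int) syscall(SYS_membarrier, cmd, flags, cpu_id);
}

/*
 * Hypothetical helper: return 1 if the kernel supports the single
 * command bit 'cmd', 0 if it does not, or -1 (with errno set) if the
 * query itself failed.
 */
static int membarrier_supports(int cmd)
{
    int mask;

    assert(cmd > 0);    /* a command bit, not MEMBARRIER_CMD_QUERY (0) */
    mask = membarrier_call(MEMBARRIER_CMD_QUERY, 0, 0);
    if (mask == -1)
        return -1;
    return (mask & cmd) != 0;
}
```

The SYNOPSIS in syscall() notation gives the reader both the argument
list and the headers needed for the constants, which is everything the
wrapper above relies on.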

Thanks,

Michael

> 
> Thanks,
> 
> Alex
> 
> ---
>  man2/execve.2 | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/man2/execve.2 b/man2/execve.2
> index 639e3b4b9..87ff022ce 100644
> --- a/man2/execve.2
> +++ b/man2/execve.2
> @@ -39,10 +39,18 @@
>  execve \- execute program
>  .SH SYNOPSIS
>  .nf
> +/* Glibc wrapper function: */
>  .B #include <unistd.h>
>  .PP
> -.BI "int execve(const char *" pathname ", char *const " argv [],
> -.BI "   char *const " envp []);
> +.BI "int execve(const char *" pathname ",
> +.BI "   char *const " argv "[], char *const " envp []);
> +.PP
> + /* Raw system call: */
> +.B #include <sys/syscall.h>
> +.B #include <unistd.h>
> +.PP
> +.BI "int syscall(SYS_execve, const char *" pathname ,
> +.BI "   const char *const " argv "[], const char *const " envp []);
>  .fi
>  .SH DESCRIPTION
>  .BR execve ()
> 




Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

2021-02-18 Thread Michael Kerrisk (man-pages)
Hello Suren,

>> Thanks. I added a few words to clarify this.
>
> Any link where I can see the final version?

Sure:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/process_madvise.2

Also rendered below.

Thanks,

Michael

NAME
   process_madvise - give advice about use of memory to a process

SYNOPSIS
   #include <sys/uio.h>

   ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                           size_t vlen, int advice,
                           unsigned int flags);

   Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
   The process_madvise() system call is used to give advice or direc‐
   tions to the kernel about the address ranges of another process or
   of  the  calling  process.  It provides the advice for the address
   ranges described by iovec and vlen.  The goal of such advice is to
   improve system or application performance.

   The  pidfd  argument  is a PID file descriptor (see pidfd_open(2))
   that specifies the process to which the advice is to be applied.

   The pointer iovec points to an array of iovec structures,  defined
   in <sys/uio.h> as:

       struct iovec {
           void  *iov_base;    /* Starting address */
           size_t iov_len;     /* Length of region */
       };

   The iovec structure describes address ranges beginning at iov_base
   address and with the size of iov_len bytes.

   The vlen specifies the number of elements in the iovec  structure.
   This value must be less than or equal to IOV_MAX (defined in
   <limits.h> or accessible via the call sysconf(_SC_IOV_MAX)).

   The advice argument is one of the following values:

   MADV_COLD
  See madvise(2).

   MADV_PAGEOUT
  See madvise(2).

   The flags argument is reserved for future use; currently, this ar‐
   gument must be specified as 0.

   The  vlen  and iovec arguments are checked before applying any ad‐
   vice.  If vlen is too big, or iovec is invalid, then an error will
   be returned immediately and no advice will be applied.

   The  advice might be applied to only a part of iovec if one of its
   elements points to an invalid memory region in the remote process.
   No further elements will be processed beyond that point.  (See the
   discussion regarding partial advice in RETURN VALUE.)

   Permission to apply advice to another process  is  governed  by  a
   ptrace   access   mode   PTRACE_MODE_READ_REALCREDS   check   (see
   ptrace(2)); in addition, because of the  performance  implications
   of applying the advice, the caller must have the CAP_SYS_ADMIN ca‐
   pability.

RETURN VALUE
   On success, process_madvise() returns the number of bytes advised.
   This  return  value may be less than the total number of requested
   bytes, if an error occurred after some iovec elements were already
   processed.   The caller should check the return value to determine
   whether a partial advice occurred.

   On error, -1 is returned and errno is set to indicate the error.

ERRORS
   EBADF  pidfd is not a valid PID file descriptor.

   EFAULT The memory described by iovec is outside the accessible ad‐
  dress space of the process referred to by pidfd.

   EINVAL flags is not 0.

   EINVAL The  sum of the iov_len values of iovec overflows a ssize_t
  value.

   EINVAL vlen is too large.

   ENOMEM Could not allocate memory for internal copies of the  iovec
  structures.

   EPERM  The  caller  does not have permission to access the address
  space of the process pidfd.

   ESRCH  The target process does not exist (i.e., it has  terminated
  and been waited on).

VERSIONS
   This  system  call first appeared in Linux 5.10.  Support for this
   system call is optional, depending on  the  setting  of  the  CON‐
   FIG_ADVISE_SYSCALLS configuration option.

CONFORMING TO
   The process_madvise() system call is Linux-specific.

NOTES
   Glibc does not provide a wrapper for this system call; call it us‐
   ing syscall(2).

SEE ALSO
   madvise(2),  pidfd_open(2),   process_vm_readv(2),
   process_vm_writev(2)
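
Since there is no glibc wrapper, a call through syscall(2) might look
roughly like the sketch below, which also checks for the partial-advice
case described under RETURN VALUE.  The helper name advise_cold() is
hypothetical, and the fallback values for SYS_pidfd_open (434),
SYS_process_madvise (440), and MADV_COLD (20) are assumptions for
systems with older headers:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434          /* assumption: most architectures */
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440     /* assumption: most architectures */
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hypothetical helper: apply MADV_COLD to the ranges in iov[0..vlen-1]
 * of process 'pid'.  Returns 1 if all requested bytes were advised,
 * 0 if the advice was only partially applied, or -1 on error (with
 * errno set).
 */
static int advise_cold(pid_t pid, const struct iovec *iov, size_t vlen)
{
    size_t total = 0;
    long n;
    int pidfd, saved;

    assert(iov != NULL && vlen > 0);
    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;

    pidfd = (int) syscall(SYS_pidfd_open, pid, 0U);
    if (pidfd == -1)
        return -1;

    n = syscall(SYS_process_madvise, pidfd, iov, vlen, MADV_COLD, 0U);
    saved = errno;      /* close() must not clobber the error */
    close(pidfd);
    errno = saved;

    if (n == -1)
        return -1;
    return (size_t) n == total;
}
```

A return value of 0 tells the caller that some iovec element referred
to an invalid region in the target and that later elements were not
processed.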




Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

2021-02-13 Thread Michael Kerrisk (man-pages)
Hello Suren,

On 2/2/21 11:12 PM, Suren Baghdasaryan wrote:
> Hi Michael,
> 
> On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages)
>  wrote:
>>
>> Hello Suren (and Minchan and Michal)
>>
>> Thank you for the revisions!
>>
>> I've applied this patch, and done a few light edits.
> 
> Thanks!
> 
>>
>> However, I have a question about undocumented pieces in *madvise(2)*,
>> as well as one other question. See below.
>>
>> On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
>>> Initial version of process_madvise(2) manual page. Initial text was
>>> extracted from [1], amended after fix [2] and more details added using
>>> man pages of madvise(2) and process_vm_readv(2) as examples. It also
>>> includes the changes to required permission proposed in [3].
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/1297933/
>>> [2] https://lkml.org/lkml/2020/12/8/1282
>>> [3] 
>>> https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311
>>>
>>> Signed-off-by: Suren Baghdasaryan 
>>> Reviewed-by: Michal Hocko 
>>> ---
>>> changes in v2:
>>> - Changed description of MADV_COLD per Michal Hocko's suggestion
>>> - Applied fixes suggested by Michael Kerrisk
>>> changes in v3:
>>> - Added Michal's Reviewed-by
>>> - Applied additional fixes suggested by Michael Kerrisk
>>>
>>> NAME
>>> process_madvise - give advice about use of memory to a process
>>>
>>> SYNOPSIS
>>> #include <sys/uio.h>
>>>
>>> ssize_t process_madvise(int pidfd,
>>>const struct iovec *iovec,
>>>unsigned long vlen,
>>>int advice,
>>>unsigned int flags);
>>>
>>> DESCRIPTION
>>> The process_madvise() system call is used to give advice or directions
>>> to the kernel about the address ranges of another process or the calling
>>> process. It provides the advice to the address ranges described by iovec
>>> and vlen. The goal of such advice is to improve system or application
>>> performance.
>>>
>>> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
>>> specifies the process to which the advice is to be applied.
>>>
>>> The pointer iovec points to an array of iovec structures, defined in
>>> <sys/uio.h> as:
>>>
>>> struct iovec {
>>> void  *iov_base;/* Starting address */
>>> size_t iov_len; /* Number of bytes to transfer */
>>> };
>>>
>>> The iovec structure describes address ranges beginning at iov_base address
>>> and with the size of iov_len bytes.
>>>
>>> The vlen represents the number of elements in the iovec structure.
>>>
>>> The advice argument is one of the values listed below.
>>>
>>>   Linux-specific advice values
>>> The following Linux-specific advice values have no counterparts in the
>>> POSIX-specified posix_madvise(3), and may or may not have counterparts
>>> in the madvise(2) interface available on other implementations.
>>>
>>> MADV_COLD (since Linux 5.4.1)
>>
>> I just noticed these version numbers now, and thought: they can't be
>> right (because the system call appeared only in v5.10). So I removed
>> them. But, of course in another sense the version numbers are (nearly)
>> right, since these advice values were added for madvise(2) in Linux 5.4.
>> However, they are not documented in the madvise(2) manual page. Is it
>> correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
>> meaning in madvise(2) (but just for the calling process, of course)?
> 
> Correct. They should be added in the madvise(2) man page as well IMHO.

So, I decided to move the description of MADV_COLD and MADV_PAGEOUT
to madvise(2) and refer to that page from the process_madvise(2)
page. This avoids repeating the same information in two places.

>>> Deactive a given range of pages which will make them a more probable
>>
>> I changed: s/Deactive/Deactivate/
> 
> thanks!
> 
>>
>>> reclaim target should there be a memory pressure. This is a
>>> nondestructive operation. The advice might be ignored for some pages
>>> in the range when it is not applicable.
>>>
>>> MADV_P

Re: [PATCH v2] ipc.2: Fix prototype parameter types

2021-02-09 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 2/7/21 1:36 PM, Alejandro Colomar wrote:
> The types for some of the parameters are incorrect
> (different than the kernel).  Fix them.
> Below are shown the types that the kernel uses.

Thanks. Patch applied.

Cheers,

Michael

> ..
> 
> .../linux$ grep_syscall ipc
> ipc/syscall.c:110:
> SYSCALL_DEFINE6(ipc, unsigned int, call, int, first, unsigned long, second,
>   unsigned long, third, void __user *, ptr, long, fifth)
> ipc/syscall.c:205:
> COMPAT_SYSCALL_DEFINE6(ipc, u32, call, int, first, int, second,
>   u32, third, compat_uptr_t, ptr, u32, fifth)
> include/linux/compat.h:874:
> asmlinkage long compat_sys_ipc(u32, int, int, u32, compat_uptr_t, u32);
> include/linux/syscalls.h:1221:
> asmlinkage long sys_ipc(unsigned int call, int first, unsigned long second,
>   unsigned long third, void __user *ptr, long fifth);
> .../linux$
> 
> function grep_syscall()
> {
>   if ! [ -v 1 ]; then
>   >&2 echo "Usage: ${FUNCNAME[0]} ";
>   return ${EX_USAGE};
>   fi
> 
>   find * -type f \
>   |grep '\.c$' \
>   |sort -V \
>   |xargs pcregrep -Mn "(?s)^\w*SYSCALL_DEFINE.\(${1},.*?\)" \
>   |sed -E 's/^[^:]+:[0-9]+:/&\n/';
> 
>   find * -type f \
>   |grep '\.[ch]$' \
>   |sort -V \
>   |xargs pcregrep -Mn "(?s)^asmlinkage\s+[\w\s]+\**sys_${1}\s*\(.*?\)" \
>   |sed -E 's/^[^:]+:[0-9]+:/&\n/';
> }
> 
> Signed-off-by: Alejandro Colomar 
> ---
>  man2/ipc.2 | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/man2/ipc.2 b/man2/ipc.2
> index 6589ffae6..a36e895a2 100644
> --- a/man2/ipc.2
> +++ b/man2/ipc.2
> @@ -27,9 +27,8 @@
>  ipc \- System V IPC system calls
>  .SH SYNOPSIS
>  .nf
> -.BI "int ipc(unsigned int " call ", int " first ", int " second \
> -", int " third ,
> -.BI "void *" ptr ", long " fifth );
> +.BI "int ipc(unsigned int " call ", int " first ", unsigned long " second ,
> +.BI "unsigned long " third ", void *" ptr ", long " fifth );
>  .fi
>  .PP
>  .IR Note :
> 




Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

2021-02-02 Thread Michael Kerrisk (man-pages)
Hello Suren (and Minchan and Michal)

Thank you for the revisions!

I've applied this patch, and done a few light edits.

However, I have a question about undocumented pieces in *madvise(2)*,
as well as one other question. See below. 

On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
> Initial version of process_madvise(2) manual page. Initial text was
> extracted from [1], amended after fix [2] and more details added using
> man pages of madvise(2) and process_vm_readv(2) as examples. It also
> includes the changes to required permission proposed in [3].
> 
> [1] https://lore.kernel.org/patchwork/patch/1297933/
> [2] https://lkml.org/lkml/2020/12/8/1282
> [3] 
> https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311
> 
> Signed-off-by: Suren Baghdasaryan 
> Reviewed-by: Michal Hocko 
> ---
> changes in v2:
> - Changed description of MADV_COLD per Michal Hocko's suggestion
> - Applied fixes suggested by Michael Kerrisk
> changes in v3:
> - Added Michal's Reviewed-by
> - Applied additional fixes suggested by Michael Kerrisk
> 
> NAME
> process_madvise - give advice about use of memory to a process
> 
> SYNOPSIS
> #include <sys/uio.h>
> 
> ssize_t process_madvise(int pidfd,
>const struct iovec *iovec,
>unsigned long vlen,
>int advice,
>unsigned int flags);
> 
> DESCRIPTION
> The process_madvise() system call is used to give advice or directions
> to the kernel about the address ranges of another process or the calling
> process. It provides the advice to the address ranges described by iovec
> and vlen. The goal of such advice is to improve system or application
> performance.
> 
> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> specifies the process to which the advice is to be applied.
> 
> The pointer iovec points to an array of iovec structures, defined in
> <sys/uio.h> as:
> 
> struct iovec {
> void  *iov_base;/* Starting address */
> size_t iov_len; /* Number of bytes to transfer */
> };
> 
> The iovec structure describes address ranges beginning at iov_base address
> and with the size of iov_len bytes.
> 
> The vlen represents the number of elements in the iovec structure.
> 
> The advice argument is one of the values listed below.
> 
>   Linux-specific advice values
> The following Linux-specific advice values have no counterparts in the
> POSIX-specified posix_madvise(3), and may or may not have counterparts
> in the madvise(2) interface available on other implementations.
> 
> MADV_COLD (since Linux 5.4.1)

I just noticed these version numbers now, and thought: they can't be
right (because the system call appeared only in v5.10). So I removed
them. But, of course in another sense the version numbers are (nearly)
right, since these advice values were added for madvise(2) in Linux 5.4.
However, they are not documented in the madvise(2) manual page. Is it
correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
meaning in madvise(2) (but just for the calling process, of course)?

> Deactive a given range of pages which will make them a more probable

I changed: s/Deactive/Deactivate/

> reclaim target should there be a memory pressure. This is a
> nondestructive operation. The advice might be ignored for some pages
> in the range when it is not applicable.
> 
> MADV_PAGEOUT (since Linux 5.4.1)
> Reclaim a given range of pages. This is done to free up memory occupied
> by these pages. If a page is anonymous it will be swapped out. If a
> page is file-backed and dirty it will be written back to the backing
> storage. The advice might be ignored for some pages in the range when
> it is not applicable.

[...]

> The hint might be applied to a part of iovec if one of its elements points
> to an invalid memory region in the remote process. No further elements will
> be processed beyond that point.

Is the above scenario the one that leads to the partial advice case described in
RETURN VALUE? If yes, perhaps I should add some words to make that clearer.

You can see the light edits that I made in
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=e3ce016472a1b3ec5dffdeb23c98b9fef618a97b
and following that I restructured DESCRIPTION a little in
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=3aac0708a9acee5283e091461de6a8410bc921a6

Thanks,

Michael




Re: [PATCH v2 1/1] process_madvise.2: Add process_madvise man page

2021-01-30 Thread Michael Kerrisk (man-pages)
Hello Suren,

Thank you for the revisions! Just a few more comments: all pretty small
stuff (many points that I overlooked the first time round), since the
page already looks pretty good by now.

Again, thanks for the rendered version. As before, I've added my
comments to the page source.

On 1/29/21 8:03 AM, Suren Baghdasaryan wrote:
> Initial version of process_madvise(2) manual page. Initial text was
> extracted from [1], amended after fix [2] and more details added using
> man pages of madvise(2) and process_vm_readv(2) as examples. It also
> includes the changes to required permission proposed in [3].
> 
> [1] https://lore.kernel.org/patchwork/patch/1297933/
> [2] https://lkml.org/lkml/2020/12/8/1282
> [3] https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
> changes in v2:
> - Changed description of MADV_COLD per Michal Hocko's suggestion
> - Applied fixes suggested by Michael Kerrisk
> 
> NAME
> process_madvise - give advice about use of memory to a process

s/-/\-/

> 
> SYNOPSIS
> #include 
> 
> ssize_t process_madvise(int pidfd,
>const struct iovec *iovec,
>unsigned long vlen,
>int advice,
>unsigned int flags);
> 
> DESCRIPTION
> The process_madvise() system call is used to give advice or directions
> to the kernel about the address ranges of other process as well as of
> the calling process. It provides the advice to address ranges of process
> described by iovec and vlen. The goal of such advice is to improve system
> or application performance.
> 
> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> specifies the process to which the advice is to be applied.
> 
> The pointer iovec points to an array of iovec structures, defined in
>  as:
> 
> struct iovec {
> void  *iov_base;/* Starting address */
> size_t iov_len; /* Number of bytes to transfer */
> };
> 
> The iovec structure describes address ranges beginning at iov_base address
> and with the size of iov_len bytes.
> 
> The vlen represents the number of elements in the iovec structure.
> 
> The advice argument is one of the values listed below.
> 
>   Linux-specific advice values
> The following Linux-specific advice values have no counterparts in the
> POSIX-specified posix_madvise(3), and may or may not have counterparts
> in the madvise(2) interface available on other implementations.
> 
> MADV_COLD (since Linux 5.4.1)
> Deactive a given range of pages which will make them a more probable
> reclaim target should there be a memory pressure. This is a non-
> destructive operation. The advice might be ignored for some pages in
> the range when it is not applicable.
> 
> MADV_PAGEOUT (since Linux 5.4.1)
> Reclaim a given range of pages. This is done to free up memory occupied
> by these pages. If a page is anonymous it will be swapped out. If a
> page is file-backed and dirty it will be written back to the backing
> storage. The advice might be ignored for some pages in the range when
> it is not applicable.
> 
> The flags argument is reserved for future use; currently, this argument
> must be specified as 0.
> 
> The value specified in the vlen argument must be less than or equal to
> IOV_MAX (defined in  or accessible via the call
> sysconf(_SC_IOV_MAX)).
> 
> The vlen and iovec arguments are checked before applying any hints. If
> the vlen is too big, or iovec is invalid, an error will be returned
> immediately.
> 
> The hint might be applied to a part of iovec if one of its elements points
> to an invalid memory region in the remote process. No further elements will
> be processed beyond that point.
> 
> Permission to provide a hint to another process is governed by a ptrace
> access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in addition,
> the caller must have the CAP_SYS_ADMIN capability due to performance
> implications of applying the hint.
> 
> RETURN VALUE
> On success, process_madvise() returns the number of bytes advised. This
> return value may be less than the total number of requested bytes, if an
> error occurred after some iovec elements were already processed. The caller
> should check the return value to determine whether a partial advice
> occurred.
> 
> On error, -1 is returned and errno is set to indicate the error.
> 
> ERRORS
> EFAULT The memory described by iovec is outside the accessible address
>space of the process referred to by pidfd.
> EINVAL flags is not 0.
> EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
> EINVAL vlen is too large.
> 
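
The draft above says that vlen and iovec are checked before any hint is
applied, and lists EINVAL for a vlen that is too large and for a sum of
iov_len values that overflows an ssize_t. A minimal user-space sketch of
those two checks follows. The helper name advice_args_plausible() is
hypothetical, and treating a NULL iovec with a nonzero vlen as invalid is
an assumption made for illustration; the kernel performs its own checks
regardless.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (not part of any library): mirrors in user space
 * the argument checks that the quoted draft associates with EINVAL.
 * It does not call process_madvise() itself. */
bool advice_args_plausible(const struct iovec *iov, size_t vlen)
{
    long iov_max = sysconf(_SC_IOV_MAX);  /* -1 if no limit is reported */
    size_t total = 0;

    /* Assumption: a NULL array with a nonzero count is never valid. */
    if (iov == NULL && vlen > 0)
        return false;

    /* "vlen is too large": more elements than IOV_MAX. */
    if (iov_max > 0 && vlen > (size_t)iov_max)
        return false;

    /* The sum of the iov_len values must fit in an ssize_t. */
    for (size_t i = 0; i < vlen; i++) {
        if (iov[i].iov_len > (size_t)SSIZE_MAX - total)
            return false;
        total += iov[i].iov_len;
    }
    return true;
}
```

A caller might run this before issuing the system call so that obviously
bad arguments are rejected early; a true result does not guarantee that
the call itself will succeed.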

Re: [PATCH v6] close_range.2: new page documenting close_range(2)

2021-01-28 Thread Michael Kerrisk (man-pages)
Hello Stephen, (and Christian, please!)


Thanks for your patch revision. I've merged it, and have
done some light editing, but I still have a question:

On 1/23/21 5:11 PM, Stephen Kitt wrote:

[...]

> +.SH ERRORS

> +.TP
> +.B EMFILE
> +The per-process limit on the number of open file descriptors has been reached
> +(see the description of
> +.B RLIMIT_NOFILE
> +in
> +.BR getrlimit (2)).

I think there was already a question about this error, but
I still have a doubt.

A glance at the code tells me that indeed EMFILE can occur.
But how can the reason be because the limit on the number
of open file descriptors has been reached? I mean: no new
FDs are being opened, so how can we go over the limit? I think
the cause of this error is something else, but what is it?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 1/1] process_madvise.2: Add process_madvise man page

2021-01-28 Thread Michael Kerrisk (man-pages)
Hello Suren,

On 1/28/21 7:40 PM, Suren Baghdasaryan wrote:
> On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages)
>  wrote:
>>
>> Hello Suren,
>>
>> Thank you for writing this page! Some comments below.
> 
> Thanks for the review!
> Couple questions below and I'll respin the new version once they are clarified.

Okay. See below.

>> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan  wrote:
>>>

[...]

Thanks for all the acks. That lets me know that you saw what I said.

>>> RETURN VALUE
>>> On success, process_madvise() returns the number of bytes advised. This
>>> return value may be less than the total number of requested bytes, if an
>>> error occurred. The caller should check return value to determine whether
>>> a partial advice occurred.
>>
>> So there are three return values possible,
> 
> Ok, I think I see your point. How about this instead:

Well, I'm glad you saw it, because I forgot to finish it. But yes,
you understood what I forgot to say.

> RETURN VALUE
>  On success, process_madvise() returns the number of bytes advised. This
>  return value may be less than the total number of requested bytes, if an
>  error occurred after some iovec elements were already processed. The caller
>  should check the return value to determine whether a partial advice occurred.
> 
> On error, -1 is returned and errno is set appropriately.

We recently standardized some wording here:
s/appropriately/to indicate the error/.
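
Read together, the RETURN VALUE wording discussed above suggests three
outcomes for a caller to distinguish: an error (-1), partial advice
(fewer bytes than requested), and complete advice. A minimal sketch of
that classification is below. The enum and the helper classify_advice()
are hypothetical names introduced for illustration, and the mapping is
our reading of the quoted text rather than anything the thread states
in code.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical names, for illustration only. */
enum advice_outcome { ADVICE_FAILED, ADVICE_PARTIAL, ADVICE_COMPLETE };

/* 'ret' is the value that a process_madvise() call returned for the
 * 'vlen' address ranges described by 'iov'. */
enum advice_outcome classify_advice(ssize_t ret, const struct iovec *iov,
                                    size_t vlen)
{
    size_t requested = 0;

    if (ret < 0)
        return ADVICE_FAILED;           /* -1: errno indicates the error */

    /* Total number of bytes the caller asked to have advised. */
    for (size_t i = 0; i < vlen; i++)
        requested += iov[i].iov_len;

    /* Fewer bytes advised than requested means partial advice. */
    return (size_t)ret < requested ? ADVICE_PARTIAL : ADVICE_COMPLETE;
}
```

On ADVICE_PARTIAL, a caller could work out from the byte count which
iovec element stopped the processing and decide whether to retry the
remaining elements.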


>>> +.PP
>>> +The pointer
>>> +.I iovec
>>> +points to an array of iovec structures, defined in
>>
>> "iovec" should be formatted as
>>
>> .I iovec
> 
> I think it is formatted that way above. What am I missing?

But also in "an array of iovec structures"...

> BTW, where should I be using .I vs .IR? I was looking for an answer
> but could not find it.

.B / .I == bold/italic this line
.BR / .IR == alternate bold/italic with normal (Roman) font.

So:
.I iovec
.I iovec ,   # so that comma is not italic
.BR process_madvise ()
etc.

[...]

>>> +.I iovec
>>> +if one of its elements points to an invalid memory
>>> +region in the remote process. No further elements will be
>>> +processed beyond that point.
>>> +.PP
>>> +Permission to provide a hint to external process is governed by a
>>> +ptrace access mode
>>> +.B PTRACE_MODE_READ_REALCREDS
>>> +check; see
>>> +.BR ptrace (2)
>>> +and
>>> +.B CAP_SYS_ADMIN
>>> +capability that caller should have in order to affect performance
>>> +of an external process.
>>
>> The preceding sentence is garbled. Missing words?
> 
> Maybe I worded it incorrectly. What I need to say here is that the
> caller should have both PTRACE_MODE_READ_REALCREDS credentials and
> CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted
> from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and
> tried adding the second one to it, obviously unsuccessfully. Any
> advice on how to fix that?

I think you already got pretty close. How about:

[[
Permission to provide a hint to another process is governed by a
ptrace access mode
.B PTRACE_MODE_READ_REALCREDS
check (see
.BR ptrace (2));
in addition, the caller must have the
.B CAP_SYS_ADMIN
capability.
]]

[...]

>>> +.TP
>>> +.B ESRCH
>>> +No process with ID
>>> +.I pidfd
>>> +exists.
>>
>> Should this maybe be:
>> [[
>> The target process does not exist (i.e., it has terminated and
>> been waited on).
>> ]]
>>
>> See pidfd_send_signal(2).
> 
> I "borrowed" mine from
> https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but
> either one sounds good to me. Maybe for pidfd_send_signal the wording
> about termination is more important. Anyway, it's up to you. Just let
> me know which one to use.

I think the pidfd_send_signal(2) wording fits better.

[...]

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 1/1] process_madvise.2: Add process_madvise man page

2021-01-28 Thread Michael Kerrisk (man-pages)
Hello Suren,

Thank you for writing this page! Some comments below.

On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan  wrote:
>
> Initial version of process_madvise(2) manual page. Initial text was
> extracted from [1], amended after fix [2] and more details added using
> man pages of madvise(2) and process_vm_readv(2) as examples. It also
> includes the changes to required permission proposed in [3].
>
> [1] https://lore.kernel.org/patchwork/patch/1297933/
> [2] https://lkml.org/lkml/2020/12/8/1282
> [3] https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311
>
> Signed-off-by: Suren Baghdasaryan 
> Signed-off-by: Minchan Kim 
> ---
>
> Adding the plain text version for ease of review:

Thanks for adding the rendered version. I will make my comments
against the source, below.

> NAME
> process_madvise - give advice about use of memory to a process
>
> SYNOPSIS
> #include 
>
> ssize_t process_madvise(int pidfd,
>const struct iovec *iovec,
>unsigned long vlen,
>int advice,
>unsigned int flags);
>
> DESCRIPTION
> The process_madvise() system call is used to give advice or directions to
> the kernel about the address ranges from external process as well as local
> process. It provides the advice to address ranges of process described by
> iovec and vlen. The goal of such advice is to improve system or application
> performance.
>
> The pidfd selects the process referred to by the PID file descriptor
> specified in pidfd (see pidfd_open(2) for further information).
>
> The pointer iovec points to an array of iovec structures, defined in
>  as:
>
> struct iovec {
> void  *iov_base;/* Starting address */
> size_t iov_len; /* Number of bytes to transfer */
> };
>
> The iovec describes address ranges beginning at iov_base address and with
> the size of iov_len bytes.
>
> The vlen represents the number of elements in iovec.
>
> The advice can be one of the values listed below.
>
>   Linux-specific advice values
> The following Linux-specific advice values have no counterparts in the
> POSIX-specified posix_madvise(3), and may or may not have counterparts in
> the madvise() interface available on other implementations.
>
> MADV_COLD (since Linux 5.4.1)
> Deactivate a given range of pages by moving them from active to
> inactive LRU list. This is done to accelerate the reclaim of these
> pages. The advice might be ignored for some pages in the range when it
> is not applicable.
> MADV_PAGEOUT (since Linux 5.4.1)
> Reclaim a given range of pages. This is done to free up memory occupied
> by these pages. If a page is anonymous it will be swapped out. If a
> page is file-backed and dirty it will be written back into the backing
> storage. The advice might be ignored for some pages in the range when
> it is not applicable.
>
> The flags argument is reserved for future use; currently, this argument must
> be specified as 0.
>
> The value specified in the vlen argument must be less than or equal to
> IOV_MAX (defined in  or accessible via the call
> sysconf(_SC_IOV_MAX)).
>
> The vlen and iovec arguments are checked before applying any hints. If the
> vlen is too big, or iovec is invalid, an error will be returned
> immediately.
>
> Hint might be applied to a part of iovec if one of its elements points to
> an invalid memory region in the remote process. No further elements will be
> processed beyond that point.
>
> Permission to provide a hint to external process is governed by a ptrace
> access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2) and
> CAP_SYS_ADMIN capability that caller should have in order to affect
> performance of an external process.
>
> RETURN VALUE
> On success, process_madvise() returns the number of bytes advised. This
> return value may be less than the total number of requested bytes, if an
> error occurred. The caller should check return value to determine whether
> a partial advice occurred.

So there are three return values possible,
> ERRORS
> EFAULT The memory described by iovec is outside the accessible address
>space of the process pid.

s/pid/
of the process referred to by
.IR pidfd .

> EINVAL flags is not 0.
> EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
> EINVAL vlen is too large.
> ENOMEM Could not allocate memory for internal copies of the iovec
>structures.
> EPERM The caller does not have permission to access the address space of
>   the process pidfd.
> ESRCH No process with ID pidfd exists.
>
> VERSIONS
> Since Linux 5.10, support for this system call 

Re: [PATCH 5/5] Add manpage for fsconfig(2)

2021-01-22 Thread Michael Kerrisk (man-pages)
Hello David,

Ping!

Thanks,

Michael

On Mon, 24 Aug 2020 at 14:25, David Howells  wrote:
>
> Add a manual page to document the fsconfig() system call.
>
> Signed-off-by: David Howells 
> ---
>
>  man2/fsconfig.2 |  277 +++
>  1 file changed, 277 insertions(+)
>  create mode 100644 man2/fsconfig.2
>
> diff --git a/man2/fsconfig.2 b/man2/fsconfig.2
> new file mode 100644
> index 0..da53d2fcb
> --- /dev/null
> +++ b/man2/fsconfig.2
> @@ -0,0 +1,277 @@
> +'\" t
> +.\" Copyright (c) 2020 David Howells 
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH FSCONFIG 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +fsconfig \- Filesystem parameterisation
> +.SH SYNOPSIS
> +.nf
> +.B #include 
> +.B #include 
> +.B #include 
> +.B #include 
> +.PP
> +.BI "int fsconfig(int *" fd ", unsigned int " cmd ", const char *" key ,
> +.br
> +.BI " const void __user *" value ", int " aux ");"
> +.br
> +.BI
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call.
> +.SH DESCRIPTION
> +.PP
> +.BR fsconfig ()
> +is used to supply parameters to and issue commands against a filesystem
> +configuration context as set up by
> +.BR fsopen (2)
> +or
> +.BR fspick (2).
> +The context is supplied attached to the file descriptor specified by
> +.I fd
> +argument.
> +.PP
> +The
> +.I cmd
> +argument indicates the command to be issued, where some of the commands simply
> +supply parameters to the context.  The meaning of
> +.IR key ", " value " and " aux
> +are command-dependent; unless required for the command, these should be set to
> +NULL or 0.
> +.PP
> +The available commands are:
> +.TP
> +.B FSCONFIG_SET_FLAG
> +Set the parameter named by
> +.IR key
> +to true.  This may fail with error
> +.B EINVAL
> +if the parameter requires an argument.
> +.TP
> +.B FSCONFIG_SET_STRING
> +Set the parameter named by
> +.I key
> +to a string.  This may fail with error
> +.B EINVAL
> +if the parser doesn't want a parameter here, wants a non-string or the string
> +cannot be interpreted appropriately.
> +.I value
> +points to a NUL-terminated string.
> +.TP
> +.B FSCONFIG_SET_BINARY
> +Set the parameter named by
> +.I key
> +to be a binary blob argument.  This may cause
> +.B EINVAL
> +to be returned if the filesystem parser isn't expecting a binary blob and it
> +can't be converted to something usable.
> +.I value
> +points to the data and
> +.I aux
> +indicates the size of the data.
> +.TP
> +.B FSCONFIG_SET_PATH
> +Set the parameter named by
> +.I key
> +to the object at the provided path.
> +.I value
> +should point to a NUL-terminated pathname string and aux may indicate
> +.B AT_FDCWD
> +or a file descriptor indicating a directory from which to begin a relative
> +path resolution.  This may fail with error
> +.B EINVAL
> +if the parameter isn't expecting a path; it may also fail if the path cannot
> +be resolved with the typical errors for that
> +.RB "(" ENOENT ", " ENOTDIR ", " EPERM ", " EACCES ", etc.)."
> +.IP
> +Note that FSCONFIG_SET_STRING can be used instead, implying AT_FDCWD.
> +.TP
> +.B FSCONFIG_SET_PATH_EMPTY
> +As FSCONFIG_SET_PATH, but with
> +.B AT_EMPTY_PATH
> +applied to the pathwalk.
> +.TP
> +.B FSCONFIG_SET_FD
> +Set the parameter named by
> +.I key
> +to the file descriptor specified by
> +.IR aux .
> +This will fail with
> +.B EINVAL
> +if the parameter doesn't expect a file descriptor or
> +.B EBADF
> +if the file descriptor is invalid.
> +.IP
> +Note that FSCONFIG_SET_STRING can be used instead with the file descriptor
> +passed as a decimal string.
> +.TP
> +.B FSCONFIG_CMD_CREATE
> +This command triggers the filesystem to take the parameters set in the context
> +and to try to create filesystem representation in the kernel.  If an existing
> +representation can be shared, the filesystem may do 

Re: [PATCH 3/5] Add manpage for fspick(2)

2021-01-22 Thread Michael Kerrisk (man-pages)
Hello David,

Ping!

Thanks,

Michael

On Mon, 24 Aug 2020 at 14:25, David Howells  wrote:
>
> Add a manual page to document the fspick() system call.
>
> Signed-off-by: David Howells 
> ---
>
>  man2/fspick.2 |  180 +
>  1 file changed, 180 insertions(+)
>  create mode 100644 man2/fspick.2
>
> diff --git a/man2/fspick.2 b/man2/fspick.2
> new file mode 100644
> index 0..72bf645dd
> --- /dev/null
> +++ b/man2/fspick.2
> @@ -0,0 +1,180 @@
> +'\" t
> +.\" Copyright (c) 2020 David Howells 
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH FSPICK 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +fspick \- Select filesystem for reconfiguration
> +.SH SYNOPSIS
> +.nf
> +.B #include 
> +.B #include 
> +.B #include 
> +.BR "#include" "/* Definition of AT_* constants */"
> +.PP
> +.BI "int fspick(int " dirfd ", const char *" pathname ", unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call.
> +.SH DESCRIPTION
> +.PP
> +.BR fspick ()
> +creates a new filesystem configuration context within the kernel and attaches a
> +pre-existing superblock to it so that it can be reconfigured (similar to
> +.BR mount (8)
> +with the "-o remount" option).  The configuration context is marked as being in
> +reconfiguration mode and attached to a file descriptor, which is returned to
> +the caller.  The file descriptor can be marked close-on-exec by setting
> +.B FSPICK_CLOEXEC
> +in
> +.IR flags .
> +.PP
> +The target is whichever superblock backs the object determined by
> +.IR dirfd ", " pathname " and " flags .
> +The following can be set in
> +.I flags
> +to control the pathwalk to that object:
> +.TP
> +.B FSPICK_SYMLINK_NOFOLLOW
> +Don't follow symbolic links in the final component of the path.
> +.TP
> +.B FSPICK_NO_AUTOMOUNT
> +Don't follow automounts in the final component of the path.
> +.TP
> +.B FSPICK_EMPTY_PATH
> +Allow an empty string to be specified as the pathname.  This allows
> +.I dirfd
> +to specify the target mount exactly.
> +.PP
> +After calling fspick(), the file descriptor should be passed to the
> +.BR fsconfig (2)
> +system call, using that to specify the desired changes to filesystem and
> +security parameters.
> +.PP
> +When the parameters are all set, the
> +.BR fsconfig ()
> +system call should then be called again with
> +.B FSCONFIG_CMD_RECONFIGURE
> +as the command argument to effect the reconfiguration.
> +.PP
> +After the reconfiguration has taken place, the context is wiped clean (apart
> +from the superblock attachment, which remains) and can be reused to make
> +another reconfiguration.
> +.PP
> +The file descriptor also serves as a channel by which more comprehensive error,
> +warning and information messages may be retrieved from the kernel using
> +.BR read (2).
> +.SS Message Retrieval Interface
> +The context file descriptor may be queried for message strings at any time by
> +calling
> +.BR read (2)
> +on the file descriptor.  This will return formatted messages that are prefixed
> +to indicate their class:
> +.TP
> +\fB"e "\fP
> +An error message string was logged.
> +.TP
> +\fB"i "\fP
> +An informational message string was logged.
> +.TP
> +\fB"w "\fP
> +A warning message string was logged.
> +.PP
> +Messages are removed from the queue as they're read and the queue has a limited
> +depth of 8 messages, so it's possible for some to get lost.
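
The prefix convention in the quoted "Message Retrieval Interface" text
("e ", "i ", "w ") can be sketched as a small classifier. This is only an
illustration: the enum, the helper fs_msg_classify(), and the treatment
of any other prefix as "unknown" are assumptions, not part of the
proposed interface.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum fs_msg_class { FS_MSG_ERROR, FS_MSG_INFO, FS_MSG_WARNING,
                    FS_MSG_UNKNOWN };

/* Classify one message, as returned by read(2) on the context file
 * descriptor, by its two-character prefix.  If 'text' is non-NULL, it
 * is set to the message body after the prefix (or to 'msg' itself when
 * the prefix is not recognized). */
enum fs_msg_class fs_msg_classify(const char *msg, const char **text)
{
    enum fs_msg_class cls;

    if (text != NULL)
        *text = msg;
    if (msg == NULL || msg[0] == '\0' || msg[1] != ' ')
        return FS_MSG_UNKNOWN;

    switch (msg[0]) {
    case 'e': cls = FS_MSG_ERROR;   break;
    case 'i': cls = FS_MSG_INFO;    break;
    case 'w': cls = FS_MSG_WARNING; break;
    default:  return FS_MSG_UNKNOWN;
    }

    if (text != NULL)
        *text = msg + 2;            /* skip the class letter and the space */
    return cls;
}
```

Because the quoted text says the queue holds only 8 messages, a caller
using such a helper would presumably want to drain the descriptor after
each fsconfig(2) step rather than once at the end.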
> +.SH RETURN VALUE
> +On success, the function returns a file descriptor.  On error, \-1 is returned,
> +and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +The error values given below result from filesystem type independent errors.
> +Additionally, each filesystem type may have its own special errors and its own
> +special behavior.  See the Linux kernel source code for 

Re: [PATCH 4/5] Add manpage for fsopen(2) and fsmount(2)

2021-01-22 Thread Michael Kerrisk (man-pages)
Hello David,

Ping!

Thanks,

Michael

On Fri, 16 Oct 2020 at 08:50, Michael Kerrisk (man-pages)
 wrote:
>
> Hi David,
>
> Another ping for these five patches please!
>
> Cheers,
>
> Michael
>
> On Fri, 11 Sep 2020 at 14:44, Michael Kerrisk (man-pages)
>  wrote:
> >
> > Hi David,
> >
> > A ping for these five patches please!
> >
> > Cheers,
> >
> > Michael
> >
> > On Wed, 2 Sep 2020 at 22:14, Michael Kerrisk (man-pages)
> >  wrote:
> > >
> > > On Wed, 2 Sep 2020 at 18:14, David Howells  wrote:
> > > >
> > > > Michael Kerrisk (man-pages)  wrote:
> > > >
> > > > > The term "filesystem configuration context" is introduced, but never
> > > > > really explained. I think it would be very helpful to have a sentence
> > > > > or three that explains this concept at the start of the page.
> > > >
> > > > Does that need a .7 manpage?
> > >
> > > I was hoping a sentence or a paragraph in this page might suffice. Do
> > > you think more is required?
> > >
> > > Cheers,
> > >
> > > Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 2/5] Add manpages for move_mount(2)

2021-01-22 Thread Michael Kerrisk (man-pages)
Hello David,

Ping!

Thanks,

Michael


On Mon, 24 Aug 2020 at 14:24, David Howells  wrote:
>
> Add manual pages to document the move_mount() system call.
>
> Signed-off-by: David Howells 
> ---
>
>  man2/move_mount.2 |  267 +
>  1 file changed, 267 insertions(+)
>  create mode 100644 man2/move_mount.2
>
> diff --git a/man2/move_mount.2 b/man2/move_mount.2
> new file mode 100644
> index 0..2ceb775d9
> --- /dev/null
> +++ b/man2/move_mount.2
> @@ -0,0 +1,267 @@
> +'\" t
> +.\" Copyright (c) 2020 David Howells 
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH MOVE_MOUNT 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +move_mount \- Move mount objects around the filesystem topology
> +.SH SYNOPSIS
> +.nf
> +.B #include 
> +.B #include 
> +.B #include 
> +.BR "#include" "/* Definition of AT_* constants */"
> +.PP
> +.BI "int move_mount(int " from_dirfd ", const char *" from_pathname ","
> +.BI "   int " to_dirfd ", const char *" to_pathname ","
> +.BI "   unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call.
> +.SH DESCRIPTION
> +The
> +.BR move_mount ()
> +call moves a mount from one place to another; it can also be used to attach an
> +unattached mount that was created by
> +.BR fsmount "() or " open_tree "() with " OPEN_TREE_CLONE .
> +.PP
> +If
> +.BR move_mount ()
> +is called repeatedly with a file descriptor that refers to a mount object,
> +then the object will be attached/moved the first time and then moved
> +repeatedly, detaching it from the previous mountpoint each time.
> +.PP
> +To access the source mount object or the destination mountpoint, no
> +permissions are required on the object itself, but if either pathname is
> +supplied, execute (search) permission is required on all of the directories
> +specified in
> +.IR from_pathname " or " to_pathname .
> +.PP
> +The caller does, however, require the appropriate privilege (Linux: the
> +.B CAP_SYS_ADMIN
> +capability) to move or attach mounts.
> +.PP
> +.BR move_mount ()
> +uses
> +.IR from_pathname ", " from_dirfd " and part of " flags
> +to locate the mount object to be moved and
> +.IR to_pathname ", " to_dirfd " and another part of " flags
> +to locate the destination mountpoint.  Each lookup can be done in one of a
> +variety of ways:
> +.TP
> +[*] By absolute path.
> +The pathname points to an absolute path and the dirfd is ignored.  The file is
> +looked up by name, starting from the root of the filesystem as seen by the
> +calling process.
> +.TP
> +[*] By cwd-relative path.
> +The pathname points to a relative path and the dirfd is
> +.IR AT_FDCWD .
> +The file is looked up by name, starting from the current working directory.
> +.TP
> +[*] By dir-relative path.
> +The pathname points to a relative path and the dirfd indicates a file descriptor
> +pointing to a directory.  The file is looked up by name, starting from the
> +directory specified by
> +.IR dirfd .
> +.TP
> +[*] By file descriptor.  The pathname is an empty string (""), the dirfd
> +points directly to the mount object to move or the destination mount point and
> +the appropriate
> +.B *_EMPTY_PATH
> +flag is set.
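
The four lookup styles listed above can be summarized as a decision on
the (dirfd, pathname, flag) triple. The sketch below is an illustration
only: the enum, the helper classify_lookup(), and the choice to report
an empty pathname without the *_EMPTY_PATH flag as invalid are
assumptions, not text from the patch.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum lookup_mode {
    LOOKUP_ABSOLUTE,      /* absolute path; dirfd ignored */
    LOOKUP_CWD_RELATIVE,  /* relative path, dirfd == AT_FDCWD */
    LOOKUP_DIR_RELATIVE,  /* relative path, dirfd refers to a directory */
    LOOKUP_BY_FD,         /* empty path plus the *_EMPTY_PATH flag */
    LOOKUP_INVALID        /* assumption: empty path without the flag */
};

enum lookup_mode classify_lookup(int dirfd, const char *pathname,
                                 bool empty_path_flag)
{
    if (pathname == NULL)
        return LOOKUP_INVALID;
    if (pathname[0] == '\0')
        return empty_path_flag ? LOOKUP_BY_FD : LOOKUP_INVALID;
    if (pathname[0] == '/')
        return LOOKUP_ABSOLUTE;
    if (dirfd == AT_FDCWD)
        return LOOKUP_CWD_RELATIVE;
    return LOOKUP_DIR_RELATIVE;
}
```

The same decision would apply separately to the "from" and "to" halves
of a move_mount() call, each with its own dirfd, pathname, and flag bit.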
> +.PP
> +.I flags
> +can be used to influence a path-based lookup.  The value for
> +.I flags
> +is constructed by OR'ing together zero or more of the following constants:
> +.TP
> +.BR MOVE_MOUNT_F_EMPTY_PATH
> +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> +If
> +.I from_pathname
> +is an empty string, operate on the file referred to by
> +.IR from_dirfd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag or
> +.BR open_tree ()).
> +If
> +.I from_dirfd
> +is
> +.BR AT_FDCWD ,
> +the call operates on the current working directory.
> +In this case,
> +.I 

Re: [PATCH 1/5] Add manpage for open_tree(2)

2021-01-22 Thread Michael Kerrisk (man-pages)
Hello David,

Ping!

Thanks,

Michael

On Thu, 27 Aug 2020 at 13:01, Michael Kerrisk (man-pages)
 wrote:
>
> Hello David,
>
> Can I ask that you please reply to each of my mails, rather than
> just sending out a new patch series (which of course I would also
> like you to do). Some things that I mentioned in the last mails
> got lost, and I end up having to repeat them.
>
> So, even where I say "please change this", a reply with "done", or with
> a reason why you declined the suggested change, is useful.
> But in any case, a few words in reply to explain the other changes
> that you make would be helpful.
>
> Also, some of my questions now will get a little more complex, and as
> well as you updating the pages, I think a little discussion may be
> required in some cases.
>
> On 8/24/20 2:24 PM, David Howells wrote:
> > Add a manual page to document the open_tree() system call.
> >
> > Signed-off-by: David Howells 
> > ---
> >
> >  man2/open_tree.2 |  249 ++
> >  1 file changed, 249 insertions(+)
> >  create mode 100644 man2/open_tree.2
> >
> > diff --git a/man2/open_tree.2 b/man2/open_tree.2
> > new file mode 100644
> > index 0..d480bd82f
> > --- /dev/null
> > +++ b/man2/open_tree.2
> > @@ -0,0 +1,249 @@
> > +'\" t
> > +.\" Copyright (c) 2020 David Howells 
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH OPEN_TREE 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +open_tree \- Pick or clone mount object and attach to fd
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include 
> > +.B #include 
> > +.B #include 
> > +.BR "#include" "/* Definition of AT_* constants */"
> > +.PP
> > +.BI "int open_tree(int " dirfd ", const char *" pathname ", unsigned int " 
> > flags );
> > +.fi
> > +.PP
> > +.IR Note :
> > +There are no glibc wrappers for these system calls.
> > +.SH DESCRIPTION
> > +.BR open_tree ()
> > +picks the mount object specified by the pathname and attaches it to a new 
> > file
>
> The terminology "pick" is unusual, and you never really explain what
> it means.  Is there better terminology? In any case, can you add a few
> words to explain what the term ("pick" or whatever alternative you
> come up with) means?
>
> > +descriptor or clones it and attaches the clone to the file descriptor.  The
>
> Please replace "it" by a noun (phrase) -- maybe: "the mount object"?
>
> > +resultant file descriptor is indistinguishable from one produced by
> > +.BR open "(2) with " O_PATH .
>
> What is the significance of that last piece? Can you add some words
> about why the fact that the resulting FD is indistinguishable from one
> produced by open() O_PATH matters or is useful?
>
> > +.PP
> > +In the case that the mount object is cloned, the clone will be "unmounted" 
> > and
>
> You place "unmounted" in quotes. Why? Is this to signify that the
> unmount is somehow different from other unmounts? If so, please
> explain how it is different.  If not, then I think we can lose the double

Re: [PATCH v27 12/12] landlock: Add user and kernel documentation

2021-01-22 Thread Michael Kerrisk (man-pages)
Hello Mickaël,

It would be great to have some manual pages for these system calls
before release... Can you prepare something?

Thanks,

Michael

On Thu, 21 Jan 2021 at 21:51, Mickaël Salaün  wrote:
>
> From: Mickaël Salaün 
>
> This documentation can be built with the Sphinx framework.
>
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 
> Reviewed-by: Vincent Dagonneau 
> ---
>
> Changes since v25:
> * Explain the behavior of layered access rights.
> * Explain how bind mounts and overlayfs mounts are handled by Landlock:
>   merged overlayfs mount points have their own inodes, which makes these
>   hierarchies independent from its upper and lower layers, unlike bind
>   mounts which share the same inodes between the source hierarchy and
>   the mount point hierarchy.
>   New overlayfs mount and bind mount tests check these behaviors.
> * Synchronize with the new syscalls.c file and update syscall names.
> * Fix spelling.
> * Remove Reviewed-by Jann Horn because of the above changes.
>
> Changes since v24:
> * Add Reviewed-by Jann Horn.
> * Add a paragraph to explain how the ruleset layers work.
> * Bump date.
>
> Changes since v23:
> * Explain limitations for the maximum number of stacked ruleset, and the
>   memory usage restrictions.
>
> Changes since v22:
> * Fix spelling and remove obsolete sentence (spotted by Jann Horn).
> * Bump date.
>
> Changes since v21:
> * Move the user space documentation to userspace-api/landlock.rst and
>   the kernel documentation to security/landlock.rst .
> * Add license headers.
> * Add last update dates.
> * Update MAINTAINERS file.
> * Add (back) links to git.kernel.org .
> * Fix spelling.
>
> Changes since v20:
> * Update examples and documentation with the new syscalls.
>
> Changes since v19:
> * Update examples and documentation with the new syscalls.
>
> Changes since v15:
> * Add current limitations.
>
> Changes since v14:
> * Fix spelling (contributed by Randy Dunlap).
> * Extend documentation about inheritance and explain layer levels.
> * Remove the use of now-removed access rights.
> * Use GitHub links.
> * Improve kernel documentation.
> * Add section for tests.
> * Update example.
>
> Changes since v13:
> * Rewrote the documentation according to the major revamp.
>
> Previous changes:
> https://lore.kernel.org/lkml/20191104172146.30797-8-...@digikod.net/
> ---
>  Documentation/security/index.rst |   1 +
>  Documentation/security/landlock.rst  |  79 ++
>  Documentation/userspace-api/index.rst|   1 +
>  Documentation/userspace-api/landlock.rst | 306 +++
>  MAINTAINERS  |   2 +
>  5 files changed, 389 insertions(+)
>  create mode 100644 Documentation/security/landlock.rst
>  create mode 100644 Documentation/userspace-api/landlock.rst
>
> diff --git a/Documentation/security/index.rst 
> b/Documentation/security/index.rst
> index 8129405eb2cc..16335de04e8c 100644
> --- a/Documentation/security/index.rst
> +++ b/Documentation/security/index.rst
> @@ -16,3 +16,4 @@ Security Documentation
> siphash
> tpm/index
> digsig
> +   landlock
> diff --git a/Documentation/security/landlock.rst 
> b/Documentation/security/landlock.rst
> new file mode 100644
> index ..244e616d3d7a
> --- /dev/null
> +++ b/Documentation/security/landlock.rst
> @@ -0,0 +1,79 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. Copyright © 2017-2020 Mickaël Salaün 
> +.. Copyright © 2019-2020 ANSSI
> +
> +==
> +Landlock LSM: kernel documentation
> +==
> +
> +:Author: Mickaël Salaün
> +:Date: January 2021
> +
> +Landlock's goal is to create scoped access-control (i.e. sandboxing).  To
> +harden a whole system, this feature should be available to any process,
> +including unprivileged ones.  Because such process may be compromised or
> +backdoored (i.e. untrusted), Landlock's features must be safe to use from the
> +kernel and other processes point of view.  Landlock's interface must 
> therefore
> +expose a minimal attack surface.
> +
> +Landlock is designed to be usable by unprivileged processes while following 
> the
> +system security policy enforced by other access control mechanisms (e.g. DAC,
> +LSM).  Indeed, a Landlock rule shall not interfere with other access-controls
> +enforced on the system, only add more restrictions.
> +
> +Any user can enforce Landlock rulesets on their processes.  They are merged 
> and
> +evaluated according to the inherited ones in a way that ensures that only 
> more
> +constraints can be added.
> +
> +User space documentation can be found here: :doc:`/userspace-api/landlock`.
> +
> +Guiding principles for safe access controls
> +===
> +
> +* A Landlock rule shall be focused on access control on kernel objects 
> instead
> +  of syscall filtering (i.e. syscall arguments), which is the purpose of
> +  seccomp-bpf.
> 

Re: [PATCH] entry: Use different define for selector variable in SUD

2021-01-20 Thread Michael Kerrisk (man-pages)
Hello all,

On Sat, 2 Jan 2021 at 00:55, Gabriel Krisman Bertazi
 wrote:
>
> Michael Kerrisk suggested that, from an API perspective, it is a bad
> idea to share the PR_SYS_DISPATCH_ defines between the prctl operation
> and the selector variable.  Therefore, define two new constants to be
> used by SUD's selector variable, and the corresponding documentation.
>
> While this changes the API, it is backward compatible, as the values
> remained the same and the old defines are still in place.  In addition,
> SUD has never been part of a Linux release, it will show up for the
> first time in 5.11.

Would it be possible to get this patch applied before 5.11 is released please?

To add some background, while reviewing a patch that Gabriel wrote
to document this feature, I encountered a confusion that I'm sure many
others would encounter also. My initial comments were:

[[
The value of arg2 can be either PR_SYS_DISPATCH_ON or
PR_SYS_DISPATCH_OFF. The value of the selector pointed to by
arg5 can likewise be PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.
What is the relationship between these two attributes? For example,
what does it mean if arg2 is PR_SYS_DISPATCH_ON and, at the time of
the prctl() call, the selector has the value PR_SYS_DISPATCH_OFF?
]]

The issue is that the same names are being used in two parts of the
API with *different* meanings:

1. Define/clear SUD/the non-SUD memory region
2. Enable/disable SUD filtering in the SUD memory region (i.e., the
part of the virtual address space outside the region defined in 1).

In API design terms this feels wrong and is confusing. The numeric
values don't need to change (so there are no ABI changes implied, and
anyway this is a new feature in 5.11), but different names should be
used in the two parts of the API, as is fixed in this patch by
Gabriel.
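
To make the second meaning concrete, here is a rough user-space model
of the selector check in the kernel/entry/syscall_user_dispatch.c hunk
quoted below. It is only a sketch: the enum and the function name are
made up for illustration, the fallback defines simply repeat the values
from the patch, and the "deliver SIGSYS" outcome for the BLOCK value is
my reading of the SUD documentation rather than something shown in the
hunk itself.

```c
/* Selector values as added by the patch; fallbacks for older headers. */
#ifndef PR_SYS_DISPATCH_FILTER_ALLOW
#define PR_SYS_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef PR_SYS_DISPATCH_FILTER_BLOCK
#define PR_SYS_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical names, used only in this illustration. */
enum sud_outcome {
    SUD_RUN_SYSCALL,     /* syscall proceeds as usual */
    SUD_DELIVER_SIGSYS,  /* syscall is handed to the SIGSYS handler */
    SUD_TERMINATE        /* invalid selector: the program is terminated */
};

/*
 * Model of what happens to a system call issued from outside the
 * always-allowed region, given the current selector byte.
 */
static enum sud_outcome sud_selector_outcome(unsigned char state)
{
    if (state == PR_SYS_DISPATCH_FILTER_ALLOW)
        return SUD_RUN_SYSCALL;
    if (state != PR_SYS_DISPATCH_FILTER_BLOCK)
        return SUD_TERMINATE;
    return SUD_DELIVER_SIGSYS;
}
```

Note that neither selector value says anything about whether SUD is
configured at all; that is what the PR_SYS_DISPATCH_ON/OFF argument to
prctl() controls, which is why separate names are less confusing.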

Acked-by: Michael Kerrisk 

Cheers,

Michael



> Cc: Linux API 
> Suggested-by: Michael Kerrisk (man-pages) 
> Signed-off-by: Gabriel Krisman Bertazi 
> ---
>  .../admin-guide/syscall-user-dispatch.rst  |  4 ++--
>  include/uapi/linux/prctl.h |  2 ++
>  kernel/entry/syscall_user_dispatch.c   |  4 ++--
>  .../syscall_user_dispatch/sud_benchmark.c  |  8 +---
>  .../selftests/syscall_user_dispatch/sud_test.c | 14 --
>  5 files changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst 
> b/Documentation/admin-guide/syscall-user-dispatch.rst
> index a380d6515774..fc13112e36e3 100644
> --- a/Documentation/admin-guide/syscall-user-dispatch.rst
> +++ b/Documentation/admin-guide/syscall-user-dispatch.rst
> @@ -70,8 +70,8 @@ trampoline code on the vDSO, that trampoline is never 
> intercepted.
>  [selector] is a pointer to a char-sized region in the process memory
>  region, that provides a quick way to enable disable syscall redirection
>  thread-wide, without the need to invoke the kernel directly.  selector
> -can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
> -value should terminate the program with a SIGSYS.
> +can be set to PR_SYS_DISPATCH_FILTER_ALLOW or PR_SYS_DISPATCH_FILTER_BLOCK.
> +Any other value should terminate the program with a SIGSYS.
>
>  Security Notes
>  --
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 90deb41c8a34..a66c9fe41249 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -251,5 +251,7 @@ struct prctl_mm_map {
>  #define PR_SET_SYSCALL_USER_DISPATCH   59
>  # define PR_SYS_DISPATCH_OFF   0
>  # define PR_SYS_DISPATCH_ON1
> +# define PR_SYS_DISPATCH_FILTER_ALLOW  0
> +# define PR_SYS_DISPATCH_FILTER_BLOCK  1
>
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/entry/syscall_user_dispatch.c 
> b/kernel/entry/syscall_user_dispatch.c
> index b0338a5625d9..265c33b26dcf 100644
> --- a/kernel/entry/syscall_user_dispatch.c
> +++ b/kernel/entry/syscall_user_dispatch.c
> @@ -50,10 +50,10 @@ bool syscall_user_dispatch(struct pt_regs *regs)
> if (unlikely(__get_user(state, sd->selector)))
> do_exit(SIGSEGV);
>
> -   if (likely(state == PR_SYS_DISPATCH_OFF))
> +   if (likely(state == PR_SYS_DISPATCH_FILTER_ALLOW))
> return false;
>
> -   if (state != PR_SYS_DISPATCH_ON)
> +   if (state != PR_SYS_DISPATCH_FILTER_BLOCK)
> do_exit(SIGSYS);
> }
>
> diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c 
> b/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
> index 6689f1183dbf..7617bd9ba6e1 100644
> --- a/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c

Re: [PATCH] getcpu.2: Document glibc wrapper instead of kernel syscall

2021-01-02 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 12/30/20 10:41 PM, Alejandro Colomar wrote:
> The glibc wrapper doesn't provide the third argument.
> Simplify the info about the (unused) kernel parameter
> to the minimum that is useful.
> 
> kernels <=2.6.23 are EOL since a long time ago.
> 
> The old info is commented out instead of removed.

I tend to be rather conservative about preserving historical
detail in the manual pages. Yes, 2.6.23 may be EOL from the
kernel community's point of view, but even in quite recent
times I've run into folk in the embedded world who have
to at the very least support 2.6.* systems. So, as a general
principle, I'm inclined to retain the kind of info that this
patch removes. (I admit though that this is an extreme case:
historical behavior in a system call that is not frequently
used.)

There are exceptions. Occasionally I run into historical 
info in manual pages that is clearly wrong, or incomplete.
In such cases, I am sometimes inclined to trim the details,
rather than invest the effort in working out all of the
historical details.

Clearly though, some fix is needed, since we now have 
a glibc wrapper that has just two arguments. I've applied
the patch below.
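
For anyone who invokes the system call directly rather than through
the two-argument wrapper, a minimal sketch of what that looks like,
passing NULL for the unused third argument (getcpu_raw() is a made-up
helper name, and this assumes an architecture where SYS_getcpu is
defined):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke the raw getcpu system call. The third (tcache) argument is
 * unused since Linux 2.6.24, so NULL is passed. Either pointer may
 * be NULL, in which case nothing is written through it.
 * Returns 0 on success, -1 with errno set on error.
 */
static int getcpu_raw(unsigned int *cpu, unsigned int *node)
{
    return (int) syscall(SYS_getcpu, cpu, node, NULL);
}
```

A caller interested only in the CPU number can use
getcpu_raw(&cpu, NULL).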

Cheers,

Michael

diff --git a/man2/getcpu.2 b/man2/getcpu.2
index a75123f97..59089bd74 100644
--- a/man2/getcpu.2
+++ b/man2/getcpu.2
@@ -14,10 +14,10 @@
 getcpu \- determine CPU and NUMA node on which the calling thread is running
 .SH SYNOPSIS
 .nf
-.B #include 
+.BR "#define _GNU_SOURCE" " /* See feature_test_macros(7) */"
+.B #include 
 .PP
-.BI "int getcpu(unsigned int *" cpu ", unsigned int *" node \
-", struct getcpu_cache *" tcache );
+.BI "int getcpu(unsigned int *" cpu ", unsigned int *" node );
 .fi
 .SH DESCRIPTION
 The
@@ -37,10 +37,6 @@ or
 .I node
 is NULL nothing is written to the respective pointer.
 .PP
-The third argument to this system call is nowadays unused,
-and should be specified as NULL
-unless portability to Linux 2.6.23 or earlier is required (see NOTES).
-.PP
 The information placed in
 .I cpu
 is guaranteed to be current only at the time of the call:
@@ -82,16 +78,31 @@ The intention of
 .BR getcpu ()
 is to allow programs to make optimizations with per-CPU data
 or for NUMA optimization.
+.\"
+.SS C library/kernel differences
+The kernel system call has a third argument:
+.PP
+.in +4n
+.nf
+.BI "int getcpu(unsigned int *" cpu ", unsigned int *" node ,
+.BI "   struct getcpu_cache *" tcache );
+.fi
+.in
 .PP
 The
 .I tcache
-argument is unused since Linux 2.6.24.
+argument is unused since Linux 2.6.24,
+and (when invoking the system call directly)
+should be specified as NULL,
+unless portability to Linux 2.6.23 or earlier is required.
+.PP
 .\" commit 4307d1e5ada595c87f9a4d16db16ba5edb70dcb1
 .\" Author: Ingo Molnar 
 .\" Date:   Wed Nov 7 18:37:48 2007 +0100
 .\" x86: ignore the sys_getcpu() tcache parameter
-In earlier kernels,
-if this argument was non-NULL,
+In Linux 2.6.23 and earlier, if the
+.I tcache
+argument was non-NULL,
 then it specified a pointer to a caller-allocated buffer in thread-local
 storage that was used to provide a caching mechanism for
 .BR getcpu ().

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v5] close_range.2: new page documenting close_range(2)

2020-12-22 Thread Michael Kerrisk (man-pages)
Hello Stephen,

Thank you for your revisions! I still have a few comments.

On 12/21/20 8:46 PM, Stephen Kitt wrote:
> This documents close_range(2) based on information in
> 278a5fbaed89dacd04e9d052f4594ffd0e0585de,
> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8, and
> 582f1fb6b721facf04848d2ca57f34468da1813e.
> 
> Signed-off-by: Stephen Kitt 
> ---
> V5: clarification of the open/close_range/execve sequence
> 
> V4: sort flags alphabetically
> move commit references inside the corresponding section
> more semantic newlines
> unformat numeric constants
> more formatting for function references
> escape C backslashes
> C99 loop indices
> 
> V3: fix synopsis overflow
> copy notes from membarrier.2 re the lack of wrapper
> semantic newlines
> drop non-standard "USE CASES" section heading
> add code example
> 
> V2: unsigned int to match the kernel declarations
> groff and grammar tweaks
> CLOSE_RANGE_UNSHARE unshares *and* closes
> Explain that EMFILE and ENOMEM can occur with C_R_U
> "Conforming to" phrasing
> Detailed explanation of CLOSE_RANGE_UNSHARE
> Reading /proc isn't common
> 
>  man2/close_range.2 | 267 +
>  1 file changed, 267 insertions(+)
>  create mode 100644 man2/close_range.2
> 
> diff --git a/man2/close_range.2 b/man2/close_range.2
> new file mode 100644
> index 0..0677a9bf9
> --- /dev/null
> +++ b/man2/close_range.2
> @@ -0,0 +1,267 @@
> +.\" Copyright (c) 2020 Stephen Kitt 
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +close_range \- close all file descriptors in a given range
> +.SH SYNOPSIS
> +.nf
> +.B #include 
> +.PP
> +.BI "int close_range(unsigned int " first ", unsigned int " last ,
> +.BI "unsigned int " flags );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR close_range ()
> +system call closes all open file descriptors from
> +.I first
> +to
> +.I last
> +(included).
> +.PP
> +Errors closing a given file descriptor are currently ignored.
> +.PP
> +.I flags
> +can be 0 or set to one or both of the following:

Better, I think:
"flags is a bit mask containing 0 or more of the following:"

> +.TP
> +.BR CLOSE_RANGE_CLOEXEC " (since Linux 5.10)"

s/5.10/5.11/ ?

> +sets the close-on-exec bit instead of

s/close-on-exec bit/file descriptor's close-on-exec flag/
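
To illustrate what the flag does, here is a hedged sketch that sets
the close-on-exec flag on a range of descriptors, falling back to
fcntl() if the system call is unavailable or the kernel rejects the
flag. set_cloexec_range() is a hypothetical helper, and the fallback
define just repeats the value from <linux/close_range.h> for systems
with older headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLOSE_RANGE_CLOEXEC
#define CLOSE_RANGE_CLOEXEC (1U << 2)   /* as in <linux/close_range.h> */
#endif

/* Set FD_CLOEXEC on every open descriptor in [first, last]. */
static int set_cloexec_range(unsigned int first, unsigned int last)
{
    if (first > last) {
        errno = EINVAL;
        return -1;
    }
#ifdef SYS_close_range
    if (syscall(SYS_close_range, first, last, CLOSE_RANGE_CLOEXEC) == 0)
        return 0;
#endif
    /* Fallback: walk the range, stopping at the descriptor limit. */
    long maxfd = sysconf(_SC_OPEN_MAX);
    if (maxfd == -1)
        maxfd = 16384;          /* limit is indeterminate; take a guess */
    for (unsigned long fd = first;
         fd <= last && fd < (unsigned long) maxfd; fd++) {
        int fl = fcntl((int) fd, F_GETFD);
        if (fl != -1)           /* skip descriptors that are not open */
            fcntl((int) fd, F_SETFD, fl | FD_CLOEXEC);
    }
    return 0;
}
```

Calling set_cloexec_range(3, ~0U) marks everything past stderr, so
those descriptors stay usable in the caller but are closed on execve().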

> +immediately closing the file descriptors.
> +.TP
> +.B CLOSE_RANGE_UNSHARE
> +unshares the range of file descriptors from any other processes,
> +before closing them,
> +avoiding races with other threads sharing the file descriptor table.
> +.SH RETURN VALUE
> +On success,
> +.BR close_range ()
> +returns 0.
> +On error, \-1 is returned and
> +.I errno
> +is set to indicate the cause of the error.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +.I flags
> +is not valid, or
> +.I first
> +is greater than
> +.IR last .
> +.PP
> +The following can occur with
> +.B CLOSE_RANGE_UNSHARE
> +(when constructing the new descriptor table):
> +.TP
> +.B EMFILE
> +The per-process limit on the number of open file descriptors has been reached
> +(see the description of
> +.B RLIMIT_NOFILE
> +in
> +.BR getrlimit (2)).
> +.TP
> +.B ENOMEM
> +Insufficient kernel memory was available.
> +.SH VERSIONS
> +.BR close_range ()
> +first appeared in Linux 5.9.
> +.SH CONFORMING TO
> +.BR close_range ()
> +is a nonstandard function that is also present on FreeBSD.
> +.SH NOTES
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +.SS Closing all open file descriptors
> +.\" 278a5fbaed89dacd04e9d052f4594ffd0e0585de
> +To avoid blindly closing file descriptors
> +in the range of possible file descriptors,
> +this is sometimes implemented 

man-pages-5.10 is released

2020-12-22 Thread Michael Kerrisk (man-pages)
Gidday,

For this release, Alejandro (Alex) Colomar has joined me
as a comaintainer and we are proud to announce:

man-pages-5.10 - man pages for Linux

This release resulted from patches, bug reports, reviews, and
comments from around 25 contributors. The release includes 
just over 150 commits that changed around 140 pages.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_5.10

A short summary of the release is blogged at:
https://linux-man-pages.blogspot.com/2020/12/man-pages-510-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers of LKML is shown below.

Cheers,

Michael

 Changes in man-pages-5.10 


Newly documented interfaces in existing pages
---------------------------------------------

access.2
Michael Kerrisk
Document faccessat2()
faccessat2() was added in Linux 5.8 and enables a fix to
longstanding bugs in the faccessat() wrapper function.

membarrier.2
Peter Oskolkov  [Alejandro Colomar]
Update for Linux 5.10
Linux kernel commit 2a36ab717e8fe678d98f81c14a0b124712719840
(part of 5.10 release) changed sys_membarrier prototype/parameters
and added two new commands [MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
and MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ].

mount.2
statfs.2
Ross Zwisler
Add NOSYMFOLLOW flags to mount(2) and statfs(2)


Changes to individual pages
---------------------------

cacheflush.2
Alejandro Colomar
Document Architecture-specific variants
Alejandro Colomar  [Heinrich Schuchardt]
Document __builtin___clear_cache() as a more portable alternative

clone.2
sigaltstack.2
Michael Kerrisk
clone(CLONE_VM) disables the alternate signal stack

mmap.2
Michael Kerrisk
Clarify SIGBUS text and treatment of partial page at end of a mapping

perf_event_open.2
Namhyung Kim  [Alejandro Colomar]
Update man page with recent kernel changes

sigaltstack.2
Michael Kerrisk
Clarify that the alternate signal stack is per-thread

timer_getoverrun.2
Michael Kerrisk
timer_getoverrun() now clamps the overrun count to DELAYTIMER_MAX
See https://bugzilla.kernel.org/show_bug.cgi?id=12665.

uselib.2
posix_memalign.3
profil.3
rtime.3
Michael Kerrisk
Remove some text about libc/libc5
With this change, there remain almost no vestiges of information
about the long defunct Linux libc.

signal.7
Michael Kerrisk  [Heinrich Schuchardt, Dave Martin]
Add some details on the execution of signal handlers
Add a "big picture" of what happens when a signal handler
is invoked.

tcp.7
Alejandro Colomar  [Philip Rowlands]
tcp_syncookies: It is now an integer [0, 2]
Since Linux kernel 3.12, tcp_syncookies can have the value 2,
which sends out cookies unconditionally.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2] close_range.2: new page documenting close_range(2)

2020-12-10 Thread Michael Kerrisk (man-pages)
On 12/10/20 1:24 AM, Alejandro Colomar (man-pages) wrote:
> Hi Stephen,
> 
> A few more comments below.
> 
> Michael, please have a look at them too.
> 
> Christian, do you have any program that you used to test the syscall
> that could be added as an example program to the page?
> 
> Thanks,
> 
> Alex
> 
> On 12/9/20 11:00 PM, Stephen Kitt wrote:
>> This documents close_range(2) based on information in
>> 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
>> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
>>
>> Signed-off-by: Stephen Kitt 
>> ---
>> V2: unsigned int to match the kernel declarations
>> groff and grammar tweaks
>> CLOSE_RANGE_UNSHARE unshares *and* closes
>> Explain that EMFILE and ENOMEM can occur with C_R_U
>> "Conforming to" phrasing
>> Detailed explanation of CLOSE_RANGE_UNSHARE
>> Reading /proc isn't common
>>
>>  man2/close_range.2 | 138 +
>>  1 file changed, 138 insertions(+)
>>  create mode 100644 man2/close_range.2
>>
>> diff --git a/man2/close_range.2 b/man2/close_range.2
>> new file mode 100644
>> index 0..403142b33
>> --- /dev/null
>> +++ b/man2/close_range.2

[...]

>> +.SH USE CASES
> 
> This section is unconventional.  Please move that text to one of the
> traditional sections.  I think DESCRIPTION would be the best place for this.

Actually, I'd just drop this SH line, and keep the
subsections where they are in NOTES.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] close_range.2: new page documenting close_range(2)

2020-12-10 Thread Michael Kerrisk (man-pages)
On 12/9/20 10:47 AM, Alejandro Colomar (man-pages) wrote:

>>> +descriptors in
>>> +.B /proc/self/fd/
> 
> By reading proc.5, I think this should s/.B/.I/, right mtk?
> 
>>> +and calling
>>> +.BR close (2)
>>> +on each one.
>>> +.BR close_range ()
>>> +can take care of this without requiring
>>> +.B /proc
> 
> By reading proc.5, I think this should s/.B/.I/, right mtk?

Yes to both. Pathnames are formatted with .I.


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] close_range.2: new page documenting close_range(2)

2020-12-09 Thread Michael Kerrisk (man-pages)
Hello Stephen

Thank you for writing this page! Some comments/questions below.

On Tue, 8 Dec 2020 at 22:51, Stephen Kitt  wrote:
>
> This documents close_range(2) based on information in
> 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.

(Thanks for noting these commit IDs.)

> Signed-off-by: Stephen Kitt 
> ---
>  man2/close_range.2 | 112 +
>  1 file changed, 112 insertions(+)
>  create mode 100644 man2/close_range.2
>
> diff --git a/man2/close_range.2 b/man2/close_range.2
> new file mode 100644
> index 0..62167d9b0
> --- /dev/null
> +++ b/man2/close_range.2
> @@ -0,0 +1,112 @@
> +.\" Copyright (c) 2020 Stephen Kitt 
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +close_range \- close all file descriptors in a given range
> +.SH SYNOPSIS
> +.nf
> +.B #include 
> +.PP
> +.BI "int close_range(int " first ", int " last ", unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR close_range ()
> +system call closes all open file descriptors from
> +.I first
> +to
> +.IR last
> +(included).
> +.PP
> +Errors closing a given file descriptor are currently ignored.
> +.PP
> +.I flags
> +can be set to
> +.B CLOSE_RANGE_UNSHARE
> +to unshare the range of file descriptors from any other processes,
> +.I instead
> +of closing them.

Really "instead of closing them"? I had supposed rather that this
should be "before closing them". That's also how the kernel code reads
to me, from a quick glance.

> +.SH RETURN VALUE
> +On success,
> +.BR close_range ()
> +return 0.

s/return/returns/

> +On error, \-1 is returned and
> +.I errno
> +is set to indicate the cause of the error.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +.I flags
> +is not valid, or
> +.I first
> +is greater than
> +.IR last .
> +.TP
> +.B EMFILE
> +The per-process limit on the number of open file descriptors has been reached
> +(see the description of
> +.BR RLIMIT_NOFILE
> +in
> +.BR getrlimit (2)).

Given that we are simply closing FDs, how can EMFILE occur?

> +.TP
> +.B ENOMEM
> +Insufficient kernel memory was available.
> +.SH VERSIONS
> +.BR close_range ()
> +first appeared in Linux 5.9.
> +.SH CONFORMING TO
> +.BR close_range ()
> +is available on Linux and FreeBSD.

Here, I think it would be better to write:

close_range()
is a nonstandard function that is also present on FreeBSD.

> +.SH NOTES
> +Currently, there is no glibc wrapper for this system call; call it using
> +.BR syscall (2).
> +.SH USE CASES
> +.\" 278a5fbaed89dacd04e9d052f4594ffd0e0585de
> +.\" 60997c3d45d9a67daf01c56d805ae4fec37e0bd8
> +.SS Closing file descriptors before exec
> +File descriptors can be closed safely using
> +.PP
> +.in +4n
> +.EX
> +/* we don't want anything past stderr here */
> +close_range(3, ~0U, CLOSE_RANGE_UNSHARE);
> +execve();
> +.EE
> +.in
> +.PP

.PP is not necessary before a new subsection (.SS).

> +.SS Closing all open file descriptors
> +This is commonly implemented (on Linux) by listing open file

Is it really true that this is common? I suspect not. It's slow, and
relies on /proc being present. I would have thought that more common
is something like:

    int maxfd = sysconf(_SC_OPEN_MAX);
    if (maxfd == -1)        /* Limit is indeterminate... */
        maxfd = 16384;      /* so take a guess */

    for (int fd = 0; fd < maxfd; fd++)
        close(fd);

I think it's fine to mention the use of /proc as an (inferior)
alternative way of doing this. I'm just not sure that "commonly" is
correct.
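
For comparison, a rough sketch of how a program might combine the two
approaches: try close_range() via syscall(2) and fall back to the loop
above when the call is not available. close_fd_range() is a made-up
name, and treating EPERM like ENOSYS is an assumption on my part (some
seccomp sandboxes reject unknown system calls that way).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Close every open descriptor in [first, last]; 0 on success, else -1. */
static int close_fd_range(unsigned int first, unsigned int last)
{
    if (first > last) {
        errno = EINVAL;
        return -1;
    }
#ifdef SYS_close_range
    if (syscall(SYS_close_range, first, last, 0U) == 0)
        return 0;
    if (errno != ENOSYS && errno != EPERM)
        return -1;
#endif
    /* Fallback: close each descriptor up to the per-process limit. */
    long maxfd = sysconf(_SC_OPEN_MAX);
    if (maxfd == -1)            /* Limit is indeterminate... */
        maxfd = 16384;          /* so take a guess */
    if ((unsigned long) first >= (unsigned long) maxfd)
        return 0;               /* nothing in range can be open */
    if ((unsigned long) last >= (unsigned long) maxfd)
        last = (unsigned int) (maxfd - 1);
    for (unsigned int fd = first; fd <= last; fd++)
        close(fd);              /* errors are ignored, as in the syscall */
    return 0;
}
```

With this, close_fd_range(3, ~0U) closes everything past stderr in one
system call on kernels that support it, and degrades to the slower loop
elsewhere.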

> +descriptors in
> +.B /proc/self/fd/
> +and calling
> +.BR close (2)
> +on each one.
> +.BR close_range ()
> +can take care of this without requiring
> +.B /proc
> +and with a single system call, which provides significant performance
> 

Linux man-pages maintainership adjustments

2020-12-05 Thread Michael Kerrisk (man-pages)
Gidday,

Anyone following linux-man@ in the last few months will
have noticed that Alejandro (Alex) Colomar has become
rather active in the project. Alex has kindly volunteered
to take up some of the work of maintaining the project.
In practice, that means he will be reviewing and merging
some of the patches that land on linux-man@ and I'll be
taking those changes from him to then push to
git.kernel.org.

After 16 years as maintainer, I'm very happy that Alex
has come along to help out. And to be clear, I'm not
planning to step away from the project any time soon,
but maybe one day I will return to being just a
contributor and no longer the maintainer.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] subpage_prot.2: SYNOPSIS: Fix return type: s/long/int/

2020-11-28 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 11/28/20 12:44 AM, Alejandro Colomar wrote:
> The Linux kernel uses 'int' instead of 'long' for the return type.
> As glibc provides no wrapper, use the same type the kernel uses.

Thanks. Patch applied.

Cheers,

Michael

> ..
> 
> $ grep -n wrapper man-pages/man2/subpage_prot.2
> 40:There is no glibc wrapper for this system call; see NOTES.
> 99:Glibc does not provide a wrapper for this system call; call it using
> 
> $ grep -rn SYSCALL_DEFINE.*subpage_prot linux/;
> linux/arch/powerpc/mm/book3s64/subpage_prot.c:190:
> SYSCALL_DEFINE3(subpage_prot, unsigned long, addr,
> 
> $ sed -n /SYSCALL.*subpage_prot/,/^}/p \
>   linux/arch/powerpc/mm/book3s64/subpage_prot.c \
>   |grep return;
>   return -ENOENT;
>   return -EINVAL;
>   return -EINVAL;
>   return 0;
>   return -EFAULT;
>   return -EFAULT;
>   return err;
> 
> $ sed -n /SYSCALL.*subpage_prot/,/^}/p \
>   linux/arch/powerpc/mm/book3s64/subpage_prot.c \
>   |grep '\<err\>';
>   int err;
>   err = -ENOMEM;
>   err = -ENOMEM;
>   err = 0;
>   return err;
> 
> Signed-off-by: Alejandro Colomar 
> ---
>  man2/subpage_prot.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man2/subpage_prot.2 b/man2/subpage_prot.2
> index b38ba718f..d6f016665 100644
> --- a/man2/subpage_prot.2
> +++ b/man2/subpage_prot.2
> @@ -32,7 +32,7 @@
>  subpage_prot \- define a subpage protection for an address range
>  .SH SYNOPSIS
>  .nf
> -.BI "long subpage_prot(unsigned long " addr ", unsigned long " len ,
> +.BI "int subpage_prot(unsigned long " addr ", unsigned long " len ,
>  .BI "  uint32_t *" map );
>  .fi
>  .PP
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] spu_create.2: Clarify that one of the prototypes is the current one

2020-11-27 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 11/26/20 7:32 PM, Alejandro Colomar wrote:
> The current Linux kernel only provides a definition of 'spu_create()'.
> It has 4 parameters, the last being 'int neighbor_fd'.
> 
> Before Linux 2.6.23, there was an older prototype,
> which didn't have this last parameter.
> 
> Move that old prototype to VERSIONS,
> and keep the current one in SYNOPSIS.
> 
> ..
> 
> $ grep -rn "SYSCALL_DEFINE.(spu_create"
> arch/powerpc/platforms/cell/spu_syscalls.c:56:
> SYSCALL_DEFINE4(spu_create, const char __user *, name, unsigned int, flags,
> 
> $ sed -n 56,/^}/p arch/powerpc/platforms/cell/spu_syscalls.c
> SYSCALL_DEFINE4(spu_create, const char __user *, name, unsigned int, flags,
>   umode_t, mode, int, neighbor_fd)
> {
>   long ret;
>   struct spufs_calls *calls;
> 
>   calls = spufs_calls_get();
>   if (!calls)
>   return -ENOSYS;
> 
>   if (flags & SPU_CREATE_AFFINITY_SPU) {
>   struct fd neighbor = fdget(neighbor_fd);
>   ret = -EBADF;
>   if (neighbor.file) {
>   ret = calls->create_thread(name, flags, mode, 
> neighbor.file);
>   fdput(neighbor);
>   }
>   } else
>   ret = calls->create_thread(name, flags, mode, NULL);
> 
>   spufs_calls_put(calls);
>   return ret;
> }
> 
> $ git blame arch/powerpc/platforms/cell/spu_syscalls.c -L 56,/\)/
> 1bc94226d5c64 (Al Viro 2011-07-26 16:50:23 -0400 56)
> SYSCALL_DEFINE4(spu_create, const char __user *, name, unsigned int, flags,
> 1bc94226d5c64 (Al Viro 2011-07-26 16:50:23 -0400 57)
>umode_t, mode, int, neighbor_fd)
> 
> $ git checkout 1bc94226d5c64~1
> $ git blame arch/powerpc/platforms/cell/spu_syscalls.c -L /spu_create/,/\)/
> 67207b9664a8d (Arnd Bergmann 2005-11-15 15:53:48 -0500 68)
> asmlinkage long sys_spu_create(const char __user *name,
> 8e68e2f248332 (Arnd Bergmann 2007-07-20 21:39:47 +0200 69)
>  unsigned int flags, mode_t mode, int neighbor_fd)
> 
> $ git checkout 8e68e2f248332~1
> $ git blame arch/powerpc/platforms/cell/spu_syscalls.c -L /spu_create/,/\)/
> 67207b9664a8d (Arnd Bergmann 2005-11-15 15:53:48 -0500 36)
> asmlinkage long sys_spu_create(const char __user *name,
> 67207b9664a8d (Arnd Bergmann 2005-11-15 15:53:48 -0500 37)
>  unsigned int flags, mode_t mode)
> 
> $ git describe --contains 8e68e2f248332
> v2.6.23-rc1~195^2~7
> 
> Signed-off-by: Alejandro Colomar 
> ---
>  man2/spu_create.2 | 16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/man2/spu_create.2 b/man2/spu_create.2
> index 4e6f5d730..3eeafee56 100644
> --- a/man2/spu_create.2
> +++ b/man2/spu_create.2
> @@ -30,9 +30,8 @@ spu_create \- create a new spu context
>  .B #include 
>  .B #include 
>  .PP
> -.BI "int spu_create(const char *" pathname ", int " flags ", mode_t " mode 
> ");"
> -.BI "int spu_create(const char *" pathname ", int " flags ", mode_t " mode 
> ","
> -.BI "   int " neighbor_fd ");"
> +.BI "int spu_create(const char *" pathname ", int " flags ", mode_t " mode ,
> +.BI "   int " neighbor_fd );
>  .fi
>  .PP
>  .IR Note :
> @@ -247,6 +246,17 @@ By convention, it gets mounted in
>  The
>  .BR spu_create ()
>  system call was added to Linux in kernel 2.6.16.
> +.PP
> +.\" commit 8e68e2f248332a9c3fd4f08258f488c209bd3e0c
> +Before Linux 2.6.23, the prototype for
> +.BR spu_create ()
> +was:
> +.PP
> +.in +4n
> +.EX
> +.BI "int spu_create(const char *" pathname ", int " flags ", mode_t " mode );
> +.EE
> +.in
>  .SH CONFORMING TO
>  This call is Linux-specific and implemented only on the PowerPC
>  architecture.

Thanks for the detailed research. The page was indeed a bit messy
in explaining some details. I've instead opted for a different change;
see below.

Thanks,

Michael

diff --git a/man2/spu_create.2 b/man2/spu_create.2
index 92f5fc304..f09d498ed 100644
--- a/man2/spu_create.2
+++ b/man2/spu_create.2
@@ -30,7 +30,6 @@ spu_create \- create a new spu context
 .B #include 
 .B #include 
 .PP
-.BI "int spu_create(const char *" pathname ", int " flags ", mode_t " mode ");"
 .BI "int spu_create(const char *" pathname ", int " flags ", mode_t " mode ","
 .BI "   int " neighbor_fd ");"
 .fi
@@ -89,6 +88,12 @@ for a full list of the possible
 values.
 .PP
 The
+.I neighbor_fd
+is used only when the
+.B SPU_CREATE_AFFINITY_SPU
+flag is specified; see below.
+.PP
+The
 .I flags
 argument can be zero or any bitwise OR-ed
 combination of the following constants:
@@ -264,6 +269,14 @@ See
 .UR http://www.bsc.es\:/projects\:/deepcomputing\:/linuxoncell/
 .UE
 for the recommended libraries.
+.PP
+Prior to the addition of the
+.B SPU_CREATE_AFFINITY_SPU
+flag in Linux 2.6.23, the
+.BR spu_create ()
+system call took only three arguments (i.e., there was no
+.I neighbor_fd
+argument).
 .SH EXAMPLES
 See
 .BR spu_run (2)


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System 

Re: [PATCH v2 2/4] x86/elf: Support a new ELF aux vector AT_MINSIGSTKSZ

2020-11-27 Thread Michael Kerrisk (man-pages)
Hey Dave Martin,

On 11/26/20 6:44 PM, Borislav Petkov wrote:
> On Thu, Nov 19, 2020 at 11:02:35AM -0800, Chang S. Bae wrote:
>> Historically, signal.h defines MINSIGSTKSZ (2KB) and SIGSTKSZ (8KB), for
>> use by all architectures with sigaltstack(2). Over time, the hardware state
>> size grew, but these constants did not evolve. Today, literal use of these
>> constants on several architectures may result in signal stack overflow, and
>> thus user data corruption.
>>
>> A few years ago, the ARM team addressed this issue by establishing
>> getauxval(AT_MINSIGSTKSZ), such that the kernel can supply at runtime value
>> that is an appropriate replacement on the current and future hardware.
>>
>> Add getauxval(AT_MINSIGSTKSZ) support to x86, analogous to the support
>> added for ARM in commit 94b07c1f8c39 ("arm64: signal: Report signal frame
>> size to userspace via auxv").
> 
> I don't see it documented here:
> 
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man3/getauxval.3
> 
> Dunno, now that two architectures will have it, maybe that is good
> enough reason to document it.
> 
> Adding Michael.

Commit 94b07c1f8c39 was yours, Dave. Might I convince you to write a
patch for getauxval(3)?

Thanks,


Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: set_thread_area.2: csky architecture undocumented

2020-11-24 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 11/23/20 10:31 PM, Alejandro Colomar (man-pages) wrote:
> Hi Michael,
> 
> SYNOPSIS
>#include 
> 
>#if defined __i386__ || defined __x86_64__
># include 
> 
>int get_thread_area(struct user_desc *u_info);
>int set_thread_area(struct user_desc *u_info);
> 
>#elif defined __m68k__
> 
>int get_thread_area(void);
>int set_thread_area(unsigned long tp);
> 
>#elif defined __mips__
> 
>int set_thread_area(unsigned long addr);
> 
>#endif
> 
>Note: There are no glibc wrappers for these system  calls;  see
>NOTES.
> 
> 
> $ grep -rn 'SYSCALL_DEFINE.*et_thread_area'
> arch/csky/kernel/syscall.c:6:
> SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> arch/mips/kernel/syscall.c:86:
> SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> arch/x86/kernel/tls.c:191:
> SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, u_info)
> arch/x86/kernel/tls.c:243:
> SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, u_info)
> arch/x86/um/tls_32.c:277:
> SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, user_desc)
> arch/x86/um/tls_32.c:325:
> SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, user_desc)
> 
> 
> See kernel commit 4859bfca11c7d63d55175bcd85a75d6cee4b7184
> 
> 
> I'd change
> -  #elif defined __mips__
> +  #elif defined(__mips__ || __csky__)
> 
> and then change the rest of the text to add csky when appropriate.
> Am I correct?

AFAICT, you are correct. I think the reason that csky is missing is
that the architecture was added after this manual page was added.
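
One detail worth spelling out: the preprocessor's 'defined' operator
takes a single identifier, so the combined test needs to be written as
two 'defined' terms joined by ||. A small sketch of the conditional
structure follows; the function name and the returned strings are
purely illustrative and are not proposed page text.

```c
#include <stddef.h>

/* Illustrative only: which set_thread_area() prototype from the
   SYNOPSIS applies to the architecture we are compiled for. */
static const char *set_thread_area_variant(void)
{
#if defined __i386__ || defined __x86_64__
    return "int set_thread_area(struct user_desc *u_info);";
#elif defined __m68k__
    return "int set_thread_area(unsigned long tp);";
#elif defined __mips__ || defined __csky__
    return "int set_thread_area(unsigned long addr);";
#else
    return NULL;    /* not covered by the SYNOPSIS discussed here */
#endif
}
```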

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] set_tid_address.2: SYNOPSIS: Fix set_tid_address() return type

2020-11-24 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 11/23/20 10:59 PM, Alejandro Colomar wrote:
> The Linux kernel uses 'pid_t' instead of 'long' for the return type.
> As glibc provides no wrapper, use the same types the kernel uses.
> 
> $ sed -n 34,36p man-pages/man2/set_tid_address.2
> .PP
> .IR Note :
> There is no glibc wrapper for this system call; see NOTES.
> 
> $ grep -rn 'SYSCALL_DEFINE.*set_tid_address' linux/
> linux/kernel/fork.c:1632:
> SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
> 
> $ sed -n 1632,1638p linux/kernel/fork.c
> SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
> {
>   current->clear_child_tid = tidptr;
> 
>   return task_pid_vnr(current);
> }
> 
> $ grep -rn 'task_pid_vnr(struct' linux/
> linux/include/linux/sched.h:1374:
> static inline pid_t task_pid_vnr(struct task_struct *tsk)
>
> Signed-off-by: Alejandro Colomar 

Thanks! Patch applied.
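
Since there is no glibc wrapper, callers go through syscall(2). A
minimal sketch of that, with a hypothetical helper name, showing the
pid_t return value (the caller's thread ID):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* The kernel may write to this location when the thread exits,
   so it must remain valid for the lifetime of the thread. */
static int tid_slot;

/* Hypothetical wrapper around the raw system call. */
static pid_t call_set_tid_address(int *tidptr)
{
    assert(tidptr != NULL);
    return (pid_t) syscall(SYS_set_tid_address, tidptr);
}
```

Note that replacing the pointer in a thread created by the C library's
threading implementation is likely to confuse that implementation, so
this is for illustration rather than general use.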

Cheers,

Michael

> ---
>  man2/set_tid_address.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man2/set_tid_address.2 b/man2/set_tid_address.2
> index 380efcdd8..b18b8efef 100644
> --- a/man2/set_tid_address.2
> +++ b/man2/set_tid_address.2
> @@ -29,7 +29,7 @@ set_tid_address \- set pointer to thread ID
>  .nf
>  .B #include 
>  .PP
> -.BI "long set_tid_address(int *" tidptr );
> +.BI "pid_t set_tid_address(int *" tidptr );
>  .fi
>  .PP
>  .IR Note :
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] restart_syscall.2: SYNOPSIS: Fix restart_syscall() return type

2020-11-23 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 11/23/20 9:34 PM, Alejandro Colomar wrote:
> The Linux kernel uses 'long' instead of 'int' for the return type.
> As glibc provides no wrapper, use the same types the kernel uses.
> 
> $ grep -rn 'SYSCALL_DEFINE.*(restart_syscall'
> kernel/signal.c:2891:SYSCALL_DEFINE0(restart_syscall)
> 
> $ sed -n 2891,2895p kernel/signal.c
> SYSCALL_DEFINE0(restart_syscall)
> {
>   struct restart_block *restart = &current->restart_block;
>   return restart->fn(restart);
> }
> 
> $ grep -rn 'struct restart_block {'
> include/linux/restart_block.h:25:struct restart_block {
> 
> $ sed -n 25,56p include/linux/restart_block.h
> struct restart_block {
>   long (*fn)(struct restart_block *);
>   union {
>   /* For futex_wait and futex_wait_requeue_pi */
>   struct {
>   u32 __user *uaddr;
>   u32 val;
>   u32 flags;
>   u32 bitset;
>   u64 time;
>   u32 __user *uaddr2;
>   } futex;
>   /* For nanosleep */
>   struct {
>   clockid_t clockid;
>   enum timespec_type type;
>   union {
>   struct __kernel_timespec __user *rmtp;
>   struct old_timespec32 __user *compat_rmtp;
>   };
>   u64 expires;
>   } nanosleep;
>   /* For poll */
>   struct {
>   struct pollfd __user *ufds;
>   int nfds;
>   int has_timeout;
>   unsigned long tv_sec;
>   unsigned long tv_nsec;
>   } poll;
>   };
> };
> 
> Signed-off-by: Alejandro Colomar 

Thanks! Patch applied.

Cheers,

Michael

> ---
>  man2/restart_syscall.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man2/restart_syscall.2 b/man2/restart_syscall.2
> index e7d96bd4d..21cc2df1d 100644
> --- a/man2/restart_syscall.2
> +++ b/man2/restart_syscall.2
> @@ -34,7 +34,7 @@
>  .SH NAME
>  restart_syscall \- restart a system call after interruption by a stop signal
>  .SH SYNOPSIS
> -.B int restart_syscall(void);
> +.B long restart_syscall(void);
>  .PP
>  .IR Note :
>  There is no glibc wrapper for this system call; see NOTES.
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: set_thread_area.2: csky architecture undocumented

2020-11-23 Thread Michael Kerrisk (man-pages)
Hello Alex,

On Mon, 23 Nov 2020 at 22:31, Alejandro Colomar (man-pages)
 wrote:
>
> Hi Michael,
>
> SYNOPSIS
>#include 
>
>#if defined __i386__ || defined __x86_64__
># include 
>
>int get_thread_area(struct user_desc *u_info);
>int set_thread_area(struct user_desc *u_info);
>
>#elif defined __m68k__
>
>int get_thread_area(void);
>int set_thread_area(unsigned long tp);
>
>#elif defined __mips__
>
>int set_thread_area(unsigned long addr);
>
>#endif
>
>Note: There are no glibc wrappers for these system  calls;  see
>NOTES.
>
>
> $ grep -rn 'SYSCALL_DEFINE.*et_thread_area'
> arch/csky/kernel/syscall.c:6:
> SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> arch/mips/kernel/syscall.c:86:
> SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> arch/x86/kernel/tls.c:191:
> SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, u_info)
> arch/x86/kernel/tls.c:243:
> SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, u_info)
> arch/x86/um/tls_32.c:277:
> SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, user_desc)
> arch/x86/um/tls_32.c:325:
> SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, user_desc)
>
>
> See kernel commit 4859bfca11c7d63d55175bcd85a75d6cee4b7184
>
>
> I'd change
> -  #elif defined __mips__
> +  #elif defined(__mips__ || __csky__)
>
> and then change the rest of the text to add csky when appropriate.
> Am I correct?

As far as I can tell, you are correct.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] lseek.2: SYNOPSIS: Use correct types

2020-11-22 Thread Michael Kerrisk (man-pages)
Hi Alex,

On Sat, 21 Nov 2020 at 18:45, Alejandro Colomar (man-pages)
 wrote:
>
> Hi Michael,
>
> I'm a bit lost in all the *lseek* pages.
>
> You had a good read some months ago, so you may know it better.
> I don't know which of those functions come from the kernel,
> and which come from glibc (if any).

It always takes me too long to remind myself of the details here :-(.

This time, I'll try to write what I (re)learned.

Inside the kernel (5.9 sources), in fs/read_write.c, we have:

[[
SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
{
    return ksys_lseek(fd, offset, whence);
}

#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE3(lseek, unsigned int, fd, compat_off_t, offset,
        unsigned int, whence)
{
    return ksys_lseek(fd, offset, whence);
}
#endif

#if !defined(CONFIG_64BIT) || defined(CONFIG_COMPAT) || \
    defined(__ARCH_WANT_SYS_LLSEEK)
SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
        unsigned long, offset_low, loff_t __user *, result,
        unsigned int, whence)
{
    ...
}
#endif
]]

The main pieces of interest here are the first and last
SYSCALL_DEFINEn. The first is the "standard" lseek() system call that
exists on 64-bit and 32-bit architectures.

The problem on 32-bit architectures is that the off_t type is a signed
32-bit type, so it can't represent offsets beyond 2**31-1, but files
can be bigger than 2GB. That's why 32-bit kernels also provide the
llseek() system call. It receives the new offset in two 32-bit pieces
(offset_high, offset_low), and returns the new offset via a 64-bit
loff_t argument (result). (I forget the reason why there are 32-bit
and 64-bit "offset" args in the syscall.)
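
To make the two-piece offset concrete, here is a small sketch of the
arithmetic involved. The helper names are hypothetical; they just show
how a 64-bit offset maps onto the (offset_high, offset_low) pair and
back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 64-bit file offset into the two 32-bit halves. */
static void split_offset(int64_t offset, uint32_t *high, uint32_t *low)
{
    assert(high != NULL && low != NULL);
    *high = (uint32_t) ((uint64_t) offset >> 32);
    *low = (uint32_t) ((uint64_t) offset & 0xffffffffu);
}

/* Recombine the halves into a single 64-bit offset. */
static int64_t join_offset(uint32_t high, uint32_t low)
{
    return (int64_t) (((uint64_t) high << 32) | low);
}
```

For example, an offset of 5 GiB (5 * 2**30) splits into a high half of
1 and a low half of 0x40000000.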

One more thing... In arch/x86/entry/syscalls/syscall_32.tbl,
we see the following line:

[[
140 i386_llseek sys_llseek
]]

This is essentially telling us that 'sys_llseek' (the name generated
by SYSCALL_DEFINE5(llseek...)) is exposed to user-space as system call
number 140, and that system call number will (IIUC) be exposed in
autogenerated headers with the name "__NR__llseek" (i.e., "_llseek").
The "i386" is telling us that this happens in i386 (32-bit Intel).
There is nothing equivalent on x86-64, because 64-bit systems don't
need an _llseek system call.

Now, in ancient times (let's say Linux 2.2), there was a more
transparent situation (but the effect was the same):

#define __NR__llseek 140

and that system call number was tied to the implementation by this
definition in linux-2.2.26/arch/i386/kernel/entry.S:

.long SYMBOL_NAME(sys_llseek)   /* 140 */

==

lseek64() is a C library function.  It takes and returns a 64-bit
offset. It exists to support seeking in large (>2GB) files. Its
implementation is in the glibc source file
sysdeps/unix/sysv/linux/lseek64.c, where it calls _llseek(2).

Returning to the <unistd.h> header file, we have:

[[
#ifndef __USE_FILE_OFFSET64
extern __off_t lseek (int __fd, __off_t __offset, int __whence) __THROW;
#else
# ifdef __REDIRECT_NTH
extern __off64_t __REDIRECT_NTH (lseek,
 (int __fd, __off64_t __offset, int __whence),
 lseek64);
# else
#  define lseek lseek64
# endif
#endif
#ifdef __USE_LARGEFILE64
extern __off64_t lseek64 (int __fd, __off64_t __offset, int __whence)
 __THROW;
#endif
]]

The name "lseek64" is exposed if _LARGEFILE64_SOURCE (which triggers
__USE_LARGEFILE64) is defined. That name was part of the so-called
Transitional Large File Systems (LFS) API (see page 105 in my book),
which existed to support the use of 64-bit file offsets on 32-bit
systems. It provided a set of interfaces with names of the form
"x64()" (e.g., "lseek64") which provided for 64-bit offsets;
those names coexisted with the traditional 32-bit APIs (e.g.,
"lseek").

Alternatively, the LFS specified a macro, _FILE_OFFSET_BITS=64 (which
triggers __USE_FILE_OFFSET64) as another way of exposing 64-bit-offset
functionality on 32-bit systems. In this case, the traditional API
names (e.g., "lseek") are redirected to the 64-bit implementations
(e.g., "lseek64").
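
A small sketch of the second approach, for illustration: with
_FILE_OFFSET_BITS defined as 64 before any header is included, plain
lseek() works with a 64-bit off_t. The helper name seek_past_2gib() and
the use of tmpfile(3) are my own choices for the example.

```c
#define _FILE_OFFSET_BITS 64    /* must precede every #include */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Seek beyond the 2GB mark in a temporary file using plain lseek().
   Returns the resulting offset, or -1 on error. */
static off_t seek_past_2gib(void)
{
    assert(sizeof(off_t) == 8);     /* lseek now takes 64-bit offsets */

    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;

    off_t target = ((off_t) 1 << 31) + 4096;
    off_t pos = lseek(fileno(fp), target, SEEK_SET);

    fclose(fp);
    return pos;
}
```

On a 64-bit system the macro makes no difference, since off_t is
already 64 bits wide there.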

> In the kernel I only found the lseek, llseek, and 32_llseek

I'd ignore 32_llseek -- I guess that's an arch-specific equivalent of
_llseek/llseek.

> (as you can see in the patch).
> So if any other prototype needs to be updated, please do so.
> Especially, have a look at lseek64(3),
> which I suspect needs the same changes I propose in that patch.

I think that no changes to the types are needed in lseek64(3). But
maybe some of the info in this mail should be captured in that manual
page.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] lseek.2: SYNOPSIS: Use correct types

2020-11-22 Thread Michael Kerrisk (man-pages)
[Adding libc-alpha@ here, so someone might correct me if I make a misstep]

Hello Alex,

On Sat, 21 Nov 2020 at 18:34, Alejandro Colomar  wrote:
>
> The Linux kernel uses 'unsigned int' instead of 'int'
> for 'fd' and 'whence'.
> As glibc provides no wrapper, use the same types the kernel uses.

I see Florian already replied, but just to add a detail or two...

In general, the manual pages explicitly note the APIs that have no
glibc wrapper. (If not, that's a bug in the page, but I don't expect
there are many such bugs.)

Looking in <unistd.h>, we have:

[[
#ifndef __USE_FILE_OFFSET64
extern __off_t lseek (int __fd, __off_t __offset, int __whence) __THROW;
#else
# ifdef __REDIRECT_NTH
extern __off64_t __REDIRECT_NTH (lseek,
 (int __fd, __off64_t __offset, int __whence),
 lseek64);
# else
#  define lseek lseek64
# endif
#endif
#ifdef __USE_LARGEFILE64
extern __off64_t lseek64 (int __fd, __off64_t __offset, int __whence)
 __THROW;
#endif
]]

It looks to me like there's a prototype hiding in there. (And yes, I
don't find it so funny to decode the macro logic either.)

Thanks,

Michael

PS By the way, be aware that the code of many wrapper functions is
autogenerated from "syscalls.list" files in the glibc source, for
example, sysdeps/unix/sysv/linux/syscalls.list. This isn't the case
for lseek(), though, as far as I can see; I think the wrapper function
is defined in sysdeps/unix/sysv/linux/lseek.c.



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Michael Kerrisk (man-pages)
Hello Sargun,

Thanks for your reply!

On 11/2/20 9:07 AM, Sargun Dhillon wrote:
> On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
>  wrote:
>>
>> Hello Sargun,
>>
>> Thanks for your reply.
>>
>> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
>>> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
>>> wrote:
>>
>> [...]
>>
>>>>> I think I commented in another thread somewhere that the
>>>>> supervisor is not notified if the syscall is preempted. Therefore
>>>>> if it is performing a preemptible, long-running syscall, you need
>>>>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
>>>>> you can end up in a bad situation -- like leaking resources, or
>>>>> holding on to file descriptors after the program under
>>>>> supervision has intended to release them.
>>>>
>>>> It's been a long day, and I'm not sure I really understand this.
>>>> Could you outline the scenario in more detail?
>>>>
>>> S: Sets up filter + interception for accept
>>> T: socket(AF_INET, SOCK_STREAM, 0) = 7
>>> T: bind(7, {127.0.0.1, }, ..)
>>> T: listen(7, 10)
>>> T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>>
>> Presumably, the preceding line should have been:
>>
>> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>> (s/T:/S:/)
>>
>> right?
> 
> Right.
>>
>>
>>> T: accept(7, ...)
>>> S: Intercepts accept
>>> S: Does accept in background
>>> T: Receives signal, and accept(...) responds in EINTR
>>> T: close(7)
>>> S: Still running accept(7, ), holding port , so if now T
>>> retries to bind to port , things fail.
>>
>> Okay -- I understand. Presumably the solution here is not to
>> block in accept(), but rather to use poll() to monitor both the
>> notification FD and the listening socket FD?
>>
> You need to have some kind of mechanism to periodically check
> if the notification is still alive, and preempt the accept. It doesn't
> matter how exactly you "background" the accept (threads, or
> O_NONBLOCK + epoll).
> 
> The thing is you need to make sure that when the process
> cancels a syscall, you need to release the resources you
> may have acquired on its behalf or bad things can happen.
> 

Got it. I added the following text:

   Caveats regarding blocking system calls
   Suppose that the target performs a blocking system call (e.g.,
   accept(2)) that the supervisor should handle.  The supervisor
   might then in turn execute the same blocking system call.

   In this scenario, it is important to note that if the target's
   system call is now interrupted by a signal, the supervisor is not
   informed of this.  If the supervisor does not take suitable steps
   to actively discover that the target's system call has been
   canceled, various difficulties can occur.  Taking the example of
   accept(2), the supervisor might remain blocked in its accept(2)
   holding a port number that the target (which, after the
   interruption by the signal handler, perhaps closed  its listening
   socket) might expect to be able to reuse in a bind(2) call.

   Therefore, when the supervisor wishes to emulate a blocking system
   call, it must do so in such a way that it gets informed if the
   target's system call is interrupted by a signal handler.  For
   example, if the supervisor itself executes the same blocking
   system call, then it could employ a separate thread that uses the
   SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
   still blocked in its system call.  Alternatively, in the accept(2)
   example, the supervisor might use poll(2) to monitor both the
   notification file descriptor (so as to discover when the
   target's accept(2) call has been interrupted) and the listening
   file descriptor (so as to know when a connection is available).

   If the target's system call is interrupted, the supervisor must
   take care to release resources (e.g., file descriptors) that it
   acquired on behalf of the target.

Does that seem okay?
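
For concreteness, here is a rough sketch of one way a supervisor might
follow that advice. It is a single-threaded variation rather than the
separate-thread approach: it alternates a SECCOMP_IOCTL_NOTIF_ID_VALID
check with a short poll(2) on the listening socket. The function names
and the polling interval parameter are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Returns 1 if the target is still blocked in the system call
   identified by 'id', 0 if it is not, and -1 on any other error. */
static int notif_id_valid(int notify_fd, __u64 id)
{
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0)
        return 1;
    return (errno == ENOENT) ? 0 : -1;
}

/* Wait until listen_fd reports an event or the target's system call
   is no longer pending. Returns 1 if listen_fd is ready, 0 if the
   target's call was canceled, and -1 on error. */
static int wait_for_connection(int notify_fd, __u64 id, int listen_fd,
                               int interval_ms)
{
    struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

    assert(interval_ms > 0);
    for (;;) {
        int valid = notif_id_valid(notify_fd, id);
        if (valid != 1)
            return valid;       /* canceled (0) or error (-1) */

        int ready = poll(&pfd, 1, interval_ms);
        if (ready > 0)
            return 1;           /* caller can now try accept(2) */
        if (ready == -1 && errno != EINTR)
            return -1;
    }
}
```

On a return of 0, the supervisor would release whatever it acquired on
the target's behalf (for example, its duplicate of the listening
socket) instead of calling accept(2).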

>>>>> ENOENT The cookie number is not valid. This can happen if a
>>>>> response has already been sent, or if the syscall was
>>>>> interrupted
>>>>>
>>>>> EBADF If the file descriptor specified in srcfd is invalid, or if
>>>>> the fd is out of range of the destination program.
>>>>
>>>> The piece "or if the fd is out of range of the destination program"
>>>>

man-pages-5.09 is released

2020-11-01 Thread Michael Kerrisk (man-pages)
Gidday,

The Linux man-pages maintainer proudly announces:

man-pages-5.09 - man pages for Linux

This release resulted from patches, bug reports, reviews, and
comments from more than 40 people, with just over 500 commits making
changes to nearly 600 pages. Nine new pages have been added (six
of these are the result of splitting the rather unwieldy queue(3)
page into a number of small pieces). Special shout out to
Alejandro Colomar, who provided more than half (265!) of the commits.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_5.09

A short summary of the release is blogged at:
https://linux-man-pages.blogspot.com/2020/11/man-pages-509-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers of LKML is shown below.

Cheers,

Michael


 Changes in man-pages-5.09 

New and rewritten pages
---

system_data_types.7
Alejandro Colomar, Michael Kerrisk
A new page documenting a wide range of system data types.

kernel_lockdown.7
David Howells, Heinrich Schuchardt  [Michael Kerrisk]
New page documenting the Kernel Lockdown feature


Newly documented interfaces in existing pages
-

fanotify_init.2
fanotify.7
Amir Goldstein  [Jan Kara, Matthew Bobrowski]
Document FAN_REPORT_DIR_FID

fanotify_init.2
fanotify.7
Amir Goldstein  [Jan Kara, Matthew Bobrowski]
Document FAN_REPORT_NAME

statx.2
Ira Weiny
Add STATX_ATTR_DAX

strerror.3
Michael Kerrisk
Document strerrorname_np() and strerrordesc_np()

strsignal.3
Michael Kerrisk
Document sigabbrev_np() and sigdescr_np().

loop.4
Yang Xu
Document LOOP_CONFIGURE ioctl
Yang Xu
Document LO_FLAGS_DIRECT_IO flag

capabilities.7
Michael Kerrisk
Document the CAP_CHECKPOINT_RESTORE capability added in Linux 5.9

ip.7
Stephen Smalley  [Paul Moore]
Document IP_PASSSEC for UDP sockets

ip.7
socket.7
Stephen Smalley
Document SO_PEERSEC for AF_INET sockets
Sridhar Samudrala
Document SO_INCOMING_NAPI_ID

socket.7
unix.7
Stephen Smalley  [Serge Hallyn, Simon McVittie]
Add initial description for SO_PEERSEC


Changes to individual pages
---

clone.2
Michael Kerrisk
CAP_CHECKPOINT_RESTORE can now be used to employ 'set_tid'

epoll_ctl.2
Michael Kerrisk
epoll instances can be nested to a maximum depth of 5
This limit appears to be an off-by-one count against
EP_MAX_NESTS (4).

perf_event_open.2
Alexey Budankov
Update the man page with CAP_PERFMON related information

seccomp.2
Michael Kerrisk  [Jann Horn]
Warn reader that SECCOMP_RET_TRACE can be overridden
Highlight to the reader that if another filter returns a
higher-precedence action value, then the ptracer will not
be notified.
Michael Kerrisk  [Rich Felker]
Warn against the use of SECCOMP_RET_KILL_THREAD
Killing a thread with SECCOMP_RET_KILL_THREAD is very likely
to leave the rest of the process in a broken state.

dlopen.3
Michael Kerrisk
Clarify DT_RUNPATH/DT_RPATH details
It is the DT_RUNPATH/DT_RPATH of the calling object (not the
executable) that is relevant for the library search. Verified
by experiment.

loop.4
Yang Xu
Add some details about lo_flags

proc.5
Michael Kerrisk
Update capability requirements for accessing /proc/[pid]/map_files
Jann Horn  [Mark Mossberg]
Document inaccurate RSS due to SPLIT_RSS_COUNTING
Michael Kerrisk
Note "open file description" as (better) synonym for "file handle"

bpf-helpers.7
Michael Kerrisk  [Jakub Wilk]
Resync with current kernel source

capabilities.7
Michael Kerrisk
Under CAP_SYS_ADMIN, group "sub-capabilities" together
CAP_BPF, CAP_PERFMON, and CAP_CHECKPOINT_RESTORE have all been
added to split out the power of CAP_SYS_ADMIN into weaker pieces.
Group all of these capabilities together in the list under
CAP_SYS_ADMIN, to make it clear that there is a pattern to these
capabilities.

fanotify.7
fanotify_mark.2
Amir Goldstein  [Jan Kara, Matthew Bobrowski]
Generalize documentation of FAN_REPORT_FID

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
Hello Sargun,

Thanks for your reply.

On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> wrote:

[...]

>>> I think I commented in another thread somewhere that the
>>> supervisor is not notified if the syscall is preempted. Therefore
>>> if it is performing a preemptible, long-running syscall, you need
>>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
>>> you can end up in a bad situation -- like leaking resources, or
>>> holding on to file descriptors after the program under
>>> supervision has intended to release them.
>> 
>> It's been a long day, and I'm not sure I really understand this.
>> Could you outline the scenario in more detail?
>> 
> S: Sets up filter + interception for accept
> T: socket(AF_INET, SOCK_STREAM, 0) = 7
> T: bind(7, {127.0.0.1, }, ..)
> T: listen(7, 10)
> T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.

Presumably, the preceding line should have been:

S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
(s/T:/S:/)

right?

> T: accept(7, ...)
> S: Intercepts accept
> S: Does accept in background
> T: Receives signal, and accept(...) responds in EINTR
> T: close(7)
> S: Still running accept(7, ), holding port , so if now T
> retries to bind to port , things fail.

Okay -- I understand. Presumably the solution here is not to 
block in accept(), but rather to use poll() to monitor both the
notification FD and the listening socket FD?
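
The poll()-based approach raised in the question above could be sketched
roughly as follows. This is only an illustration under assumptions: the
helper name waitForNotifyOrListen() and its result codes are hypothetical,
and the thread does not establish whether the notification FD actually
becomes readable when the target's system call is interrupted, so a
periodic SECCOMP_IOCTL_NOTIF_ID_VALID check may still be needed.

```c
#include <poll.h>

/* Hypothetical result codes used only by this sketch */
enum waitResult {
    WAIT_ERROR   = -1,   /* poll() failed; errno is set */
    WAIT_TIMEOUT = 0,    /* Neither FD had an event before the timeout */
    WAIT_NOTIFY  = 1,    /* Some event occurred on the notification FD */
    WAIT_LISTEN  = 2     /* Only the listening socket had an event */
};

/* Wait for an event on either the seccomp notification FD or the
   listening socket, instead of blocking in accept(). */

static enum waitResult
waitForNotifyOrListen(int notifyFd, int listenFd, int timeoutMs)
{
    struct pollfd fds[2] = {
        { .fd = notifyFd, .events = POLLIN },
        { .fd = listenFd, .events = POLLIN },
    };

    int ready = poll(fds, 2, timeoutMs);
    if (ready == -1)
        return WAIT_ERROR;
    if (ready == 0)
        return WAIT_TIMEOUT;

    /* Report the notification FD first, so that the caller can
       re-validate the cookie before touching the socket. */

    if (fds[0].revents != 0)
        return WAIT_NOTIFY;

    return WAIT_LISTEN;
}
```

A supervisor would call accept() only after WAIT_LISTEN is returned, and
would treat WAIT_NOTIFY (or a timeout) as a prompt to check whether the
notification it is servicing is still valid.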

>>> A very specific example is if you're performing an accept on
>>> behalf of the program generating the notification, and the
>>> program intends to reuse the port. You can get into all sorts of
>>> awkward situations there.
>> 
>> [...]
>> 
> See above

[...]

>>> In addition, if it is a socket, it inherits the cgroup v1 classid
>>> and netprioidx of the receiving process.
>>> 
>>> The argument of this is as follows:
>>> 
>>> struct seccomp_notif_addfd {
>>>     __u64 id;
>>>     __u32 flags;
>>>     __u32 srcfd;
>>>     __u32 newfd;
>>>     __u32 newfd_flags;
>>> };
>>> 
>>> id This is the cookie value that was obtained using 
>>> SECCOMP_IOCTL_NOTIF_RECV.
>>> 
>>> flags A bitmask that includes zero or more of the 
>>> SECCOMP_ADDFD_FLAG_* bits set
>>> 
>>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
>>> when copying the file descriptor.
>>> 
>>> srcfd The file descriptor number to copy in the supervisor
>>> process.
>>> 
>>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
>>> be the file descriptor that is used in the dup2 semantics. If
>>> this file descriptor exists in the receiving process, it is
>>> closed and replaced by this file descriptor in an atomic fashion.
>>> If the copy process fails due to a MAC failure, or if srcfd is
>>> invalid, the newfd will not be closed in the receiving process.
>> 
>> Great description!
>> 
>>> If SECCOMP_ADDFD_FLAG_SETFD is not set, then this value must be
>>> 0.
>>> 
>>> newfd_flags The file descriptor flags to set on the file
>>> descriptor after it has been received by the process. The only
>>> flag that can currently be specified is O_CLOEXEC.
>>> 
>>> On success, this operation returns the file descriptor number in
>>> the receiving process. On failure, -1 is returned.
>>> 
>>> It can fail with the following error codes:
>>> 
>>> EINPROGRESS The cookie number specified hasn't been received by
>>> the listener
>> 
>> I don't understand this. Can you say more about the scenario?
>> 
> 
> This should not really happen. But if you do a ADDFD(...), on a
> notification *before* you've received it, you will get this error. So
> for example, 
> --> epoll() -> returns 
> --> RECV(...) cookie id is 777
> --> epoll(...) -> returns
> <-- ioctl(ADDFD, id = 778) # Notice how we haven't done a receive yet
> where we've received a notification for 778.

Got it. Looking also at the source code, I came up with the 
following:

  EINPROGRESS
 The user-space notification specified in the id
 field exists but has not yet been fetched (by a
 SECCOMP_IOCTL_NOTIF_RECV) or has already been
 responded to (by a SECCOMP_IOCTL_NOTIF_SEND).

Does that seem okay?

>>> ENOENT The cookie number is not valid. This can happen if a
>>> response has already been sent, or if the syscall was
>>> interrupted
>>>

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:20 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 2:42 AM, Jann Horn wrote:
>>> As discussed at
>>> <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=q...@mail.gmail.com>,
>>> we need to re-check checkNotificationIdIsValid() after reading remote
>>> memory but before using the read value in any way. Otherwise, the
>>> syscall could in the meantime get interrupted by a signal handler, the
>>> signal handler could return, and then the function that performed the
>>> syscall could free() allocations or return (thereby freeing buffers on
>>> the stack).
>>>
>>> In essence, this pread() is (unavoidably) a potential use-after-free
>>> read; and to make that not have any security impact, we need to check
>>> whether UAF read occurred before using the read value. This should
>>> probably be called out elsewhere in the manpage, too...
>>>
>>> Now, of course, **reading** is the easy case. The difficult case is if
>>> we have to **write** to the remote process... because then we can't
>>> play games like that. If we write data to a freed pointer, we're
>>> screwed, that's it. (And for somewhat unrelated bonus fun, consider
>>> that /proc/$pid/mem is originally intended for process debugging,
>>> including installing breakpoints, and will therefore happily write
>>> over "readonly" private mappings, such as typical mappings of
>>> executable code.)
>>>
>>> So, h... I guess if anyone wants to actually write memory back to
>>> the target process, we'd better come up with some dedicated API for
>>> that, using an ioctl on the seccomp fd that magically freezes the
>>> target process inside the syscall while writing to its memory, or
>>> something like that? And until then, the manpage should have a big fat
>>> warning that writing to the target's memory is simply not possible
>>> (safely).
>>
>> Thank you for your very clear explanation! It turned out to be
>> trivially easy to demonstrate this issue with a slightly modified
>> version of my program.
>>
>> As well as the change to the code example that I already mentioned
>> in my reply of a few hours ago, I've added the following text to the
>> page:
>>
>>Caveats regarding the use of /proc/[tid]/mem
>>The discussion above noted the need to use the
>>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
>>/proc/[tid]/mem file of the target to avoid the possibility of
>>accessing the memory of the wrong process in the event that the
>>target terminates and its ID is recycled by another (unrelated)
>>thread.  However, the use of this ioctl(2) operation is also
>>necessary in other situations, as explained in the following
>>pargraphs.
> 
> (nit: paragraphs)

I spotted that one also already. But thanks for reading carefully!

>>Consider the following scenario, where the supervisor tries to
>>read the pathname argument of a target's blocked mount(2) system
>>call:
> [...]
>> Seem okay?
> 
> Yeah, sounds good.
> 
>> By the way, is there any analogous kind of issue concerning
>> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
>> something.
> 
> When it is used by a seccomp supervisor, you mean? I think basically
> the same thing applies - when resource identifiers (such as memory
> addresses or file descriptors) are passed to a syscall, it generally
> has to be assumed that those identifiers may become invalid and be
> reused as soon as the syscall has returned.

I probably needed to be more explicit. Would the following (i.e., a
single cookie check) not be sufficient to handle the above scenario?
Here, the target is making a system call that employs the file
descriptor 'tfd':

T: makes syscall that triggers notification
S: Get notification
S: pidfd = pidfd_open(T, 0);
S: sfd = pidfd_getfd(pidfd, tfd, 0)
S: check that the cookie is still valid
S: do operation with sfd [*]

By contrast, I can see that we might want to do multiple cookie
checks in the /proc/PID/mem case, since the supervisor might do
multiple reads.

Or, do you mean: there really needs to be another cookie check after
the point [*], since, if the target's syscall was interrupted
and 'tfd' was closed/reused, then the supervisor would be operating
with a file descriptor that refers to an open file description
(a "struct file") that is no longer meaningful in the target?
(Thinking about it, I think this probably is what you mean, but 
I want to confirm.)
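
If the second reading is the right one, the pattern would be a cookie
check on both sides of the operation at [*]. The following is a minimal
sketch under that assumption; cookieIsValid(), fdOperation, and
operateWithCookieChecks() are hypothetical names introduced here, and the
operation is assumed to return a nonnegative value on success so that -1
can signal a stale cookie.

```c
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Return true if the notification identified by 'id' is still pending
   (i.e., the target is still blocked in the same system call) */

static bool
cookieIsValid(int notifyFd, __u64 id)
{
    return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/* Hypothetical callback type: some operation on the duplicated FD */

typedef int (*fdOperation)(int sfd);

/* Perform 'op' on 'sfd' with a cookie check before and after.
   Returns the result of 'op', or -1 if either check fails, in which
   case the result of 'op' (if it ran) is discarded. */

static int
operateWithCookieChecks(int notifyFd, __u64 id, int sfd, fdOperation op)
{
    if (!cookieIsValid(notifyFd, id))
        return -1;              /* Target already moved on; do nothing */

    int result = op(sfd);

    if (!cookieIsValid(notifyFd, id))
        return -1;              /* 'tfd' may have been closed/reused */

    return result;
}
```

Note that the second check can only tell the supervisor that the result
should not be trusted; it cannot undo side effects that the operation
already had.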

Thanks,

Michael


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:14 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 3:19 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 2:42 AM, Jann Horn wrote:
>>> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>>>  wrote:
>>>>static bool
>>>>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>>>  char *path, size_t len)
>>>>{
>>>>char procMemPath[PATH_MAX];
>>>>
>>>>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
>>>> req->pid);
>>>>
>>>>int procMemFd = open(procMemPath, O_RDONLY);
>>>>if (procMemFd == -1)
>>>>errExit("\tS: open");
>>>>
>>>>/* Check that the process whose info we are accessing is still 
>>>> alive.
>>>>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>>>   in checkNotificationIdIsValid()) succeeds, we know that the
>>>>   /proc/PID/mem file descriptor that we opened corresponds to 
>>>> the
>>>>   process for which we received a notification. If that process
>>>>   subsequently terminates, then read() on that file descriptor
>>>>   will return 0 (EOF). */
>>>>
>>>>checkNotificationIdIsValid(notifyFd, req->id);
>>>>
>>>>/* Read bytes at the location containing the pathname argument
>>>>   (i.e., the first argument) of the mkdir(2) call */
>>>>
>>>>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>>>if (nread == -1)
>>>>errExit("pread");
>>>
>>> As discussed at
>>> <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=q...@mail.gmail.com>,
>>> we need to re-check checkNotificationIdIsValid() after reading remote
>>> memory but before using the read value in any way. Otherwise, the
>>> syscall could in the meantime get interrupted by a signal handler, the
>>> signal handler could return, and then the function that performed the
>>> syscall could free() allocations or return (thereby freeing buffers on
>>> the stack).
>>>
>>> In essence, this pread() is (unavoidably) a potential use-after-free
>>> read; and to make that not have any security impact, we need to check
>>> whether UAF read occurred before using the read value. This should
>>> probably be called out elsewhere in the manpage, too...
>>
>> Thanks very much for pointing me at this!
>>
>> So, I want to confirm that the fix to the code is as simple as
>> adding a check following the pread() call, something like:
>>
>> [[
>>  ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
>>  if (nread == -1)
>> errExit("Supervisor: pread");
>>
>>  if (nread == 0) {
>> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>> "returned 0 (EOF)\n");
>> exit(EXIT_FAILURE);
>>  }
>>
>>  if (close(procMemFd) == -1)
>> errExit("Supervisor: close-/proc/PID/mem");
>>
>> +/* Once again check that the notification ID is still valid. The
>> +   case we are particularly concerned about here is that just
>> +   before we fetched the pathname, the target's blocked system
>> +   call was interrupted by a signal handler, and after the handler
>> +   returned, the target carried on execution (past the interrupted
>> +   system call). In that case, we have no guarantees about what we
>> +   are reading, since the target's memory may have been arbitrarily
>> +   changed by subsequent operations. */
>> +
>> +if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
>> +return false;
>> +
>>  /* We have no guarantees about what was in the memory of the target
>> process. We therefore treat the buffer returned by pread() as
>> untrusted input. The buffer should be terminated by a null byte;
>> if not, then we will trigger an error for the target process. */
>>
>>  if (strnlen(path, nread) < nread)
>>  return true;
>> ]]
> 
> Yeah, that should do the job. 

Thanks.
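
The final step of the quoted code, treating the bytes returned by pread()
as untrusted input, can be isolated into a small helper. This is a sketch
only: bufferHoldsTerminatedString() is a hypothetical name, and it uses
memchr() rather than the strnlen() call in the quoted code, which should
be equivalent for this purpose.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Return true if the first 'nread' bytes of 'buf' contain a null
   terminator, so that 'buf' can safely be used as a C string.
   A nonpositive 'nread' (error or EOF from pread()) yields false. */

static bool
bufferHoldsTerminatedString(const char *buf, ssize_t nread)
{
    if (nread <= 0)
        return false;

    return memchr(buf, '\0', (size_t) nread) != NULL;
}
```

This check says nothing about whether the bytes came from the right
place in the target's memory; that is what the second notification ID
check is for.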

> With the caveat that a canc

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:24 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 8:53 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 4:26 PM, Christian Brauner wrote:
>>> I like this manpage. I think this is the most comprehensive explanation
>>> of any seccomp feature
>>
>> Thanks (at least, I think so...)
>>
>>> and somewhat understandable.
>>   
>>
>> (... but I'm not sure ;-).)
> 
> Relevant: http://tinefetz.net/files/gimgs/78_78_17.jpg

Perfekt :-).




Re: [PATCH 2/2] futex.2: Use appropriate types

2020-10-30 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 10/30/20 2:46 PM, Alejandro Colomar wrote:
> BTW, apparently the kernel doesn't use 'const' for 'utime'
> ('timeout' in the manual page),
> but effectively, it doesn't modify it, AFAICS.
> 
> Should the kernel use 'const'?
> Is there a reason for the kernel not using 'const'?
> Should we do anything about it in the manual page?

I'm not sure about the kernel, but I think we don't need to 
worry in the manual page.

Thanks,

Michael

> On 2020-10-30 13:39, Alejandro Colomar wrote:
>> The Linux kernel uses the following:
>>
>> kernel/futex.c:3778:
>> SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>>  struct __kernel_timespec __user *, utime, u32 __user *, uaddr2,
>>  u32, val3)
>>
>> Since there is no glibc wrapper, use the same types the kernel uses.
>>
>> Signed-off-by: Alejandro Colomar 
>> ---
>>   man2/futex.2 | 27 ++-
>>   1 file changed, 14 insertions(+), 13 deletions(-)
>>
>> diff --git a/man2/futex.2 b/man2/futex.2
>> index 837adbd25..73de71623 100644
>> --- a/man2/futex.2
>> +++ b/man2/futex.2
>> @@ -26,12 +26,13 @@ futex \- fast user-space locking
>>   .nf
>>   .PP
>>   .B #include 
>> +.B #include 
>>   .B #include 
>>   .PP
>> -.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
>> +.BI "long futex(uint32_t *" uaddr ", int " futex_op ", uint32_t " val ,
>>   .BI "  const struct timespec *" timeout , \
>>   " \fR  /* or: \fBuint32_t \fIval2\fP */"
>> -.BI "  int *" uaddr2 ", int " val3 );
>> +.BI "  uint32_t *" uaddr2 ", uint32_t " val3 );
>>   .fi
>>   .PP
>>   .IR Note :
>> @@ -581,8 +582,8 @@ any of the two supplied futex words:
>>   .IP
>>   .in +4n
>>   .EX
>> -int oldval = *(int *) uaddr2;
>> -*(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
>> +uint32_t oldval = *(uint32_t *) uaddr2;
>> +*(uint32_t *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
>>   futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
>>   if (oldval \fIcmp\fP \fIcmparg\fP)
>>   futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
>> @@ -1765,11 +1766,11 @@ Child  (18535) 4
>>   #define errExit(msg)do { perror(msg); exit(EXIT_FAILURE); \e
>>   } while (0)
>>   
>> -static int *futex1, *futex2, *iaddr;
>> +static uint32_t *futex1, *futex2, *iaddr;
>>   
>>   static int
>> -futex(int *uaddr, int futex_op, int val,
>> -  const struct timespec *timeout, int *uaddr2, int val3)
>> +futex(uint32_t *uaddr, int futex_op, uint32_t val,
>> +  const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
>>   {
>>   return syscall(SYS_futex, uaddr, futex_op, val,
>>  timeout, uaddr2, val3);
>> @@ -1779,9 +1780,9 @@ futex(int *uaddr, int futex_op, int val,
>>  become 1, and then set the value to 0. */
>>   
>>   static void
>> -fwait(int *futexp)
>> +fwait(uint32_t *futexp)
>>   {
>> -int s;
>> +long s;
>>   
>>   /* atomic_compare_exchange_strong(ptr, oldval, newval)
>>  atomically performs the equivalent of:
>> @@ -1794,7 +1795,7 @@ fwait(int *futexp)
>>   while (1) {
>>   
>>   /* Is the futex available? */
>> -const int one = 1;
>> +const uint32_t one = 1;
>>   if (atomic_compare_exchange_strong(futexp, &one, 0))
>>   break;  /* Yes */
>>   
>> @@ -1811,13 +1812,13 @@ fwait(int *futexp)
>>  so that if the peer is blocked in fpost(), it can proceed. */
>>   
>>   static void
>> -fpost(int *futexp)
>> +fpost(uint32_t *futexp)
>>   {
>> -int s;
>> +long s;
>>   
>>   /* atomic_compare_exchange_strong() was described in comments above */
>>   
>> -const int zero = 0;
>> +const uint32_t zero = 0;
>>   if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
>>   s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
>>   if (s  == \-1)
>>




Re: [PATCH 1/2] futex.2: srcfix

2020-10-30 Thread Michael Kerrisk (man-pages)
On 10/30/20 1:39 PM, Alejandro Colomar wrote:
> Signed-off-by: Alejandro Colomar 

Hi Alex,

I've applied this patch, but would prefer to avoid such
patches in the future. Nothing is actually broken in the 
old version, so I tend to regard such patches as unnecessary
churn.

Thanks,

Michael

> ---
>  man2/futex.2 | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/man2/futex.2 b/man2/futex.2
> index f82602c11..837adbd25 100644
> --- a/man2/futex.2
> +++ b/man2/futex.2
> @@ -25,8 +25,8 @@ futex \- fast user-space locking
>  .SH SYNOPSIS
>  .nf
>  .PP
> -.B "#include "
> -.B "#include "
> +.B #include 
> +.B #include 
>  .PP
>  .BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
>  .BI "  const struct timespec *" timeout , \
> 




Re: [PATCH 2/2] futex.2: Use appropriate types

2020-10-30 Thread Michael Kerrisk (man-pages)
On 10/30/20 1:39 PM, Alejandro Colomar wrote:
> The Linux kernel uses the following:
> 
> kernel/futex.c:3778:
> SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>   struct __kernel_timespec __user *, utime, u32 __user *, uaddr2,
>   u32, val3)
> 
> Since there is no glibc wrapper, use the same types the kernel uses.

Thanks. Patch applied.

Cheers,

Michael

> Signed-off-by: Alejandro Colomar 
> ---
>  man2/futex.2 | 27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/man2/futex.2 b/man2/futex.2
> index 837adbd25..73de71623 100644
> --- a/man2/futex.2
> +++ b/man2/futex.2
> @@ -26,12 +26,13 @@ futex \- fast user-space locking
>  .nf
>  .PP
>  .B #include 
> +.B #include 
>  .B #include 
>  .PP
> -.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
> +.BI "long futex(uint32_t *" uaddr ", int " futex_op ", uint32_t " val ,
>  .BI "  const struct timespec *" timeout , \
>  " \fR  /* or: \fBuint32_t \fIval2\fP */"
> -.BI "  int *" uaddr2 ", int " val3 );
> +.BI "  uint32_t *" uaddr2 ", uint32_t " val3 );
>  .fi
>  .PP
>  .IR Note :
> @@ -581,8 +582,8 @@ any of the two supplied futex words:
>  .IP
>  .in +4n
>  .EX
> -int oldval = *(int *) uaddr2;
> -*(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
> +uint32_t oldval = *(uint32_t *) uaddr2;
> +*(uint32_t *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
>  futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
>  if (oldval \fIcmp\fP \fIcmparg\fP)
>  futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
> @@ -1765,11 +1766,11 @@ Child  (18535) 4
>  #define errExit(msg)do { perror(msg); exit(EXIT_FAILURE); \e
>  } while (0)
>  
> -static int *futex1, *futex2, *iaddr;
> +static uint32_t *futex1, *futex2, *iaddr;
>  
>  static int
> -futex(int *uaddr, int futex_op, int val,
> -  const struct timespec *timeout, int *uaddr2, int val3)
> +futex(uint32_t *uaddr, int futex_op, uint32_t val,
> +  const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
>  {
>  return syscall(SYS_futex, uaddr, futex_op, val,
> timeout, uaddr2, val3);
> @@ -1779,9 +1780,9 @@ futex(int *uaddr, int futex_op, int val,
> become 1, and then set the value to 0. */
>  
>  static void
> -fwait(int *futexp)
> +fwait(uint32_t *futexp)
>  {
> -int s;
> +long s;
>  
>  /* atomic_compare_exchange_strong(ptr, oldval, newval)
> atomically performs the equivalent of:
> @@ -1794,7 +1795,7 @@ fwait(int *futexp)
>  while (1) {
>  
>  /* Is the futex available? */
> -const int one = 1;
> +const uint32_t one = 1;
>  if (atomic_compare_exchange_strong(futexp, &one, 0))
>  break;  /* Yes */
>  
> @@ -1811,13 +1812,13 @@ fwait(int *futexp)
> so that if the peer is blocked in fpost(), it can proceed. */
>  
>  static void
> -fpost(int *futexp)
> +fpost(uint32_t *futexp)
>  {
> -int s;
> +long s;
>  
>  /* atomic_compare_exchange_strong() was described in comments above */
>  
> -const int zero = 0;
> +const uint32_t zero = 0;
>  if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
>  s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
>  if (s  == \-1)
> 




Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Sargun,

On 10/29/20 9:53 AM, Sargun Dhillon wrote:
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:

[...]

>>ioctl(2) operations
>>The following ioctl(2) operations are provided to support seccomp
>>user-space notification.  For each of these operations, the first
>>(file descriptor) argument of ioctl(2) is the listening file
>>descriptor returned by a call to seccomp(2) with the
>>SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>>
>>SECCOMP_IOCTL_NOTIF_RECV
>>   This operation is used to obtain a user-space notification
>>   event.  If no such event is currently pending, the
>>   operation blocks until an event occurs.  The third
>>   ioctl(2) argument is a pointer to a structure of the
>>   following form which contains information about the event.
>>   This structure must be zeroed out before the call.
>>
>>   struct seccomp_notif {
>>   __u64  id;  /* Cookie */
>>   __u32  pid; /* TID of target thread */
>>   __u32  flags;   /* Currently unused (0) */
>>   struct seccomp_data data;   /* See seccomp(2) */
>>   };
>>
>>   The fields in this structure are as follows:
>>
>>   id This is a cookie for the notification.  Each such
>>  cookie is guaranteed to be unique for the
>>  corresponding seccomp filter.
>>
>>  · It can be used with the
>>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
>>to verify that the target is still alive.
>>
>>  · When returning a notification response to the
>>kernel, the supervisor must include the cookie
>>value in the seccomp_notif_resp structure that is
>>specified as the argument of the
>>SECCOMP_IOCTL_NOTIF_SEND operation.
>>
>>   pidThis is the thread ID of the target thread that
>>  triggered the notification event.
>>
>>   flags  This is a bit mask of flags providing further
>>  information on the event.  In the current
>>  implementation, this field is always zero.
>>
>>   data   This is a seccomp_data structure containing
>>  information about the system call that triggered
>>  the notification.  This is the same structure that
>>  is passed to the seccomp filter.  See seccomp(2)
>>  for details of this structure.
>>
>>   On success, this operation returns 0; on failure, -1 is
>>   returned, and errno is set to indicate the cause of the
>>   error.  This operation can fail with the following errors:
>>
>>   EINVAL (since Linux 5.5)
>>  The seccomp_notif structure that was passed to the
>>  call contained nonzero fields.
>>
>>   ENOENT The target thread was killed by a signal as the
>>  notification information was being generated, or
>>  the target's (blocked) system call was interrupted
>>  by a signal handler.
>>
>>┌─────────────────────────────────────────────────────┐
>>│FIXME                                                │
>>├─────────────────────────────────────────────────────┤
>>│From my experiments, it appears that if a            │
>>│SECCOMP_IOCTL_NOTIF_RECV is done after the target    │
>>│thread terminates, then the ioctl() simply blocks    │
>>│(rather than returning an error to indicate that the │
>>│target no longer exists).                            │
>>│                                                     │
>>│I found that surprising, and it required some        │
>>│contortions in the example program.  It was not      │
>>│possible to code my SIGCHLD handler (which reaps the │
>>│zombie when the worker/target terminates) to simply  │
>>│set a flag checked in the main handleNotifications() │
>>│loop, since this created an unavoidable race where   │
>>│the child might term

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Christian,

Thanks for taking a look at the page.

On 10/29/20 4:26 PM, Christian Brauner wrote:
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
>> Hi all (and especially Tycho and Sargun),
>>
>> Following review comments on the first draft (thanks to Jann, Kees,
>> Christian and Tycho), I've made a lot of changes to this page.
>> I've also added a few FIXMEs relating to outstanding API issues.
>> I'd like a second pass review of the page before I release it.
>> But also, this mail serves as a way of noting the outstanding API
>> issues.
>>
>> Tycho: I still have an outstanding question for you at [2].
>>
>> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
>> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>>
>> I've shown the rendered version of the page below. The page source
>> currently sits in a branch at
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>>
>> At this point, I'm mainly interested in feedback about the FIXMEs,
>> some of which relate to the text of the page itself, while the
>> others relate to the various outstanding API issues. The first 
>> FIXME provides a small opportunity for some bikeshedding :-);
> 
> I like this manpage. I think this is the most comprehensive explanation
> of any seccomp feature

Thanks (at least, I think so...)

> and somewhat understandable.
  

(... but I'm not sure ;-).)

> Just tiny comments below, hopefully no bike-shedding though. :)

Most relevant point for bikeshedding is the page name. I plan 
to change it to seccomp_unotify(2) (shorter, reads better out loud).

>> Thanks,
>>
>> Michael
>>
>> [1] 
>> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
>> [2] 
>> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
>>
>> =
>>
>> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)

[...]

>>An overview of the steps performed by the target and the
>>supervisor is as follows:
>>
>>1. The target establishes a seccomp filter in the usual manner,
>>   but with two differences:
>>
>>   · The seccomp(2) flags argument includes the flag
>> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>> value of the (successful) seccomp(2) call is a new
>> "listening" file descriptor that can be used to receive
>> notifications.  Only one "listening" seccomp filter can be
>> installed for a thread.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME                                                │
>> ├─────────────────────────────────────────────────────┤
>> │Is the last sentence above correct?                  │
>> │                                                     │
>> │Kees Cook (25 Oct 2020) notes:                       │
>> │                                                     │
>> │I like this limitation, but I expect that it'll need │
>> │to change in the future. Even with LSMs, we see the  │
>> │need for arbitrary stacking, and the idea of there   │
>> │being only 1 supervisor will eventually break down.  │
>> │Right now there is only 1 because only container     │
>> │managers are using this feature. But if some daemon  │
>> │starts using it to isolate some thread, suddenly it  │
>> │might break if a container manager is trying to      │
>> │listen to it too, etc. I expect it won't be needed   │
>> │soon, but I do think it'll change.                   │
>> │                                                     │
>> └─────────────────────────────────────────────────────┘
>>
>>   · In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF.  This return value
>> will trigger a notification event.
>>
>>2. In order that the supervisor can obtain notifications using
>>   the listening file descriptor, (a duplicate of) that file
>>   descriptor must be passed from the target to the supervisor.
>>   One way in which this could be done is by passing the file
>>   descriptor over a UNIX domain socket connection between the

Re: [PATCH v3] getdents.2: Use appropriate types

2020-10-29 Thread Michael Kerrisk (man-pages)
On 10/29/20 3:10 PM, Alejandro Colomar wrote:
> getdents():
> This function has no glibc wrapper.
> As such, we should use the same types the Linux kernel uses:
> Use 'long' as the return type.
> 
> getdents64():
> The glibc wrapper uses:
> ssize_t getdents64(int, void *, size_t);
> 
> Signed-off-by: Alejandro Colomar 

Thanks, Alex. Applied.

Cheers,

Michael


> ---
> 
> Hi Michael,
> 
> Sorry, I'm being a bit distracted these days :)
> It should be good enough now, I think.
> 
> Cheers,
> 
> Alex
> 
>  man2/getdents.2 | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/man2/getdents.2 b/man2/getdents.2
> index a187fbcef..ed3bb40b1 100644
> --- a/man2/getdents.2
> +++ b/man2/getdents.2
> @@ -33,14 +33,13 @@
>  getdents, getdents64 \- get directory entries
>  .SH SYNOPSIS
>  .nf
> -.BI "int getdents(unsigned int " fd ", struct linux_dirent *" dirp ,
> +.BI "long getdents(unsigned int " fd ", struct linux_dirent *" dirp ,
>  .BI " unsigned int " count );
>  .PP
>  .BR "#define _GNU_SOURCE" "/* See feature_test_macros(7) */"
>  .B #include 
>  .PP
> -.BI "int getdents64(unsigned int " fd ", struct linux_dirent64 *" dirp ,
> -.BI " unsigned int " count );
> +.BI "ssize_t getdents64(int " fd ", void *" dirp ", size_t " count );
>  .fi
>  .PP
>  .IR Note :
> @@ -282,7 +281,8 @@ struct linux_dirent {
>  int
>  main(int argc, char *argv[])
>  {
> -int fd, nread;
> +int fd;
> +long nread;
>  char buf[BUF_SIZE];
>  struct linux_dirent *d;
>  char d_type;
> @@ -301,7 +301,7 @@ main(int argc, char *argv[])
>  
>  printf("\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- nread=%d 
> \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\en", nread);
>  printf("inode#file type  d_reclen  d_off   d_name\en");
> -for (int bpos = 0; bpos < nread;) {
> +for (long bpos = 0; bpos < nread;) {
>  d = (struct linux_dirent *) (buf + bpos);
>  printf("%8ld  ", d\->d_ino);
>  d_type = *(buf + bpos + d\->d_reclen \- 1);
> 
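
To illustrate the 'long' return type in use, here is a small sketch that
walks a directory with the raw getdents64 system call. It is not taken
from the page: countDirEntries() is a hypothetical helper, the structure
is declared by hand on the assumption that no included header provides
it, and the raw system call is used rather than the glibc wrapper so that
the return type is the kernel's.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of the records returned by getdents64; declared here because
   this sketch assumes no header supplies it */

struct linux_dirent64 {
    uint64_t       d_ino;       /* Inode number */
    int64_t        d_off;       /* Offset to next record */
    unsigned short d_reclen;    /* Length of this record */
    unsigned char  d_type;      /* File type */
    char           d_name[];    /* Null-terminated filename */
};

/* Count the entries in the directory 'path' (including "." and ".."
   where the filesystem reports them). Returns -1 on error. */

static long
countDirEntries(const char *path)
{
    _Alignas(8) char buf[4096];
    long total = 0;

    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (nread == -1) {
            close(fd);
            return -1;
        }
        if (nread == 0)         /* End of directory */
            break;

        /* Step through the variable-length records in the buffer */

        for (long bpos = 0; bpos < nread;) {
            struct linux_dirent64 *d =
                    (struct linux_dirent64 *) (buf + bpos);
            total++;
            bpos += d->d_reclen;
        }
    }

    close(fd);
    return total;
}
```

Keeping both 'nread' and 'bpos' as 'long' avoids a signed/width mismatch
in the inner loop comparison, which is the point of the last hunk in the
patch.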




Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/29/20 2:42 AM, Jann Horn wrote:
> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>  wrote:
>>static bool
>>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>  char *path, size_t len)
>>{
>>char procMemPath[PATH_MAX];
>>
>>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
>> req->pid);
>>
>>int procMemFd = open(procMemPath, O_RDONLY);
>>if (procMemFd == -1)
>>errExit("\tS: open");
>>
>>/* Check that the process whose info we are accessing is still 
>> alive.
>>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>   in checkNotificationIdIsValid()) succeeds, we know that the
>>   /proc/PID/mem file descriptor that we opened corresponds to the
>>   process for which we received a notification. If that process
>>   subsequently terminates, then read() on that file descriptor
>>   will return 0 (EOF). */
>>
>>checkNotificationIdIsValid(notifyFd, req->id);
>>
>>/* Read bytes at the location containing the pathname argument
>>   (i.e., the first argument) of the mkdir(2) call */
>>
>>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>if (nread == -1)
>>errExit("pread");
> 
> As discussed at
> <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=q...@mail.gmail.com>,
> we need to re-check checkNotificationIdIsValid() after reading remote
> memory but before using the read value in any way. Otherwise, the
> syscall could in the meantime get interrupted by a signal handler, the
> signal handler could return, and then the function that performed the
> syscall could free() allocations or return (thereby freeing buffers on
> the stack).
> 
> In essence, this pread() is (unavoidably) a potential use-after-free
> read; and to make that not have any security impact, we need to check
> whether UAF read occurred before using the read value. This should
> probably be called out elsewhere in the manpage, too...
> 
> Now, of course, **reading** is the easy case. The difficult case is if
> we have to **write** to the remote process... because then we can't
> play games like that. If we write data to a freed pointer, we're
> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> that /proc/$pid/mem is originally intended for process debugging,
> including installing breakpoints, and will therefore happily write
> over "readonly" private mappings, such as typical mappings of
> executable code.)
> 
> So, h... I guess if anyone wants to actually write memory back to
> the target process, we'd better come up with some dedicated API for
> that, using an ioctl on the seccomp fd that magically freezes the
> target process inside the syscall while writing to its memory, or
> something like that? And until then, the manpage should have a big fat
> warning that writing to the target's memory is simply not possible
> (safely).

Thank you for your very clear explanation! It turned out to be 
trivially easy to demonstrate this issue with a slightly modified
version of my program.

As well as the change to the code example that I already mentioned
in my reply of a few hours ago, I've added the following text to the
page:

   Caveats regarding the use of /proc/[tid]/mem
   The discussion above noted the need to use the
   SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
   /proc/[tid]/mem file of the target to avoid the possibility of
   accessing the memory of the wrong process in the event that the
   target terminates and its ID is recycled by another (unrelated)
   thread.  However, the use of this ioctl(2) operation is also
   necessary in other situations, as explained in the following
   paragraphs.

   Consider the following scenario, where the supervisor tries to
   read the pathname argument of a target's blocked mount(2) system
   call:

   • From one of its functions (func()), the target calls mount(2),
 which triggers a user-space notification and causes the target
 to block.

   • The supervisor receives the notification, opens
 /proc/[tid]/mem, and (successfully) performs the
 SECCOMP_IOCTL_NOTIF_ID_VALID check.

   • The target receives a signal, which causes the mount(2) to
 abort.

   • The signal handler executes in the target, and returns.

   • Upon return from the handler, the execution of fu

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/29/20 2:42 AM, Jann Horn wrote:
> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>  wrote:
>>static bool
>>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>  char *path, size_t len)
>>{
>>char procMemPath[PATH_MAX];
>>
>>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>>
>>int procMemFd = open(procMemPath, O_RDONLY);
>>if (procMemFd == -1)
>>errExit("\tS: open");
>>
>>/* Check that the process whose info we are accessing is still alive.
>>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>   in checkNotificationIdIsValid()) succeeds, we know that the
>>   /proc/PID/mem file descriptor that we opened corresponds to the
>>   process for which we received a notification. If that process
>>   subsequently terminates, then read() on that file descriptor
>>   will return 0 (EOF). */
>>
>>checkNotificationIdIsValid(notifyFd, req->id);
>>
>>/* Read bytes at the location containing the pathname argument
>>   (i.e., the first argument) of the mkdir(2) call */
>>
>>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>if (nread == -1)
>>errExit("pread");
> 
> As discussed at
> <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=q...@mail.gmail.com>,
> we need to re-check checkNotificationIdIsValid() after reading remote
> memory but before using the read value in any way. Otherwise, the
> syscall could in the meantime get interrupted by a signal handler, the
> signal handler could return, and then the function that performed the
> syscall could free() allocations or return (thereby freeing buffers on
> the stack).
> 
> In essence, this pread() is (unavoidably) a potential use-after-free
> read; and to make that not have any security impact, we need to check
> whether UAF read occurred before using the read value. This should
> probably be called out elsewhere in the manpage, too...

Thanks very much for pointing me at this!

So, I want to confirm that the fix to the code is as simple as
adding a check following the pread() call, something like:

[[
     ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
     if (nread == -1)
         errExit("Supervisor: pread");
 
     if (nread == 0) {
         fprintf(stderr, "\tS: pread() of /proc/PID/mem "
                 "returned 0 (EOF)\n");
         exit(EXIT_FAILURE);
     }
 
     if (close(procMemFd) == -1)
         errExit("Supervisor: close-/proc/PID/mem");
 
+    /* Once again check that the notification ID is still valid. The
+       case we are particularly concerned about here is that just
+       before we fetched the pathname, the target's blocked system
+       call was interrupted by a signal handler, and after the handler
+       returned, the target carried on execution (past the interrupted
+       system call). In that case, we have no guarantees about what we
+       are reading, since the target's memory may have been arbitrarily
+       changed by subsequent operations. */
+
+    if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
+        return false;
+
     /* We have no guarantees about what was in the memory of the target
        process. We therefore treat the buffer returned by pread() as
        untrusted input. The buffer should be terminated by a null byte;
        if not, then we will trigger an error for the target process. */
 
     if (strnlen(path, nread) < nread)
         return true;
]]
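
The diff above calls notificationIdIsValid() without showing its body.
A minimal sketch of such a helper follows, assuming it simply wraps the
SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) and reports failure to the caller
rather than exiting. Treating the third argument as a free-form tag for
the diagnostic message is an assumption; the helper in the actual page
may differ.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: returns true if the notification 'id' is still
   valid on 'notifyFd', false otherwise (including on ioctl() errors).
   'tag' only labels the diagnostic message. */
bool
notificationIdIsValid(int notifyFd, uint64_t id, const char *tag)
{
    __u64 kernelId = id;    /* The ioctl() takes a pointer to a __u64 */

    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &kernelId) == -1) {
        fprintf(stderr, "\tS: notification ID check (%s): %s\n",
                tag, strerror(errno));
        return false;
    }

    return true;
}
```

The caller is expected to discard whatever it read from /proc/PID/mem
when this returns false, as in the "post-open" check in the diff.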

> Now, of course, **reading** is the easy case. The difficult case is if
> we have to **write** to the remote process... because then we can't
> play games like that. If we write data to a freed pointer, we're
> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> that /proc/$pid/mem is originally intended for process debugging,
> including installing breakpoints, and will therefore happily write
> over "readonly" private mappings, such as typical mappings of
> executable code.)
> 
> So, h... I guess if anyone wants to actually write memory back to
> the target process, we'd better come up with some dedicated API for
> that, using an ioctl on the seccomp fd that magically freezes the
> target process inside the syscall while writing to its memory, or
> something like that? And until then, the manpage should have a big fat
> warning that writi

Re: [PATCH v2] getdents.2: Use appropriate types

2020-10-29 Thread Michael Kerrisk (man-pages)
Hi Alex,

On Thu, 29 Oct 2020 at 14:42, Alejandro Colomar  wrote:
>
> getdents():
> This function has no glibc wrapper.
> As such, we should use the same types the Linux kernel uses:
> Use 'long' as the return type.
>
> getdents64():
> The glibc wrapper uses ssize_t for the return type,
> and 'size_t' for the count argument.

Take a look in the header file at the argument types for getdents64();
there are still some changes needed.
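
For reference, here is a minimal sketch of calling the raw system call
directly, which is where the 'long' return type in the patch comes
from: syscall(2) returns long. The function name countDirEntries() and
the local struct definition are illustrative assumptions, not taken
from the page.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout returned by the getdents64 system call (declared
   locally because the raw system call is used) */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: count the entries (including "." and "..")
   in the directory 'path'; returns -1 on error */
long
countDirEntries(const char *path)
{
    _Alignas(uint64_t) char buf[4096];
    long total = 0;

    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (nread == -1) {
            close(fd);
            return -1;
        }
        if (nread == 0)         /* End of directory */
            break;

        /* Walk the variable-length records in the buffer */
        for (long bpos = 0; bpos < nread;) {
            struct linux_dirent64 *d =
                    (struct linux_dirent64 *) (buf + bpos);
            total++;
            bpos += d->d_reclen;
        }
    }

    close(fd);
    return total;
}
```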

Thanks,

Michael


> Signed-off-by: Alejandro Colomar 
> ---
>  man2/getdents.2 | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/man2/getdents.2 b/man2/getdents.2
> index a187fbcef..e14627e6e 100644
> --- a/man2/getdents.2
> +++ b/man2/getdents.2
> @@ -33,14 +33,14 @@
>  getdents, getdents64 \- get directory entries
>  .SH SYNOPSIS
>  .nf
> -.BI "int getdents(unsigned int " fd ", struct linux_dirent *" dirp ,
> +.BI "long getdents(unsigned int " fd ", struct linux_dirent *" dirp ,
>  .BI " unsigned int " count );
>  .PP
>  .BR "#define _GNU_SOURCE" "/* See feature_test_macros(7) */"
>  .B #include 
>  .PP
> -.BI "int getdents64(unsigned int " fd ", struct linux_dirent64 *" dirp ,
> -.BI " unsigned int " count );
> +.BI "ssize_t getdents64(unsigned int " fd ", struct linux_dirent64 *" dirp ,
> +.BI " size_t " count );
>  .fi
>  .PP
>  .IR Note :
> @@ -282,7 +282,8 @@ struct linux_dirent {
>  int
>  main(int argc, char *argv[])
>  {
> -int fd, nread;
> +int fd;
> +long nread;
>  char buf[BUF_SIZE];
>  struct linux_dirent *d;
>  char d_type;
> @@ -301,7 +302,7 @@ main(int argc, char *argv[])
>
>  printf("\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- nread=%d \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\en", nread);
>  printf("inode#file type  d_reclen  d_off   d_name\en");
> -for (int bpos = 0; bpos < nread;) {
> +for (long bpos = 0; bpos < nread;) {
>  d = (struct linux_dirent *) (buf + bpos);
>  printf("%8ld  ", d\->d_ino);
>  d_type = *(buf + bpos + d\->d_reclen \- 1);
> --
> 2.28.0
>




Re: [PATCH v2] perf_event_open.2: update the man page with CAP_PERFMON related information

2020-10-27 Thread Michael Kerrisk (man-pages)
On Tue, 27 Oct 2020 at 18:10, Alexey Budankov
 wrote:
>
>
> On 27.10.2020 19:57, Michael Kerrisk (man-pages) wrote:
> > Hello Alexey,
> >
> > On 10/27/20 5:48 PM, Alexey Budankov wrote:
> >>
> >> Extend perf_event_open 2 man page with the information about
> >> CAP_PERFMON capability designed to secure performance monitoring
> >> and observability operation in a system according to the principle
> >> of least privilege [1] (POSIX IEEE 1003.1e, 2.2.2.39).
> >>
> >> [1] https://sites.google.com/site/fullycapable/, posix_1003.1e-990310.pdf
> >>
> >> Signed-off-by: Alexey Budankov 
> >
> > Thanks for this. I've applied. I have a few questions/comments below.
> >
> >> ---
> >>  man2/perf_event_open.2 | 32 ++--
> >>  1 file changed, 30 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
> >> index 4827a359d..9810bc554 100644
> >> --- a/man2/perf_event_open.2
> >> +++ b/man2/perf_event_open.2
> >> @@ -97,6 +97,8 @@ when running on the specified CPU.
> >>  .BR "pid == \-1" " and " "cpu >= 0"
> >>  This measures all processes/threads on the specified CPU.
> >>  This requires
> >> +.B CAP_PERFMON
> >> +(since Linux 5.8) or
> >>  .B CAP_SYS_ADMIN
> >>  capability or a
> >>  .I /proc/sys/kernel/perf_event_paranoid
> >> @@ -108,9 +110,11 @@ This setting is invalid and will return an error.
> >>  When
> >>  .I pid
> >>  is greater than zero, permission to perform this system call
> >> -is governed by a ptrace access mode
> >> +is governed by
> >> +.B CAP_PERFMON
> >> +(since Linux 5.9) and a ptrace access mode
> >
> > I want to check: did you really mean 5.9 here? (Everywhere else,
> > 5.8 is mentioned, but perhaps this change came in the next kernel
> > version.)
>
> Yes, it is not a typo. This thing was merged into v5.9.
>
> Thanks,
> Alexei

Thanks, Alexei!





Re: [PATCH v2] perf_event_open.2: update the man page with CAP_PERFMON related information

2020-10-27 Thread Michael Kerrisk (man-pages)
Hello Alexey,

On 10/27/20 5:48 PM, Alexey Budankov wrote:
> 
> Extend perf_event_open 2 man page with the information about
> CAP_PERFMON capability designed to secure performance monitoring
> and observability operation in a system according to the principle
> of least privilege [1] (POSIX IEEE 1003.1e, 2.2.2.39).
> 
> [1] https://sites.google.com/site/fullycapable/, posix_1003.1e-990310.pdf
> 
> Signed-off-by: Alexey Budankov 

Thanks for this. I've applied. I have a few questions/comments below.

> ---
>  man2/perf_event_open.2 | 32 ++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
> index 4827a359d..9810bc554 100644
> --- a/man2/perf_event_open.2
> +++ b/man2/perf_event_open.2
> @@ -97,6 +97,8 @@ when running on the specified CPU.
>  .BR "pid == \-1" " and " "cpu >= 0"
>  This measures all processes/threads on the specified CPU.
>  This requires
> +.B CAP_PERFMON
> +(since Linux 5.8) or
>  .B CAP_SYS_ADMIN
>  capability or a
>  .I /proc/sys/kernel/perf_event_paranoid
> @@ -108,9 +110,11 @@ This setting is invalid and will return an error.
>  When
>  .I pid
>  is greater than zero, permission to perform this system call
> -is governed by a ptrace access mode
> +is governed by
> +.B CAP_PERFMON
> +(since Linux 5.9) and a ptrace access mode

I want to check: did you really mean 5.9 here? (Everywhere else,
5.8 is mentioned, but perhaps this change came in the next kernel 
version.)

>  .B PTRACE_MODE_READ_REALCREDS
> -check; see
> +check on older Linux versions; see
>  .BR ptrace (2).
>  .PP
>  The
> @@ -2925,6 +2929,8 @@ to hold the result.
>  This allows attaching a Berkeley Packet Filter (BPF)
>  program to an existing kprobe tracepoint event.
>  You need
> +.B CAP_PERFMON
> +(since Linux 5.8) or
>  .B CAP_SYS_ADMIN
>  privileges to use this ioctl.
>  .IP
> @@ -2967,6 +2973,8 @@ have multiple events attached to a tracepoint.
>  Querying this value on one tracepoint event returns the id
>  of all BPF programs in all events attached to the tracepoint.
>  You need
> +.B CAP_PERFMON
> +(since Linux 5.8) or
>  .B CAP_SYS_ADMIN
>  privileges to use this ioctl.
>  .IP
> @@ -3175,6 +3183,8 @@ it was expecting.
>  .TP
>  .B EACCES
>  Returned when the requested event requires
> +.B CAP_PERFMON
> +(since Linux 5.8) or
>  .B CAP_SYS_ADMIN
>  permissions (or a more permissive perf_event paranoid setting).
>  Some common cases where an unprivileged process
> @@ -3296,6 +3306,8 @@ setting is specified.
>  It can also happen, as with
>  .BR EACCES ,
>  when the requested event requires
> +.B CAP_PERFMON
> +(since Linux 5.8) or
>  .B CAP_SYS_ADMIN
>  permissions (or a more permissive perf_event paranoid setting).
>  This includes setting a breakpoint on a kernel address,
> @@ -3326,6 +3338,22 @@ The official way of knowing if
>  support is enabled is checking
>  for the existence of the file
>  .IR /proc/sys/kernel/perf_event_paranoid .
> +.PP
> +.B CAP_PERFMON
> +capability (since Linux 5.8) provides secure approach to
> +performance monitoring and observability operations in a system
> +according to the principal of least privilege (POSIX IEEE 1003.1e).
> +Accessing system performance monitoring and observability operations
> +using
> +.B CAP_PERFMON
> +rather than the much more powerful
> +.B CAP_SYS_ADMIN
> +excludes chances to misuse credentials and makes operations more secure.
> +.B CAP_SYS_ADMIN
> +usage for secure system performance monitoring and observability
> +is discouraged with respect to
> +.B CAP_PERFMON
> +capability.

Thank you for adding the above piece. That point of course
really needs to be emphasized!
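
As an illustration of the point, a program can check whether it holds
CAP_PERFMON before falling back to anything more powerful. The sketch
below queries the effective capability set with the raw capget system
call. The helper name hasEffectiveCap() is hypothetical, and the
fallback definition of CAP_PERFMON (38) is only for building against
kernel headers older than 5.8.

```c
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CAP_PERFMON
#define CAP_PERFMON 38          /* Added in Linux 5.8 */
#endif

/* Hypothetical helper: returns 1 if the calling thread has 'cap' in
   its effective set, 0 if not, and -1 on error */
int
hasEffectiveCap(int cap)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0                /* 0 means the calling thread */
    };
    struct __user_cap_data_struct data[2] = { { 0 }, { 0 } };

    if (cap < 0 || cap >= 64)
        return -1;

    if (syscall(SYS_capget, &hdr, data) == -1)
        return -1;

    /* Capabilities 0..31 are in data[0], 32..63 in data[1] */
    return (data[cap / 32].effective >> (cap % 32)) & 1;
}
```

A monitoring tool could call hasEffectiveCap(CAP_PERFMON) at startup
and report clearly when the capability is missing, instead of asking
the user to grant CAP_SYS_ADMIN.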

Thanks,

Michael




Re: [PATCH 1/2] system_data_types.7: Add 'off_t'

2020-10-27 Thread Michael Kerrisk (man-pages)
Hi Alex,

On Tue, 27 Oct 2020 at 16:25, Alejandro Colomar  wrote:
>
>
>
> On 2020-10-27 14:47, Michael Kerrisk (man-pages) wrote:
> > On 10/27/20 11:23 AM, Alejandro Colomar wrote:
> >> Hi Michael,
> >>
> >> On 2020-10-07 08:53, Michael Kerrisk (man-pages) wrote:
> >>> On 10/6/20 12:12 AM, Alejandro Colomar wrote:
> >>>> Signed-off-by: Alejandro Colomar 
> >>>
> >>> Hi Alex,
> >>>
> >>> Thanks, patch applied. And I trimmed the "See also" a little.
> >>> I'd hold off on documenting loff_t and off64_t for the
> >>> moment. As you note in another mail, the *lseek* man page
> >>> situation is a bit of a mess. I'm not yet sure what to do.
> >>
> >>
> >> I saw a TODO in the page about loff_t.
> >> Just wanted to ping you in case you forgot about it (I did).
> >
> > I didn't forget it exactly. I just don't know that I have the
> > inclination to do anything about the messy *llseek* pages.
> >
> > Thanks,
> >
> > Michael
> >
> >
>
>
> Hi Michael,
>
> I've been reading them to add loff_t and off64_t to sys_data_types.
> Now that I've read them (not too deep),
> I think that lseek64(3) is good enough,
> and maybe we should look for small details
> missing there but present on the others,
> and merge those to lseek64.3.
> And then keep links in the other pages pointing to lseek64.3.
>
> Any thoughts?

Those pages have a long history, and I confess to not understanding
all of the details of the history. Looking more closely at the pages,
I think they are good enough. Let's leave them alone. (I did apply one
patch just now.)

Thinking about it further, I don't think it's necessary to document
loff_t in system_data_types(7). No APIs in the current glibc headers
even use loff_t, as far as I can see. I'm not sure that 'off64_t'
really needs documenting there either.

Thanks,

Michael







Re: [PATCH 1/2] system_data_types.7: Add 'off_t'

2020-10-27 Thread Michael Kerrisk (man-pages)
On 10/27/20 11:23 AM, Alejandro Colomar wrote:
> Hi Michael,
> 
> On 2020-10-07 08:53, Michael Kerrisk (man-pages) wrote:
>> On 10/6/20 12:12 AM, Alejandro Colomar wrote:
>>> Signed-off-by: Alejandro Colomar 
>>
>> Hi Alex,
>>
>> Thanks, patch applied. And I trimmed the "See also" a little.
>> I'd hold off on documenting loff_t and off64_t for the
>> moment. As you note in another mail, the *lseek* man page
>> situation is a bit of a mess. I'm not yet sure what to do.
> 
> 
> I saw a TODO in the page about loff_t.
> Just wanted to ping you in case you forgot about it (I did).

I didn't forget it exactly. I just don't know that I have the
inclination to do anything about the messy *llseek* pages.

Thanks,

Michael




Inconsistent capability requirements for prctl_set_mm_exe_file()

2020-10-27 Thread Michael Kerrisk (man-pages)
Hello Nicolas, Cyrill, and others,

@Nicolas, your commit ebd6de6812387a changed the capability 
requirements for the prctl_set_mm_exe_file() operation from

ns_capable(CAP_SYS_ADMIN)

to

ns_capable(CAP_SYS_ADMIN) || ns_capable(CAP_CHECKPOINT_RESTORE).

That's fine I guess, but while looking at that change, I found
an anomaly.

The same prctl_set_mm_exe_file() functionality is also available
via the prctl() PR_SET_MM_EXE_FILE operation, which was added
by Cyrill's commit b32dfe377102ce668. However, there the 
prctl_set_mm_exe_file() operation is guarded by a check

capable(CAP_SYS_RESOURCE).

There are two things I note:

* The capability requirements are different in the two cases.
* In one case the checks are with ns_capable(), while in the 
  other case the check is with capable().

In both cases, the inconsistencies predate Nicolas's patch,
and appear to have been introduced in Kirill Tkhai's commit
4d28df6152aa3ff.

I'm not sure what is right, but those inconsistencies
seem odd, and presumably unintended. Similarly, I'm not
sure what fix, if any, should be applied. However, I thought
it worth mentioning these details, since the situation is odd
and surprising.

Thanks,

Michael



Re: [PATCH v1] perf_event_open.2: update the man page with CAP_PERFMON related information

2020-10-27 Thread Michael Kerrisk (man-pages)
Hi Alexei,

Would you be able to refresh this patch and resend please?

Thanks,

Michael

On Mon, 24 Aug 2020 at 22:17, Alexey Budankov
 wrote:
>
> Hi Michael,
>
> On 23.08.2020 20:28, Michael Kerrisk (man-pages) wrote:
> > Hello Alexey,
> >
> > Could you look at the question below and update the patch.
> >
> > On 2/17/20 9:18 AM, Alexey Budankov wrote:
> >>
> >> Extend perf_event_open 2 man page with the information about
> >> CAP_PERFMON capability designed to secure performance monitoring
> >> and observability operation in a system according to the principle
> >> of least privilege [1] (POSIX IEEE 1003.1e, 2.2.2.39).
> >>
> >> [1] https://sites.google.com/site/fullycapable/, posix_1003.1e-990310.pdf
> >>
> >> Signed-off-by: Alexey Budankov 
> >> ---
> >>   man2/perf_event_open.2 | 27 +++
> >>   1 file changed, 27 insertions(+)
> >>
> >> diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
> >> index 89d267c02..e9aab2ca1 100644
> >> --- a/man2/perf_event_open.2
> >> +++ b/man2/perf_event_open.2
> >> @@ -98,6 +98,8 @@ when running on the specified CPU.
> >>   .BR "pid == \-1" " and " "cpu >= 0"
> >>   This measures all processes/threads on the specified CPU.
> >>   This requires
> >> +.B CAP_PERFMON
> >> +or
> >>   .B CAP_SYS_ADMIN
> >>   capability or a
> >>   .I /proc/sys/kernel/perf_event_paranoid
> >> @@ -2920,6 +2922,8 @@ to hold the result.
> >>   This allows attaching a Berkeley Packet Filter (BPF)
> >>   program to an existing kprobe tracepoint event.
> >>   You need
> >> +.B CAP_PERFMON
> >> +or
> >>   .B CAP_SYS_ADMIN
> >>   privileges to use this ioctl.
> >>   .IP
> >> @@ -2962,6 +2966,8 @@ have multiple events attached to a tracepoint.
> >>   Querying this value on one tracepoint event returns the id
> >>   of all BPF programs in all events attached to the tracepoint.
> >>   You need
> >> +.B CAP_PERFMON
> >> +or
> >>   .B CAP_SYS_ADMIN
> >>   privileges to use this ioctl.
> >>   .IP
> >> @@ -3170,6 +3176,8 @@ it was expecting.
> >>   .TP
> >>   .B EACCES
> >>   Returned when the requested event requires
> >> +.B CAP_PERFMON
> >> +or
> >>   .B CAP_SYS_ADMIN
> >>   permissions (or a more permissive perf_event paranoid setting).
> >>   Some common cases where an unprivileged process
> >> @@ -3291,6 +3299,8 @@ setting is specified.
> >>   It can also happen, as with
> >>   .BR EACCES ,
> >>   when the requested event requires
> >> +.B CAP_PERFMON
> >> +or
> >>   .B CAP_SYS_ADMIN
> >>   permissions (or a more permissive perf_event paranoid setting).
> >>   This includes setting a breakpoint on a kernel address,
> >> @@ -3321,6 +3331,23 @@ The official way of knowing if
> >>   support is enabled is checking
> >>   for the existence of the file
> >>   .IR /proc/sys/kernel/perf_event_paranoid .
> >> +.PP
> >> +.B CAP_PERFMON
> >> +capability (since Linux X.Y) provides secure approach to
> >
> > What's the version?
>
> It's since Linux 5.8 .
>
> >
> >> +performance monitoring and observability operations in a system
> >> +according to the principal of least privilege (POSIX IEEE 1003.1e).
> >> +Accessing system performance monitoring and observability operations
> >> +using
> >> +.B CAP_PERFMON
> >> +capability singly, without the rest of
> >> +.B CAP_SYS_ADMIN
> >> +credentials, excludes chances to misuse the credentials and makes
> >
> > I think that wording like "using CAP_PERFMON rather than the much
> > more powerful CAP_SYS_ADMIN..."
>
> Sounds good to me like this, or similar:
>
> "Accessing system performance monitoring and observability operations
>  using CAP_PERFMON rather than the much more powerful CAP_SYS_ADMIN
>  excludes chances to misuse credentials and makes operations more
>  secure."
>
> >
> >> +the operations more secure.
> >> +.B CAP_SYS_ADMIN
> >> +usage for secure system performance monitoring and observability
> >> +is discouraged with respect to
> >> +.B CAP_PERFMON
> >> +capability.
> >>   .SH BUGS
> >>   The
> >>   .B F_SETOWN_EX
> >
> > Thanks,
> >
> > Michael
> >
>
> Thanks,
> Alexei
>
> P.S.
> I am on vacations till 08/31.
> Please expect delay in response.
>




Re: For review: seccomp_user_notif(2) manual page

2020-10-27 Thread Michael Kerrisk (man-pages)
On 10/26/20 4:54 PM, Jann Horn wrote:
> On Sun, Oct 25, 2020 at 5:32 PM Michael Kerrisk (man-pages)
>  wrote:
[...]
>> I tried applying the patch below to vanilla 5.9.0.
>> (There's one typo: s/ENOTCON/ENOTCONN).
>>
>> It seems not to work though; when I send a signal to my test
>> target process that is sleeping waiting for the notification
>> response, the process enters the uninterruptible D state.
>> Any thoughts?
> 
> Ah, yeah, I think I was completely misusing the wait API. I'll go change that.
> 
> (Btw, in general, for reports about hangs like that, it can be helpful
> to have the contents of /proc/$pid/stack. And for cases where CPUs are
> spinning, the relevant part from the output of the "L" sysrq, or
> something like that.)

Thanks for the tips!

> Also, I guess we can probably break this part of UAPI after all, since
> the only user of this interface seems to currently be completely
> broken in this case anyway? So I think we want the other
> implementation without the ->canceled_reqs logic after all.

Okay.

> I'm a bit on the fence now on whether non-blocking mode should use
> ENOTCONN or not... I guess if we returned ENOENT even when there are
> no more listeners, you'd have to disambiguate through the poll()
> revents, which would be kinda ugly?

I must confess, I'm not quite clear on which two cases you 
are trying to distinguish. Can you elaborate?

> I'll try to turn this into a proper patch submission...

Thank you!!

Cheers,

Michael




Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi Tycho,

Thanks for getting back to me.

On Mon, 26 Oct 2020 at 14:54, Tycho Andersen  wrote:
>
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> > Hi all (and especially Tycho and Sargun),
> >
> > Following review comments on the first draft (thanks to Jann, Kees,
> > Christian and Tycho), I've made a lot of changes to this page.
> > I've also added a few FIXMEs relating to outstanding API issues.
> > I'd like a second pass review of the page before I release it.
> > But also, this mail serves as a way of noting the outstanding API
> > issues.
> >
> > Tycho: I still have an outstanding question for you at [2].
> > [2] 
> > https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
>
> I don't have that thread in my inbox any more, but I can reply here:
> no, I don't know any users of this info, but I also don't anticipate
> knowing how people will all use this feature :)

Yes, but my questions were:

[[
[1] So, I think maybe I now understand what you intended with setting
POLLOUT: the notification has been received ("read") and now the
FD can be used to NOTIFY_SEND ("write") a response. Right?

[2] If that's correct, I don't have a problem with it. I just wonder:
is it useful? IOW: are there situations where the process doing the
NOTIFY_SEND might want to test for POLLOUT because it doesn't
know whether a NOTIFY_RECV has occurred?
]]

So, do I understand right in [1]? (The implication from your reply is
yes, but I want to be sure...)

For [2], my question was not about users, but *use cases*. The
question I asked myself is: why does the feature exist? Hence my
question [2] reworded: "when you designed this, did you have in mind
scenarios where the process doing the NOTIFY_SEND might need to test
for POLLOUT because it doesn't know whether a NOTIFY_RECV has
occurred?"

Thanks,

Michael




For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi all (and especially Tycho and Sargun),

Following review comments on the first draft (thanks to Jann, Kees,
Christian and Tycho), I've made a lot of changes to this page.
I've also added a few FIXMEs relating to outstanding API issues.
I'd like a second pass review of the page before I release it.
But also, this mail serves as a way of noting the outstanding API
issues.

Tycho: I still have an outstanding question for you at [2].

Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

I've shown the rendered version of the page below. The page source
currently sits in a branch at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

At this point, I'm mainly interested in feedback about the FIXMEs,
some of which relate to the text of the page itself, while the
others relate to the various outstanding API issues. The first 
FIXME provides a small opportunity for some bikeshedding :-);


Thanks,

Michael

[1] 
https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
[2] 
https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/

=

SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)

NAME
   seccomp_user_notif - Seccomp user-space notification mechanism

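   ┌─────────────────────────────────────────────────────┐
   │FIXME                                                │
   ├─────────────────────────────────────────────────────┤
   │Might "seccomp_unotify(2)" be a better name for this │
   │page?  It's slightly shorter to type, and perhaps    │
   │reads better when spoken.                            │
   └─────────────────────────────────────────────────────┘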

SYNOPSIS
   #include 
   #include 
   #include 

   int seccomp(unsigned int operation, unsigned int flags, void *args);

   #include 

   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
 struct seccomp_notif *req);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
 struct seccomp_notif_resp *resp);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);

DESCRIPTION
   This page describes the user-space notification mechanism
   provided by the Secure Computing (seccomp) facility.  As well as
   the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
   SECCOMP_RET_USER_NOTIF action value, and the
   SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
   mechanism involves the use of a number of related ioctl(2)
   operations (described below).
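
   As a rough illustration of the SECCOMP_GET_NOTIF_SIZES operation
   mentioned above, the following sketch invokes seccomp(2) through
   syscall(2), on the assumption that no library wrapper is being
   used.  The helper name getNotifSizes() is hypothetical.

```c
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for the sizes of the structures
   used by the user-space notification mechanism. Returns 0 on success,
   or -1 on error (for example, on a kernel without this operation). */
int
getNotifSizes(struct seccomp_notif_sizes *sizes)
{
    return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

   A supervisor would typically use the returned seccomp_notif and
   seccomp_notif_resp sizes when allocating the buffers passed to
   the ioctl(2) operations.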

   Overview
   In conventional usage of a seccomp filter, the decision about how
   to treat a system call is made by the filter itself.  By
   contrast, the user-space notification mechanism allows the
   seccomp filter to delegate the handling of the system call to
   another user-space process.  Note that this mechanism is
   explicitly not intended as a method implementing security policy;
   see NOTES.

   In the discussion that follows, the thread(s) on which the
   seccomp filter is installed is (are) referred to as the target,
   and the process that is notified by the user-space notification
   mechanism is referred to as the supervisor.

   A suitably privileged supervisor can use the user-space
   notification mechanism to perform actions on behalf of the
   target.  The advantage of the user-space notification mechanism
   is that the supervisor will usually be able to retrieve
   information about the target and the performed system call that
   the seccomp filter itself cannot.  (A seccomp filter is limited
   in the information it can obtain and the actions that it can
   perform because it is running on a virtual machine inside the
   kernel.)

   An overview of the steps performed by the target and the
   supervisor is as follows:

   1. The target establishes a seccomp filter in the usual manner,
  but with two differences:

  · The seccomp(2) flags argument includes the flag
SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
value of the (successful) seccomp(2) call is a new
"listening" file descriptor that can be used to receive
notifications.  Only one "listening" seccomp filter can be
installed for a thread.

┌─┐
│FIXME│
├─┤
│Is the last sentence above correct?  │
│ │
│Kees Cook (25 Oct 2020) notes:   │
│ │
│I like this 

Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi Jann,

On 10/26/20 10:32 AM, Jann Horn wrote:
> On Sat, Oct 24, 2020 at 2:53 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/17/20 2:25 AM, Jann Horn wrote:
>>> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
>>>  wrote:
> [...]
>>>> I'm not sure if I should write anything about this small UAPI
>>>> breakage in BUGS, or not. Your thoughts?
>>>
>>> Thinking about it a bit more: Any code that relies on pause() or
>>> epoll_wait() not restarting is buggy anyway, right? Because a signal
>>> could also arrive directly before entering the syscall, while
>>> userspace code is still executing? So one could argue that we're just
>>> enlarging a preexisting race. (Unless the signal handler checks the
>>> interrupted register state to figure out whether we already entered
>>> syscall handling?)
>>
>> Yes, that all makes sense.
>>
>>> If userspace relies on non-restarting behavior, it should be using
>>> something like epoll_pwait(). And that stuff only unblocks signals
>>> after we've already passed the seccomp checks on entry.
>>
>> Thanks for elaborating that detail, since as soon as you talked
>> about "enlarging a preexisting race" above, I immediately wondered
>> about sigsuspend(), pselect(), etc.
>>
>> (Mind you, I still wonder about the effect on system calls that
>> are normally nonrestartable because they have timeouts. My
>> understanding is that the kernel doesn't restart those system
>> calls because it's impossible for the kernel to restart the call
>> with the right timeout value. I wonder what happens when those
>> system calls are restarted in the scenario we're discussing.)
> 
> Ah, that's an interesting edge case...

I'm going to drop a FIXME into the page source so that
there's a reminder of this issue in the next draft of 
the page, which I'm about to send out.

[...]

Thanks for checking the other pieces, Jann.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Michael Kerrisk (man-pages)
Hello Kees,

On 10/26/20 1:19 AM, Kees Cook wrote:
> On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
>> On 10/1/20 1:39 AM, Kees Cook wrote:
>>> I'll comment more later, but I've run out of time today and I didn't see
>>> anyone mention this detail yet in the existing threads... :)
>>
>> Later never came :-). But, I hope you may have comments for the 
>> next draft, which I will send out soon.
> 
> Later is now, and Soon approaches!
> 
> I finally caught up and read through this whole thread. Thank you all
> for the bug fix[1], and I'm looking forward to more[2]. :)


> For my reply I figured I'd base it on the current draft, so here's a
> simulated quote based on the seccomp_user_notif branch of
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
> through commit 71101158fe330af5a26552447a0bb433b69e15b7
> $ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

Thanks for reviewing the latest version!

> On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
>> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual   SECCOMP_USER_NOTIF(2)
>>
>> NAME
>>seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>#include 
>>#include 
>>#include 
>>
>>int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>>#include 
>>
>>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>>  struct seccomp_notif *req);
>>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>>  struct seccomp_notif_resp *resp);
>>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
>>
>> DESCRIPTION
>>This page describes the user-space notification mechanism provided
>>by the Secure Computing (seccomp) facility.  As well as the use of
>>the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>>SECCOMP_RET_USER_NOTIF action value, and the
>>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>>mechanism involves the use of a number of related ioctl(2)
>>operations (described below).
>>
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to treat a system call is made by the filter itself.  By contrast,
>>the user-space notification mechanism allows the seccomp filter to
>>delegate the handling of the system call to another user-space
>>process.  Note that this mechanism is explicitly not intended as a
>>method of implementing security policy; see NOTES.
>>
>>In the discussion that follows, the thread(s) on which the seccomp
>>filter is installed is (are) referred to as the target, and the
>>process that is notified by the user-space notification mechanism
>>is referred to as the supervisor.
>>
>>A suitably privileged supervisor can use the user-space
>>notification mechanism to perform actions on behalf of the target.
>>The advantage of the user-space notification mechanism is that the
>>supervisor will usually be able to retrieve information about the
>>target and the performed system call that the seccomp filter
>>itself cannot.  (A seccomp filter is limited in the information it
>>can obtain and the actions that it can perform because it is
>>running on a virtual machine inside the kernel.)
>>
>>An overview of the steps performed by the target and the
>>supervisor is as follows:
>>
>>1. The target establishes a seccomp filter in the usual manner,
>>   but with two differences:
>>
>>   • The seccomp(2) flags argument includes the flag
>> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>> value  of the (successful) seccomp(2) call is a new
> 
> nit: extra space

Thanks. Fixed.

>> "listening" file descriptor that can be used to receive
>> notifications.  Only one "listening" seccomp filter can be
>> installed for a thread.
> 
> I like this limitation, but I expect that it'll need to change in the
> future. Even with LSMs, we see the need for arbitrary stacking, and the
> idea of there being only 1 supervisor will eventually break down. Right
> now there is only 1 because only container managers are using this
> feature. But if some daemon starts using it to isolate some thread,
> suddenly it might break if a

Re: For review: seccomp_user_notif(2) manual page

2020-10-25 Thread Michael Kerrisk (man-pages)
Hi Jann,

On 10/1/20 4:14 AM, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 3:52 AM Jann Horn  wrote:
>> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
>>> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
>>>> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
>>>>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) 
>>>>> wrote:
>>>>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>>>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
>>>>>>> wrote:
>>>>>>>>┌─────────────────────────────────────────────────────┐
>>>>>>>>│                        FIXME                        │
>>>>>>>>├─────────────────────────────────────────────────────┤
>>>>>>>>│From my experiments, it appears that if a            │
>>>>>>>>│SECCOMP_IOCTL_NOTIF_RECV is done after the target    │
>>>>>>>>│process terminates, then the ioctl() simply blocks   │
>>>>>>>>│(rather than returning an error to indicate that the │
>>>>>>>>│target process no longer exists).                    │
>>>>>>>
>>>>>>> Yeah, I think Christian wanted to fix this at some point,
>>>>>>
>>>>>> Do you have a pointer to that discussion? I could not find it with a
>>>>>> quick search.
>>>>>>
>>>>>>> but it's a
>>>>>>> bit sticky to do.
>>>>>>
>>>>>> Can you say a few words about the nature of the problem?
>>>>>
>>>>> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
>>>>> notify about unused filter"). So maybe there's a bug here?
>>>>
>>>> That thing only notifies on ->poll, it doesn't unblock ioctls; and
>>>> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
>>>> commit doesn't have any effect on this kind of usage.
>>>
>>> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
>>> we don't have a count of all of them, unfortunately.
>>>
>>> We could maybe look inside the wait_list, but that will probably make
>>> people angry :)
>>
>> The easiest way would probably be to open-code the semaphore-ish part,
>> and let the semaphore and poll share the waitqueue. The current code
>> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
>> entire semaphore would IMO be cleaner than that. And it's not like
>> semaphore semantics are even a good fit for this code anyway.
>>
>> Let's see... if we didn't have the existing UAPI to worry about, I'd
>> do it as follows (*completely* untested). That way, the ioctl would
>> block exactly until either there actually is a request to deliver or
>> there are no more users of the filter. The problem is that if we just
>> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
>> an event loop and don't set O_NONBLOCK will be screwed. So we'd
>> probably also have to add some stupid counter in place of the
>> semaphore's counter that we can use to preserve the old behavior of
>> returning -ENOENT once for each cancelled request. :(
>>
>> I guess this is a nice point in favor of Michael's usual complaint
>> that if there are no man pages for a feature by the time the feature
>> lands upstream, there's a higher chance that the UAPI will suck
>> forever...
> 
> And I guess this would be the UAPI-compatible version - not actually
> as terrible as I thought it might be. Do y'all want this? If so, feel
> free to either turn this into a proper patch with Co-developed-by, or
> tell me that I should do it and I'll try to get around to turning it
> into something proper.

Thanks for taking a shot at this.

I tried applying the patch below to vanilla 5.9.0.
(There's one typo: s/ENOTCON/ENOTCONN).

It seems not to work though; when I send a signal to my test
target process that is sleeping waiting for the notification
response, the process enters the uninterruptible D state.
Any thoughts?

Thanks,

Michael

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 676d4af62103..d08c453fcc2c 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -138,7 +138,7 @@ struct seccomp_kaddfd {
>   * @notifications: A list of struct seccomp_knotif elements.
>   */
>  struct notification {
> -   struct semaphore req

Re: For review: seccomp_user_notif(2) manual page

2020-10-24 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/17/20 2:25 AM, Jann Horn wrote:
> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/15/20 10:32 PM, Jann Horn wrote:
>>> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
>>>  wrote:
>>>> On 9/30/20 5:53 PM, Jann Horn wrote:
>>>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>>>  wrote:
>>>>>> I knew it would be a big ask, but below is kind of the manual page
>>>>>> I was hoping you might write [1] for the seccomp user-space notification
>>>>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>>>>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>>>>>> that also will need documenting [2]), I did :-). But of course I may
>>>>>> have made mistakes...
>>> [...]
>>>>>>3. The supervisor process will receive notification events on the
>>>>>>   listening  file  descriptor.   These  events  are  returned as
>>>>>>   structures of type seccomp_notif.  Because this structure  and
>>>>>>   its  size may evolve over kernel versions, the supervisor must
>>>>>>   first determine the size of  this  structure  using  the  sec‐
>>>>>>   comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>>>>>>   structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>>>>>>   cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>>>>>   to receive notification events.   In  addition, the supervisor
>>>>>>   allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>>>>>>   comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>>>>>>   comp_notif_resp  structure) that it will provide to the kernel
>>>>>>   (and thus the target process).
>>>>>>
>>>>>>4. The target process then performs its workload, which  includes
>>>>>>   system  calls  that  will be controlled by the seccomp filter.
>>>>>>   Whenever one of these system calls causes the filter to return
>>>>>>   the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>>>>>>   execute the system call;  instead,  execution  of  the  target
>>>>>>   process is temporarily blocked inside the kernel and a notifi‐
>>>>>
>>>>> where "blocked" refers to the interruptible, restartable kind - if the
>>>>> child receives a signal with an SA_RESTART signal handler in the
>>>>> meantime, it'll leave the syscall, go through the signal handler, then
>>>>> restart the syscall again and send the same request to the supervisor
>>>>> again. so the supervisor may see duplicate syscalls.
>>>>
>>>> So, I partially demonstrated what you describe here, for two example
>>>> system calls (epoll_wait() and pause()). But I could not exactly
>>>> demonstrate things as I understand you to be describing them. (So,
>>>> I'm not sure whether I have not understood you correctly, or
>>>> if things are not exactly as you describe them.)
>>>>
>>>> Here's a scenario (A) that I tested:
>>>>
>>>> 1. Target installs seccomp filters for a blocking syscall
>>>>    (epoll_wait() or pause(), both of which should never restart,
>>>>    regardless of SA_RESTART)
>>>> 2. Target installs SIGINT handler with SA_RESTART
>>>> 3. Supervisor is sleeping (i.e., is not blocked in
>>>>    SECCOMP_IOCTL_NOTIF_RECV operation).
>>>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>>>> 5. SIGINT gets delivered to target; handler gets called;
>>>>    ***and syscall gets restarted by the kernel***
>>>>
>>>> That last should never happen, of course, and is a result of the
>>>> combination of both the user-notify filter and the SA_RESTART flag.
>>>> If one or the other is not present, then the system call is not
>>>> restarted.
>>>>
>>>> So, as you note below, the UAPI gets broken a little.
>>>>
>>>> However, from your description above I had understood that
>>>> something like the following scenario (B) could occur:
>>>>
>>>> 1. Target inst

Re: For review: seccomp_user_notif(2) manual page

2020-10-16 Thread Michael Kerrisk (man-pages)
Hello Jann,

Thanks for your reply!

On 10/15/20 10:32 PM, Jann Horn wrote:
> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 9/30/20 5:53 PM, Jann Horn wrote:
>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>  wrote:
>>>> I knew it would be a big ask, but below is kind of the manual page
>>>> I was hoping you might write [1] for the seccomp user-space notification
>>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>>>> that also will need documenting [2]), I did :-). But of course I may
>>>> have made mistakes...
> [...]
>>>>3. The supervisor process will receive notification events on the
>>>>   listening  file  descriptor.   These  events  are  returned as
>>>>   structures of type seccomp_notif.  Because this structure  and
>>>>   its  size may evolve over kernel versions, the supervisor must
>>>>   first determine the size of  this  structure  using  the  sec‐
>>>>   comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>>>>   structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>>>>   cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>>>   to receive notification events.   In  addition, the supervisor
>>>>   allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>>>>   comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>>>>   comp_notif_resp  structure) that it will provide to the kernel
>>>>   (and thus the target process).
>>>>
>>>>4. The target process then performs its workload, which  includes
>>>>   system  calls  that  will be controlled by the seccomp filter.
>>>>   Whenever one of these system calls causes the filter to return
>>>>   the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>>>>   execute the system call;  instead,  execution  of  the  target
>>>>   process is temporarily blocked inside the kernel and a notifi‐
>>>
>>> where "blocked" refers to the interruptible, restartable kind - if the
>>> child receives a signal with an SA_RESTART signal handler in the
>>> meantime, it'll leave the syscall, go through the signal handler, then
>>> restart the syscall again and send the same request to the supervisor
>>> again. so the supervisor may see duplicate syscalls.
>>
>> So, I partially demonstrated what you describe here, for two example
>> system calls (epoll_wait() and pause()). But I could not exactly
>> demonstrate things as I understand you to be describing them. (So,
>> I'm not sure whether I have not understood you correctly, or
>> if things are not exactly as you describe them.)
>>
>> Here's a scenario (A) that I tested:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>>    (epoll_wait() or pause(), both of which should never restart,
>>    regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor is sleeping (i.e., is not blocked in
>>    SECCOMP_IOCTL_NOTIF_RECV operation).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. SIGINT gets delivered to target; handler gets called;
>>    ***and syscall gets restarted by the kernel***
>>
>> That last should never happen, of course, and is a result of the
>> combination of both the user-notify filter and the SA_RESTART flag.
>> If one or the other is not present, then the system call is not
>> restarted.
>>
>> So, as you note below, the UAPI gets broken a little.
>>
>> However, from your description above I had understood that
>> something like the following scenario (B) could occur:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>>    (epoll_wait() or pause(), both of which should never restart,
>>    regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>>    blocks).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. Supervisor gets seccomp user-space notification (i.e.,
>>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
>> 6. SIGINT gets delivered to target; handler gets called;
>>    and syscall gets restarted by the kernel
>> 7. S

Re: [PATCH 4/5] Add manpage for fsopen(2) and fsmount(2)

2020-10-16 Thread Michael Kerrisk (man-pages)
Hi David,

Another ping for these five patches please!

Cheers,

Michael

On Fri, 11 Sep 2020 at 14:44, Michael Kerrisk (man-pages)
 wrote:
>
> Hi David,
>
> A ping for these five patches please!
>
> Cheers,
>
> Michael
>
> On Wed, 2 Sep 2020 at 22:14, Michael Kerrisk (man-pages)
>  wrote:
> >
> > On Wed, 2 Sep 2020 at 18:14, David Howells  wrote:
> > >
> > > Michael Kerrisk (man-pages)  wrote:
> > >
> > > > The term "filesystem configuration context" is introduced, but never
> > > > really explained. I think it would be very helpful to have a sentence
> > > > or three that explains this concept at the start of the page.
> > >
> > > Does that need a .7 manpage?
> >
> > I was hoping a sentence or a paragraph in this page might suffice. Do
> > you think more is required?
> >
> > Cheers,
> >
> > Michael
> >
> > --
> > Michael Kerrisk
> > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > Linux/UNIX System Programming Training: http://man7.org/training/
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-15 Thread Michael Kerrisk (man-pages)
Hello Christian,

On 10/1/20 2:36 PM, Christian Brauner wrote:
> [I'm on vacation so I'll just give this a quick glance for now.]
> 
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Tycho, Sargun (and all),
>>
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces 
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
>> that also will need documenting [2]), I did :-). But of course I may 
>> have made mistakes...
>>
>> I've shown the rendered version of the page below, and would love
>> to receive review comments from you and others, and acks, etc.
>>
>> There are a few FIXMEs sprinkled into the page, including one
>> that relates to what appears to me to be a misdesign (possibly 
>> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
>> operation. I would be especially interested in feedback on that
>> FIXME, and also of course the other FIXMEs.
>>
>> The page includes an extensive (albeit slightly contrived)
>> example program, and I would be happy also to receive comments
>> on that program.
>>
>> The page source currently sits in a branch (along with the text
>> that you sent me for the seccomp(2) page) at
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>>
>> Thanks,
>>
>> Michael
>>
>> [1] 
>> https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
>> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
>> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>>
>> =
>>
>> NAME
>>seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>#include 
>>#include 
>>#include 
>>
>>int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>> DESCRIPTION
>>This  page  describes  the user-space notification mechanism pro‐
>>vided by the Secure Computing (seccomp) facility.  As well as the
>>use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>>COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>>operation  described  in  seccomp(2), this mechanism involves the
>>use of a number of related ioctl(2) operations (described below).
>>
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to  treat  a particular system call is made by the filter itself.
>>The user-space notification mechanism allows the handling of  the
>>system  call  to  instead  be handed off to a user-space process.
> 
> "In contrast, the user notification mechanism allows the handling of
> the system call of one process (target) to be delegated to another
> user-space process (supervisor)."?

Thanks. I've reworded similarly to what you suggest.

>>The advantages of doing this are that, by contrast with the  sec‐
>>comp  filter,  which  is  running on a virtual machine inside the
>>kernel, the user-space process has access to information that  is
>>unavailable to the seccomp filter and it can perform actions that
>>can't be performed from the seccomp filter.
> 
> This section is a bit difficult to read imho:
> "A suitably privileged supervisor can use the user notification
> mechanism to perform actions in lieu of the target. The supervisor will
> usually be able to retrieve information about the target and the
> performed system call that the seccomp filter itself cannot."

Thanks. Again I've done some rewording.

>>In the discussion that follows, the process  that  has  installed
>>the  seccomp filter is referred to as the target, and the process
>>that is notified by  the  user-space  notification  mechanism  is
>>referred  to  as  the  supervisor.  An overview of the steps per‐
>>formed by these two processes is as follows:

After the various rewordings, the opening paragraphs now read:

   In conventional usage of a seccomp filter, the decision about  how
   to treat a system call is made by the filter itself.  By contrast,
   the user-space notification mechanism allows the seccomp filter to
   delegate  the  handling  of  the system call to another user-space
   process.

   In the discussion that follows, the thread(s) on which the seccomp
 

Re: For review: seccomp_user_notif(2) manual page

2020-10-15 Thread Michael Kerrisk (man-pages)
Hi Jann,

So, first off, thank you for the detailed review. I really 
appreciate it! I've changed various pieces, and still have
a few questions below.

On 9/30/20 5:53 PM, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>  wrote:
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>> that also will need documenting [2]), I did :-). But of course I may
>> have made mistakes...
> [...]
>> NAME
>>seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>#include 
>>#include 
>>#include 
>>
>>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
> Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
> of the ioctl_* manpages?

Yes, good idea. I added:

   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
 struct seccomp_notif *req);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
 struct seccomp_notif_resp *req);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
>> DESCRIPTION
>>This  page  describes  the user-space notification mechanism pro‐
>>vided by the Secure Computing (seccomp) facility.  As well as the
>>use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>>COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>>operation  described  in  seccomp(2), this mechanism involves the
>>use of a number of related ioctl(2) operations (described below).
>>
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to  treat  a particular system call is made by the filter itself.
>>The user-space notification mechanism allows the handling of  the
>>system  call  to  instead  be handed off to a user-space process.
>>The advantages of doing this are that, by contrast with the  sec‐
>>comp  filter,  which  is  running on a virtual machine inside the
>>kernel, the user-space process has access to information that  is
>>unavailable to the seccomp filter and it can perform actions that
>>can't be performed from the seccomp filter.
>>
>>In the discussion that follows, the process  that  has  installed
>>the  seccomp filter is referred to as the target, and the process
> 
> Technically, this definition of "target" is a bit inaccurate because:
> 
>  - seccomp filters are inherited
>  - seccomp filters apply to threads, not processes
>  - seccomp filters can be semi-remotely installed via TSYNC

(Nice summary.)

> (I assume that in manpages, we should try to go for the "a task is a
> thread and a thread group is a process" definition, right?)

Exactly.

> Perhaps "the threads on which the seccomp filter is installed are
> referred to as the target", or something like that would be better?

Thanks. It's always hugely helpful to get a suggested wording, even
if I still feel the need to rework it (which I don't in this case).
The sentence now reads:

   In the discussion that follows, the thread(s) on which the seccomp
   filter is installed are referred to as the target, and the process
   that is notified  by  the  user-space  notification  mechanism  is
   referred to as the supervisor.

>>that is notified by  the  user-space  notification  mechanism  is
>>referred  to  as  the  supervisor.  An overview of the steps per‐
>>formed by these two processes is as follows:
>>
>>1. The target process establishes a seccomp filter in  the  usual
>>   manner, but with two differences:
>>
>>   · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>> TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
>> the  (successful)  seccomp(2) call is a new "listening" file
>> descriptor that can be used to receive notifications.
>>
>>   · In cases where it is appropriate, the seccomp filter returns
>> the  action value SECCOMP_RET_USER_NOTIF.  This return value
>> will trigger a notification event.
>>
>>2. In order that the supervisor process can obtain  notifications
>>   using  the  listening  file  descriptor, (a duplicate of) that
>>   file descriptor must 

Re: For review: seccomp_user_notif(2) manual page

2020-10-15 Thread Michael Kerrisk (man-pages)
Hello Kees,

On 10/1/20 1:39 AM, Kees Cook wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> [...] I did :-)
> 
> Yay! Thank you!

You're welcome :-)

>> [...]
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to  treat  a particular system call is made by the filter itself.
>>The user-space notification mechanism allows the handling of  the
>>system  call  to  instead  be handed off to a user-space process.
>>The advantages of doing this are that, by contrast with the  sec‐
>>comp  filter,  which  is  running on a virtual machine inside the
>>kernel, the user-space process has access to information that  is
>>unavailable to the seccomp filter and it can perform actions that
>>can't be performed from the seccomp filter.
> 
> I might clarify a bit with something like (though maybe the
> target/supervisor paragraph needs to be moved to the start):
> 
>   This is used for performing syscalls on behalf of the target,
>   rather than having the supervisor make security policy decisions
>   about the syscall, which would be inherently race-prone. The
>   target's syscall should either be handled by the supervisor or
>   allowed to continue normally in the kernel (where standard security
>   policies will be applied).

You, Christian, and Jann all pulled me up on this point. And thanks; 
I'm going to use some of your words above. See my reply to Jann, sent
at about the same time as this reply. Please take a look at the text
in my reply to Jann, and let me know what you think.

> I'll comment more later, but I've run out of time today and I didn't see
> anyone mention this detail yet in the existing threads... :)

Later never came :-). But, I hope you may have comments for the 
next draft, which I will send out soon.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-14 Thread Michael Kerrisk (man-pages)
On 10/1/20 7:12 PM, Christian Brauner wrote:
> On Thu, Oct 01, 2020 at 10:58:50AM -0600, Tycho Andersen wrote:
>> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
>>> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
>>>  wrote:
>>>> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
>>>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>>>  wrote:
>>>>>> NOTES
>>>>>>The file descriptor returned when seccomp(2) is employed with the
>>>>>>SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>>>>>>poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>>>>>>ing,  these interfaces indicate that the file descriptor is read‐
>>>>>>able.
>>>>>
>>>>> We should probably also point out somewhere that, as
>>>>> include/uapi/linux/seccomp.h says:
>>>>>
>>>>>  * Similar precautions should be applied when stacking
>>>>>  * SECCOMP_RET_USER_NOTIF or SECCOMP_RET_TRACE. For
>>>>>  * SECCOMP_RET_USER_NOTIF filters acting on the same syscall, the most
>>>>>  * recently added filter takes precedence. This means that the new
>>>>>  * SECCOMP_RET_USER_NOTIF filter can override any
>>>>>  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>>>>>  * such filtered syscalls to be executed by sending the response
>>>>>  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can
>>>>>  * equally be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>>>>>
>>>>> In other words, from a security perspective, you must assume that the
>>>>> target process can bypass any SECCOMP_RET_USER_NOTIF (or
>>>>> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
>>>>> calling seccomp(). This should also be noted over in the main
>>>>> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>>>>
>>>> So I was actually wondering about this when I skimmed this and a while
>>>> ago but forgot about this again... Afaict, you can only ever load a
>>>> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
>>>> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
>>>> in the tasks filter hierarchy then the kernel will refuse to load a new
>>>> one?
>>>>
>>>> static struct file *init_listener(struct seccomp_filter *filter)
>>>> {
>>>> struct file *ret = ERR_PTR(-EBUSY);
>>>> struct seccomp_filter *cur;
>>>>
>>>> for (cur = current->seccomp.filter; cur; cur = cur->prev) {
>>>> if (cur->notif)
>>>> goto out;
>>>> }
>>>>
>>>> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
>>>> override each other for the same task simply because there can only ever
>>>> be a single one?
>>>
>>> Good point. Except that that check seems ineffective because this
>>> happens before we take the locks that guard against TSYNC, and also
>>> before we decide to which existing filter we want to chain the new
>>> filter. So if two threads race with TSYNC, I think they'll be able to
>>> chain two filters with listeners together.
>>
>> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
>> be totally effective,
>>
>>> I don't know whether we want to eternalize this "only one listener
>>> across all the filters" restriction in the manpage though, or whether
>>> the man page should just say that the kernel currently doesn't support
>>> it but that security-wise you should assume that it might at some
>>> point.
>>
>> This requirement originally came from Andy, arguing that the semantics
>> of this were/are confusing, which still makes sense to me. Perhaps we
>> should do something like the below?
> 
> I think we should either keep up this restriction and then cement it in
> the manpage or add a flag to indicate that the notifier is
> non-overridable.
> I don't care about the default too much, i.e. whether it's overridable
> by default and exclusive if opting in or the other way around doesn't
> matter too much. But from a supervisor's perspectiv

Re: For review: seccomp_user_notif(2) manual page

2020-10-14 Thread Michael Kerrisk (man-pages)
Hi Tycho,

Ping on the question below!

Thanks,

Michael

On 10/1/20 9:45 AM, Michael Kerrisk (man-pages) wrote:
> On 10/1/20 1:03 AM, Tycho Andersen wrote:
>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>>> Hi Tycho,
>>>
>>> Thanks for taking time to look at the page!
>>>
>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
>>>> wrote:
> 
> [...]
> 
>>>>>┌─────────────────────────────────────────────────────┐
>>>>>│FIXME                                                │
>>>>>├─────────────────────────────────────────────────────┤
>>>>>│Interestingly, after the event  had  been  received, │
>>>>>│the  file descriptor indicates as writable (verified │
>>>>>│from the source code and by experiment). How is this │
>>>>>│useful?                                              │
>>>>
>>>> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
>>>> reasonable.
>>>
>>> No, I'm saying something more fundamental: why is the FD indicating as
>>> writable? Can you write something to it? If yes, what? If not, then
>>> why do these APIs want to say that the FD is writable?
>>
>> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
>> NOTIFY_SEND are reading and writing events from the fd. I don't know
>> that much about the poll interface though -- is it possible to
>> indicate "here's a pseudo-read event"? It didn't look like it, so I
>> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.
> 
> I think the POLLIN thing is fine.
> 
> So, I think maybe I now understand what you intended with setting
> POLLOUT: the notification has been received ("read") and now the
> FD can be used to NOTIFY_SEND ("write") a response. Right?
> 
> If that's correct, I don't have a problem with it. I just wonder:
> is it useful? IOW: are there situations where the process doing the
> NOTIFY_SEND might want to test for POLLOUT because it doesn't
> know whether a NOTIFY_RECV has occurred? 
> 
> Thanks,
> 
> Michael
> 
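For readers following the POLLIN/POLLOUT discussion above, here is a
minimal sketch of how a supervisor might ask poll(2) which of the two
conditions a file descriptor currently reports. It is only an
illustration: the helper name pollReadiness() is hypothetical, and the
sketch works on any pollable FD. Obtaining a real listening FD via
seccomp(2) with SECCOMP_FILTER_FLAG_NEW_LISTENER is assumed to have
happened elsewhere and is not shown.

```c
#include <poll.h>
#include <unistd.h>

/* Ask poll(2) about both POLLIN and POLLOUT on 'fd' without blocking
   (timeout 0).  Returns the 'revents' mask, or -1 on error.
   (Hypothetical helper, not part of any seccomp API.) */
static int
pollReadiness(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };

    if (poll(&pfd, 1, 0) == -1)
        return -1;

    return pfd.revents;
}
```

A caller would test the result with (mask & POLLIN) to decide whether a
NOTIFY_RECV is worthwhile, and with (mask & POLLOUT) for the "a
notification has been received and a response can be sent" state
discussed above.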




Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs

2020-10-13 Thread Michael Kerrisk (man-pages)
Hello Linus,

On 10/13/20 12:30 AM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
>  wrote:
>>
>> I don't think this is correct. The epoll(7) manual page
>> still carries the text written long ago by Davide Libenzi,
>> the creator of epoll:
>>
>> Since  even with edge-triggered epoll, multiple events can be gen‐
>> erated upon receipt of multiple chunks of data, the caller has the
>> option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>> the associated file descriptor after the receipt of an event  with
>> epoll_wait(2).
> 
> Hmm.
> 
> The more I read that paragraph, the more I think the epoll man-page
> really talks about something that _could_ happen due to internal
> implementation details, but that isn't really something an epoll user
> would _want_ to happen or depend on.
> 
> IOW, in that whole "even with edge-triggered epoll, multiple events
> can be generated", I'd emphasize the *can* part (as in "might", not as
> in "will"), and my reading is that the reason EPOLLONESHOT flag exists
> is to avoid that whole "this is implementation-defined, and if you
> absolutely _must_ get just a single event, you need to use
> EPOLLONESHOT to make sure you remove yourself after you got the one
> single event you waited for".

I agree that that is also a valid alternate reading of the text, 
in particular, "can" could be read as "might" rather than "will".

I also agree that the semantics before the change were odd
(but see [3]).

But...

> The corollary of that reading is that the new pipe behavior is
> actually the _expected_ one, and the old pipe behavior where we would
> generate multiple events is the unwanted implementation detail of
> "this might still happen, and if you care, you will need to do extra
> stuff".

"expected" by who? I mean, there were established semantics
for pipes/FIFOs in this scenario. Those semantics changed in
Linux 5.5.

However, those established EPOLLET semantics are still (I tested
each of these) followed by:

* Sockets (tested in Internet domain)
* Terminals
* POSIX message queues
* Hierarchical epoll instances; for example:
  - epoll FD X monitors epoll FD Y with EPOLLET
  - epoll FD Y monitors two FDs, A and B, for EPOLLIN
  - input arrives on FD A
  - epoll_wait on X returns EPOLLIN for FD Y
  - next epoll_wait on X doesn't inform us that Y is ready
  - input arrives on B
  - epoll_wait on X returns EPOLLIN for FD Y
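
The hierarchical arrangement in the last list item can be set up
roughly as follows. This is a sketch rather than the program used for
the testing reported above: the helper names are hypothetical, and the
FDs A and B are assumed to be the read ends of two pipes created by
the caller.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Add 'fd' to the interest list of 'epfd' with the given event mask.
   Returns 0 on success, -1 on error. */
static int
addToEpoll(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev;

    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Build the hierarchy: '*inner' (Y) monitors 'a' and 'b' for EPOLLIN;
   '*outer' (X) monitors Y with EPOLLIN | EPOLLET.
   Returns 0 on success, -1 on error. */
static int
buildHierarchy(int *outer, int *inner, int a, int b)
{
    *inner = epoll_create1(0);
    *outer = epoll_create1(0);
    if (*inner == -1 || *outer == -1)
        return -1;

    if (addToEpoll(*inner, a, EPOLLIN) == -1 ||
            addToEpoll(*inner, b, EPOLLIN) == -1 ||
            addToEpoll(*outer, *inner, EPOLLIN | EPOLLET) == -1)
        return -1;

    return 0;
}

/* Nonblocking epoll_wait() for at most one event.  Returns the number
   of ready FDs (0 or 1), or -1 on error. */
static int
waitNoBlock(int epfd)
{
    struct epoll_event ev;

    return epoll_wait(epfd, &ev, 1, 0);
}
```

Writing a byte to the pipe behind A, calling waitNoBlock() on X twice,
then writing a byte to the pipe behind B and calling waitNoBlock() on X
once more walks through the sequence in the list.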

I would say that users *expect* at least the following:

* That semantics don't change unexpectedly.
* That semantics are consistent.

In Linux 5.5, the pipe EPOLLET semantics changed unexpectedly.
And now, pipes have EPOLLET semantics that are inconsistent with
every other type of FD (that I tested).

> Anyway, I don't absolutely hate that patch of mine, but it does seem
> nonsensical and pointless, and I think I'll just hold off on applying
> it until we hear of something actually breaking.

The problem is that sometimes it takes a very long time to hear
of something breaking. For example, a Linux 3.5 regression in
the POSIX message queue API was only fixed in 3.14 [1], and only
after the breakage was reported as a man-pages bug(!) a year
after the breakage.

And sometimes, if things don't get fixed soon enough, then
any fix will break new users. Thus we now have F_SETOWN_EX
(2.6.32) to do what F_SETOWN used to do before a regression
that occurred about 4 years earlier (2.6.12) (see [2]), because
reverting the F_SETOWN semantics to what they originally 
were might have broken some new apps that had appeared in
those four years.

> Which I suspect simply won't happen. Getting two epoll notifications
> when the pipe state didn't really change in between is not something I
> can see anybody really depending on.
> 
> You _will_ get the second notification if somebody actually emptied
> the pipe in between, and you have a real new "edge".
> 
> But hey, I am continually surprised by what user space code then
> occasionally does, despite my fairly low expectations.
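
The case described in the quoted text, where the pipe is emptied
between the two writes and so a real new edge occurs, can be sketched
like this. The function names are hypothetical and error handling is
kept minimal.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Create an epoll instance that monitors 'fd' with EPOLLIN | EPOLLET.
   Returns the epoll FD, or -1 on error. */
static int
watchEdgeTriggered(int fd)
{
    struct epoll_event ev;
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Write a byte, collect the first notification, drain the pipe, write
   a second byte, and return the result of the next epoll_wait()
   (number of ready FDs), or -1 on error. */
static int
notifyAfterDrain(int readFd, int writeFd)
{
    struct epoll_event ev;
    char buf;
    int ret;
    int epfd = watchEdgeTriggered(readFd);

    if (epfd == -1)
        return -1;

    if (write(writeFd, "a", 1) != 1 ||
            epoll_wait(epfd, &ev, 1, 0) != 1 ||     /* First edge */
            read(readFd, &buf, 1) != 1 ||           /* Pipe empty again */
            write(writeFd, "b", 1) != 1) {
        close(epfd);
        return -1;
    }

    ret = epoll_wait(epfd, &ev, 1, 0);              /* Second edge */
    close(epfd);
    return ret;
}
```

Because the pipe goes from empty to non-empty on the second write, this
sequence does not depend on the behavior change under discussion.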

Yes, user space code does surprising things. But, give people
enough time and every detail of API behavior will come
to be depended upon by someone. We don't know if anyone
depends on the old pipe EPOLLET behavior. I also imagine the
chances are small, but if users do depend on it, they are
in for an unpleasant surprise (missed notifications).

We can all agree that the existing EPOLLET semantics are perhaps strange.
However, why change these semantics just for pipes? In other
words, given my notes above about consistency, what is the
argument for not applying the patch? IOW, I think "consistency"
is a rather stronger argument than "but it seems nonsensical
and pointless"

Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs

2020-10-12 Thread Michael Kerrisk (man-pages)
On 10/12/20 10:52 PM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
>  wrote:
>>
>> [CC += Davide]
> 
> I'm not sure how active Davide is any more..

Yep, I know. But just in case.

>> I don't think this is correct. The epoll(7) manual page
>> still carries the text written long ago by Davide Libenzi,
>> the creator of epoll:
>>
>> Since  even with edge-triggered epoll, multiple events can be gen‐
>> erated upon receipt of multiple chunks of data, the caller has the
>> option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>> the associated file descriptor after the receipt of an event  with
>> epoll_wait(2).
>>
>> My reading of that text is that in the scenario that I describe a
>> readiness notification should be generated at step 3 (and indeed
>> should be generated whenever additional data bleeds into the channel).
> 
> Hmm.
> 
> That is unfortunate, because it basically exposes an internal wait
> queue implementation decision, not actual real semantics.

I don't disagree that the longstanding semantics are a little odd;
your comment explains perhaps why.

> I suspect it's easy enough to "fix" the regression with the attached
> patch. It's pretty nonsensical, but I guess there's not a lot of
> downside - if the pipe wasn't empty, there normally shouldn't be any
> non-epoll readers anyway.
> 
> I'm busy merging, mind testing this odd patch out? It is _entirely_
> untested, but from the symptoms I think it's the obvious fix.

Applied against current master (13cb73490f475). My test now
runs as I expected.

> I did the same thing for the "reader starting out from a full pipe" case too.

I haven't tested this, but thanks for thinking of it.

Cheers,

Michael




Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs

2020-10-12 Thread Michael Kerrisk (man-pages)
[CC += Davide]

Hello Linus,

Thanks for your quick reply.

On 10/12/20 9:25 PM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 11:40 AM Michael Kerrisk (man-pages)
>  wrote:
>>
>> Between Linux 5.4 and 5.5 a regression was introduced in the operation
>> of the epoll EPOLLET flag. From some manual bisecting, the regression
>> appears to have been introduced in
>>
>>  commit 1b6b26ae7053e4914181eedf70f2d92c12abda8a
>>  Author: Linus Torvalds 
>>  Date:   Sat Dec 7 12:14:28 2019 -0800
>>
>>  pipe: fix and clarify pipe write wakeup logic
>>
>> (I also built a kernel from the immediately preceding commit, and did
>> not observe the regression.)
> 
> So the difference from that commit is that now we only wake up a
> reader of a pipe when we add data to it AND IT WAS EMPTY BEFORE.
> 
>> The aim of ET (edge-triggered) notification is that epoll_wait() will
>> tell us a file descriptor is ready only if there has been new activity
>> on the FD since we were last informed about the FD. So, in the
>> following scenario where the read end of a pipe is being monitored
>> with EPOLLET, we see:
>>
>> [Write a byte to write end of pipe]
>> 1. Call epoll_wait() ==> tells us pipe read end is ready
>> 2. Call epoll_wait() [again] ==> does not tell us that the read end of
>> pipe is ready
> 
> Right.
> 
>> If we go further:
>>
>> [Write another byte to write end of pipe]
>> 3. Call epoll_wait() ==> tells us pipe read end is ready
> 
> No.
> 
> The "read end" readiness has not changed. It was ready before, it's
> ready now, there's no change in readiness.
> 
> Now, the old pipe behavior was that it would wake up writers whether
> they needed it or not, so epoll got woken up even if the readiness
> didn't actually change.
> 
> So we do have a change in behavior.
> 
> However, clearly your test is wrong, and there is no edge difference.
> 
> Now, if this is more than just a buggy test - and it actually breaks
> some actual application and real behavior - we'll need to fix it. A
> regression is a regression, and we'll need to be bug-for-bug
> compatible for people who depended on bugs.

I don't think this is correct. The epoll(7) manual page
still carries the text written long ago by Davide Libenzi,
the creator of epoll:

Since  even with edge-triggered epoll, multiple events can be gen‐
erated upon receipt of multiple chunks of data, the caller has the
option  to specify the EPOLLONESHOT flag, to tell epoll to disable
the associated file descriptor after the receipt of an event  with
epoll_wait(2).

My reading of that text is that in the scenario that I describe a
readiness notification should be generated at step 3 (and indeed
should be generated whenever additional data bleeds into the channel).
Indeed, the very rationale for the existence of the EPOLLONESHOT flag
is to *prevent* notifications in such circumstances. And, as I noted,
sockets and terminals do (still) behave in the way that I expect in
this scenario.
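
To illustrate the EPOLLONESHOT mechanism referred to above, here is a
minimal sketch. The helper names are hypothetical; the sketch shows
only the documented behavior that, after one event has been delivered,
the FD stays disabled until it is rearmed with EPOLL_CTL_MOD.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register (op == EPOLL_CTL_ADD) or rearm (op == EPOLL_CTL_MOD) 'fd'
   in 'epfd' for a single EPOLLIN notification.
   Returns 0 on success, -1 on error. */
static int
armOneShot(int epfd, int fd, int op)
{
    struct epoll_event ev;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev);
}

/* Nonblocking epoll_wait() for at most one event.  Returns the number
   of ready FDs (0 or 1), or -1 on error. */
static int
readyCount(int epfd)
{
    struct epoll_event ev;

    return epoll_wait(epfd, &ev, 1, 0);
}
```

After armOneShot(epfd, fd, EPOLL_CTL_ADD) and one delivered event,
further input on the FD produces no notifications from readyCount()
until armOneShot(epfd, fd, EPOLL_CTL_MOD) is called.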

So, I don't think this is a buggy test. It (still) appears to me
that this is a breakage of intended and documented behavior.
(Whether it breaks some actual application, I do not know. But
I have also seen that sometimes reports of such breakages take
a very long time to come in.)

Thanks,

Michael





Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs

2020-10-12 Thread Michael Kerrisk (man-pages)
Hello Linus,

Between Linux 5.4 and 5.5 a regression was introduced in the operation
of the epoll EPOLLET flag. From some manual bisecting, the regression
appears to have been introduced in

 commit 1b6b26ae7053e4914181eedf70f2d92c12abda8a
 Author: Linus Torvalds 
 Date:   Sat Dec 7 12:14:28 2019 -0800

 pipe: fix and clarify pipe write wakeup logic

(I also built a kernel from the immediately preceding commit, and did
not observe the regression.)

The aim of ET (edge-triggered) notification is that epoll_wait() will
tell us a file descriptor is ready only if there has been new activity
on the FD since we were last informed about the FD. So, in the
following scenario where the read end of a pipe is being monitored
with EPOLLET, we see:

[Write a byte to write end of pipe]
1. Call epoll_wait() ==> tells us pipe read end is ready
2. Call epoll_wait() [again] ==> does not tell us that the read end of
pipe is ready

(By contrast, in step 2, level-triggered notification would tell
us the read end of the pipe is ready.)

If we go further:

[Write another byte to write end of pipe]
3. Call epoll_wait() ==> tells us pipe read end is ready

The above was true until the regression. Now, step 3 does not tell us
that the pipe read end is ready, even though there is NEW input
available on the pipe. (In the analogous situation for sockets and
terminals, step 3 does (still) correctly tell us that the FD is
ready.)
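
The step 2 contrast between edge-triggered and level-triggered
notification described earlier can be sketched in isolation as
follows. This is a separate illustration, not the appended test
program, and the function name secondWaitResult() is hypothetical.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Monitor the read end of a fresh pipe with the given event mask,
   write one byte, then call epoll_wait() twice without reading the
   byte.  Returns the result of the second call (the number of ready
   FDs), or -1 on error. */
static int
secondWaitResult(unsigned int events)
{
    struct epoll_event ev;
    int pfd[2];
    int epfd;
    int ret = -1;

    if (pipe(pfd) == -1)
        return -1;

    epfd = epoll_create1(0);
    if (epfd != -1) {
        ev.events = events;
        ev.data.fd = pfd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
                write(pfd[1], "a", 1) == 1 &&
                epoll_wait(epfd, &ev, 1, 0) == 1)       /* Step 1 */
            ret = epoll_wait(epfd, &ev, 1, 0);          /* Step 2 */
        close(epfd);
    }

    close(pfd[0]);
    close(pfd[1]);
    return ret;
}
```

With EPOLLIN alone (level-triggered), the second call reports the FD
as ready again; with EPOLLIN | EPOLLET it reports nothing, because no
new activity has occurred since the first call.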

I've appended a test program below. The following are the results on
kernel 5.4.0:

$ ./pipe_epollet_test
Writing a byte to pipe()
1: OK:   ret = 1, events = [ EPOLLIN ]
2: OK:   ret = 0
Writing a byte to pipe()
3: OK:   ret = 1, events = [ EPOLLIN ]
Closing write end of pipe()
4: OK:   ret = 1, events = [ EPOLLIN EPOLLHUP ]

On current kernels, the results are as follows:

$ ./pipe_epollet_test
Writing a byte to pipe()
1: OK:   ret = 1, events = [ EPOLLIN ]
2: OK:   ret = 0
Writing a byte to pipe()
3: FAIL: ret = 0; EXPECTED: ret = 1, events = [ EPOLLIN ]
Closing write end of pipe()
4: OK:   ret = 1, events = [ EPOLLIN EPOLLHUP ]

Thanks,

Michael

=

/* pipe_epollet_test.c

   Copyright (c) 2020, Michael Kerrisk 

   Licensed under GNU GPLv2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <sys/epoll.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static void
printMask(int events)
{
    printf(" [ %s%s]",
            (events & EPOLLIN)  ? "EPOLLIN "  : "",
            (events & EPOLLHUP) ? "EPOLLHUP " : "");
}

static void
doEpollWait(int epfd, int timeout, int expectedRetval, int expectedEvents)
{
    struct epoll_event ev;
    static int callNum = 0;

    int retval = epoll_wait(epfd, &ev, 1, timeout);
    if (retval == -1) {
        perror("epoll_wait");
        return;
    }

    /* The test succeeded if (1) we got the expected return value and
       (2) when the return value was 1, we got the expected events mask */

    bool succeeded = retval == expectedRetval &&
            (expectedRetval == 0 || expectedEvents == ev.events);

    callNum++;
    printf("%d: ", callNum);

    if (succeeded)
        printf("OK:   ");
    else
        printf("FAIL: ");

    printf("ret = %d", retval);

    if (retval == 1) {
        printf(", events =");
        printMask(ev.events);
    }

    if (!succeeded) {
        printf("; EXPECTED: ret = %d", expectedRetval);
        if (expectedRetval == 1) {
            printf(", events =");
            printMask(expectedEvents);
        }
    }
    printf("\n");
}

int
main(int argc, char *argv[])
{
    int epfd;
    int pfd[2];

    epfd = epoll_create(1);
    if (epfd == -1)
        errExit("epoll_create");

    /* Create a pipe and add read end to epoll interest list */

    if (pipe(pfd) == -1)
        errExit("pipe");

    struct epoll_event ev;
    ev.data.fd = pfd[0];
    ev.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        errExit("epoll_ctl");

    /* Run some tests */

    printf("Writing a byte to pipe()\n");
    write(pfd[1], "a", 1);

    doEpollWait(epfd, 0, 1, EPOLLIN);
    doEpollWait(epfd, 0, 0, 0);

    printf("Writing a byte to pipe()\n");
    write(pfd[1], "a", 1);

    doEpollWait(epfd, 0, 1, EPOLLIN);

    printf("Closing write end of pipe()\n");
    close(pfd[1]);

    doEpollWait(epfd, 0, 1, EPOLLIN | EPOLLHUP);

    exit(EXIT_SUCCESS);
}




Re: [PATCH 2/2] off_t.3: New link to system_data_types(7)

2020-10-07 Thread Michael Kerrisk (man-pages)
On 10/6/20 12:12 AM, Alejandro Colomar wrote:
> Signed-off-by: Alejandro Colomar 

Thanks, Alex. Patch applied.

Cheers,

Michael

> ---
>  man3/off_t.3 | 1 +
>  1 file changed, 1 insertion(+)
>  create mode 100644 man3/off_t.3
> 
> diff --git a/man3/off_t.3 b/man3/off_t.3
> new file mode 100644
> index 0..db50c0f09
> --- /dev/null
> +++ b/man3/off_t.3
> @@ -0,0 +1 @@
> +.so man7/system_data_types.7
> 




Re: [PATCH 1/2] system_data_types.7: Add 'off_t'

2020-10-07 Thread Michael Kerrisk (man-pages)
On 10/6/20 12:12 AM, Alejandro Colomar wrote:
> Signed-off-by: Alejandro Colomar 

Hi Alex,

Thanks, patch applied. And I trimmed the "See also" a little.
I'd hold off on documenting loff_t and off64_t for the 
moment. As you note in another mail, the *lseek* man page
situation is a bit of a mess. I'm not yet sure what to do.

Thanks,

Michael

> ---
>  man7/system_data_types.7 | 50 
>  1 file changed, 50 insertions(+)
> 
> diff --git a/man7/system_data_types.7 b/man7/system_data_types.7
> index b8cbc8ffe..916efef08 100644
> --- a/man7/system_data_types.7
> +++ b/man7/system_data_types.7
> @@ -629,6 +629,56 @@ C99 and later; POSIX.1-2001 and later.
>  See also:
>  .BR lldiv (3)
>  .RE
> +.\"- off_t /
> +.TP
> +.I off_t
> +.RS
> +Include:
> +.IR  .
> +Alternatively,
> +.IR  ,
> +.IR  ,
> +.IR  ,
> +.IR  ,
> +.IR  ,
> +or
> +.IR  .
> +.PP
> +Used for file sizes.
> +According to POSIX,
> +this shall be a signed integer type.
> +.PP
> +Versions:
> +.I 
> +and
> +.I 
> +define
> +.I off_t
> +since POSIX.1-2008.
> +.PP
> +Conforming to:
> +POSIX.1-2001 and later.
> +.PP
> +See also:
> +.BR fallocate (2),
> +.BR lseek (2),
> +.BR mmap (2),
> +.BR mmap2 (2),
> +.BR posix_fadvise (2),
> +.BR pread (2),
> +.BR preadv (2),
> +.BR truncate (2),
> +.BR fseeko (3),
> +.BR getdirentries (3),
> +.BR lockf (3),
> +.BR posix_fallocate (3)
> +.\".PP   TODO: loff_t, off64_t
> +.\"See also the
> +.\".I loff_t
> +.\"and
> +.\".I off64_t
> +.\"types in this page.
> +.RE
>  .\"- pid_t /
>  .TP
>  .I pid_t
> 
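
As a small illustration of the description in the patch ("Used for
file sizes ... a signed integer type"), here is a sketch of obtaining
and printing an off_t value. The helper names are hypothetical. The
cast to intmax_t is used because printf(3) has no length modifier
specific to off_t.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Return the size in bytes of the file referred to by 'fd',
   or (off_t) -1 on error. */
static off_t
fileSize(int fd)
{
    struct stat sb;

    if (fstat(fd, &sb) == -1)
        return (off_t) -1;

    return sb.st_size;
}

/* Print an off_t value by converting it to intmax_t and using %jd. */
static void
printOffset(const char *label, off_t off)
{
    printf("%s = %jd\n", label, (intmax_t) off);
}
```

Since the type is signed, (off_t) -1 can serve as an error return, in
the same way that lseek(2) uses it.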




Re: Navigational corrections

2020-10-07 Thread Michael Kerrisk (man-pages)
On 10/6/20 12:08 AM, Alejandro Colomar wrote:
> Hi Michael,
> 
> On 2020-10-03 13:39, Michael Kerrisk (man-pages) wrote:
>> Hi Alex,
> [...]
>>
>> off_t would be great.
>>
>> In case you are looking for some other candidates, some others
>> that I would be interested to see go into the page would be
>>
>> fd_set
>> clock_t
>> clockid_t
>> and probably dev_t
> 
> Great!
> 
> off_t is almost done.  I think I have too many references in "See also".
> 
> I'll send you the patch, and trim as you want :)

Thanks, Alex. I'm teaching a course this week, so less active, 
I'm sorry.

Thanks,

Michael




Re: [PATCH v4 1/2] system_data_types.7: Add 'void *'

2020-10-03 Thread Michael Kerrisk (man-pages)
On 10/3/20 9:48 AM, G. Branden Robinson wrote:
> At 2020-10-03T09:10:14+0200, Michael Kerrisk (man-pages) wrote:
>> On 10/2/20 10:27 PM, Alejandro Colomar wrote:
>>> On 2020-10-02 22:14, Paul Eggert wrote:
>>>  > On 10/2/20 11:38 AM, Alejandro Colomar wrote:
>>>  >
>>>  >> .I void *
>>>  >>
>>>  >> renders with a space in between.
>>>  >
>>>  > That's odd, as "man(7)" says "All of the arguments will be
>>>  > printed next to each other without intervening spaces". I'd play
>>>  > it safe and quote the arg anyway.
>>>
>>> Oops, that's a bug in man(7).  Don't worry about it.
>>
>> I'm not sure where that text in man(7) comes from. However, for
>> clarity I would normally also use quotes in this case.
>>
>>> Michael, you might want to have a look at it.
>>>
>>> I'll also add Branden, who might have something to say about it.
>>
>> Yes, maybe Branden can add some insight.
> 
> The "short" answer[1] is that I think Alex is correct; Paul's caution is
> unwarranted and arises from confusion with the font alternation macros
> of the man(7) macro package.  Examples of the latter are .BI and .BR.
> Those set their even-numbered arguments in one font and odd-numbered
> arguments in another, with no space between them.  That suppression of
> space is the reason they exist.  With the "single-font" macros like .B
> and .I[2], if you don't want space, don't type it.
> 
> I could say more, including an annotated explanation of the groff and
> Version 7 Unix man(7) implementations of the I macro, if desired.  :)

So, perhaps change:

   All  of the arguments will be printed next to each
   other without intervening spaces, so that  the  .BR  command
   can  be used to specify a word in bold followed by a mark of
   punctuation in Roman.

to:

   For the macros that produce alternating type faces,
   the arguments will be printed next to each
   other without intervening spaces, so that  the  .BR  command
   can  be used to specify a word in bold followed by a mark of
   punctuation in Roman.

?

> [1] since as everyone knows, I struggle with brevity
> [2] I (and others) discourage use of .SM and .SB because they can't be
> distinguished from ordinary roman and bold type, respectively, on
> terminals.

So, do you think it's worth discouraging this in man(7)?

Thanks,

Michael
 




Re: Navigational corrections

2020-10-03 Thread Michael Kerrisk (man-pages)
Hi Alex,


>  >
>  > The question of 'void *' is an interesting one. It is something
>  > like a fundamental C type, and not something that comes from POSIX.
>  > But, it does appear in POSIX APIs and often details of using
>  > the type are not well understood. So, as a matter of practicality,
>  > and again since you've done the work, I am inclined to include
>  > this type in the page, just so it can be handily referred to
>  > along with all of the other types.
>  >
>  > Looking ahead (and I hope none of the above disheartens you,
>  > since you've done a lot of great work for this page),
> 
> Actually, not.
> It's good to have you tell me what is good for the man and what's not.
> Otherwise, I wouldn't know.
> I keep a branch with all of the rejected patches,
> just to have an idea of what I should not send you :-)
> 
>  > it would
>  > be good if you could provide a bit of an advance roadmap about
>  > the types that you'd like to add to the page.
> 
> Well, I didn't have a clear roadmap.
> I had some types which I clearly wanted to document,
> and they were ptrdiff_t, and ssize_t,
> which I documented in the first patches,
> and then I was finding related types,
> and also tended to document about types which I knew very well too,
> to have something useful to add to the description.
> 
> I may now start writing about off_t and related types,
> which were the ones that made me want this page.

off_t would be great.

In case you are looking for some other candidates, some others
that I would be interested to see go into the page would be

fd_set
clock_t
clockid_t
and probably dev_t


Thanks,

Michael



Re: [PATCH v5 2/2] void.3: New link to system_data_types(7)

2020-10-03 Thread Michael Kerrisk (man-pages)
Hello Alex,

On 10/2/20 9:28 PM, Alejandro Colomar wrote:
> Signed-off-by: Alejandro Colomar 

Patch applied.

And, I think we're now at a sync point.

Thanks,

Michael


> ---
>  man3/void.3 | 1 +
>  1 file changed, 1 insertion(+)
>  create mode 100644 man3/void.3
> 
> diff --git a/man3/void.3 b/man3/void.3
> new file mode 100644
> index 0..db50c0f09
> --- /dev/null
> +++ b/man3/void.3
> @@ -0,0 +1 @@
> +.so man7/system_data_types.7
> 




Re: [PATCH v5 1/2] system_data_types.7: Add 'void *'

2020-10-03 Thread Michael Kerrisk (man-pages)
Hello Alex,

On 10/2/20 9:28 PM, Alejandro Colomar wrote:
> Signed-off-by: Alejandro Colomar 

Patch applied.

Thanks,

Michael


> system_data_types.7: void *: Add info about generic function parameters and 
> return value
> 
> Reported-by: Paul Eggert 
> Reported-by: David Laight 
> Signed-off-by: Alejandro Colomar 
> 
> system_data_types.7: void *: Add info about pointer arithmetic
> 
> Reported-by: Paul Eggert 
> Reported-by: David Laight 
> Signed-off-by: Alejandro Colomar 
> 
> system_data_types.7: void *: Add Versions notes
> 
> Compatibility between function pointers and void * hasn't always been so.
> Document when that was added to POSIX.
> 
> Reported-by: Michael Kerrisk 
> Signed-off-by: Alejandro Colomar 
> 
> system_data_types.7: void *: wfix
> 
> Reported-by: Jonathan Wakely 
> Reported-by: Paul Eggert 
> Signed-off-by: Alejandro Colomar 
> ---
>  man7/system_data_types.7 | 76 ++--
>  1 file changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/man7/system_data_types.7 b/man7/system_data_types.7
> index c82d3b388..7c1198802 100644
> --- a/man7/system_data_types.7
> +++ b/man7/system_data_types.7
> @@ -679,7 +679,6 @@ See also the
>  .I uintptr_t
>  and
>  .I void *
> -.\" TODO: Document void *
>  types in this page.
>  .RE
>  .\"- lconv /
> @@ -1780,7 +1779,6 @@ See also the
>  .I intptr_t
>  and
>  .I void *
> -.\" TODO: Document void *
>  types in this page.
>  .RE
>  .\"- va_list --/
> @@ -1814,6 +1812,80 @@ See also:
>  .BR va_copy (3),
>  .BR va_end (3)
>  .RE
> +.\"- void * ---/
> +.TP
> +.I void *
> +.RS
> +According to the C language standard,
> +a pointer to any object type may be converted to a pointer to
> +.I void
> +and back.
> +POSIX further requires that any pointer,
> +including pointers to functions,
> +may be converted to a pointer to
> +.I void
> +and back.
> +.PP
> +Conversions from and to any other pointer type are done implicitly,
> +not requiring casts at all.
> +Note that this feature prevents any kind of type checking:
> +the programmer should be careful not to convert a
> +.I void *
> +value to a type incompatible to that of the underlying data,
> +because that would result in undefined behavior.
> +.PP
> +This type is useful in function parameters and return value
> +to allow passing values of any type.
> +The function will typically use some mechanism to know
> +the real type of the data being passed via a pointer to
> +.IR void .
> +.PP
> +A value of this type can't be dereferenced,
> +as it would give a value of type
> +.IR void ,
> +which is not possible.
> +Likewise, pointer arithmetic is not possible with this type.
> +However, in GNU C, pointer arithmetic is allowed
> +as an extension to the standard;
> +this is done by treating the size of a
> +.I void
> +or of a function as 1.
> +A consequence of this is that
> +.I sizeof
> +is also allowed on
> +.I void
> +and on function types, and returns 1.
> +.PP
> +The conversion specifier for
> +.I void *
> +for the
> +.BR printf (3)
> +and the
> +.BR scanf (3)
> +families of functions is
> +.BR p .
> +.PP
> +Versions:
> +The POSIX requirement about compatibility between
> +.I void *
> +and function pointers was added in
> +POSIX.1-2008 Technical Corrigendum 1 (2013).
> +.PP
> +Conforming to:
> +C99 and later; POSIX.1-2001 and later.
> +.PP
> +See also:
> +.BR malloc (3),
> +.BR memcmp (3),
> +.BR memcpy (3),
> +.BR memset (3)
> +.PP
> +See also the
> +.I intptr_t
> +and
> +.I uintptr_t
> +types in this page.
> +.RE
>  .\"/
>  .SH NOTES
>  The structures described in this manual page shall contain,
> 
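
The points made in the new 'void *' text (implicit conversion, use in
generic function parameters, no dereferencing or arithmetic in
standard C, and the p conversion specifier) can be illustrated with a
short sketch. The function names are hypothetical, and the byte-wise
access through unsigned char * is simply one portable way to avoid the
GNU C arithmetic extension mentioned in the patch.

```c
#include <stddef.h>
#include <stdio.h>

/* Generic swap: exchange two objects of 'size' bytes.  The caller's
   pointers convert to void * implicitly.  Inside the function the
   bytes are accessed through unsigned char *, because a void * cannot
   be dereferenced and standard C does not allow arithmetic on it. */
static void
swapObjects(void *a, void *b, size_t size)
{
    unsigned char *pa = a;      /* Implicit conversion; no cast needed */
    unsigned char *pb = b;

    for (size_t i = 0; i < size; i++) {
        unsigned char tmp = pa[i];
        pa[i] = pb[i];
        pb[i] = tmp;
    }
}

/* Print a pointer value using the p conversion specifier, which
   expects an argument of type void *. */
static void
printPointer(const char *name, const void *ptr)
{
    printf("%s = %p\n", name, (void *) ptr);
}
```

A call such as swapObjects(&x, &y, sizeof(x)) works for any object
type; it is the caller's job to pass the right size, since the
conversion to void * discards the type information.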




Navigational corrections (was: Re: [PATCH v2 1/2] system_data_types.7: Add 'void *')

2020-10-03 Thread Michael Kerrisk (man-pages)
Hi Alex, et al.
On 10/2/20 3:51 PM, Alejandro Colomar wrote:
> 
> Hi Jonathan,
> 
> On 2020-10-02 15:27, Jonathan Wakely wrote:
>> On Fri, 2 Oct 2020 at 14:20, Alejandro Colomar  
>> wrote:
>>>
>>>
>>>
>>> On 2020-10-02 15:06, Jonathan Wakely wrote:
>>>   > On Fri, 2 Oct 2020 at 12:31, Michael Kerrisk (man-pages)
>>>   >  wrote:
>>>   >>
>>>   >> On Fri, 2 Oct 2020 at 12:49, Jonathan Wakely 
>>> wrote:
>>>   >>>
>>>   >>> On Fri, 2 Oct 2020 at 09:28, Alejandro Colomar via Gcc
>>>  wrote:
>>>   >>>> However, it might be good that someone starts a page called
>>>   >>>> 'type_qualifiers(7)' or something like that.
>>>   >>>
>>>   >>> Who is this for? Who is trying to learn C from man pages? Should
>>>   >>> somebody stop them?
>>>   >>
>>>   >> Yes, I think so. To add context, Alex has been doing a lot of work to
>>>   >> build up the new system_data_types(7) page [1], which I think is
>>>   >> especially useful for the POSIX system data types that are used with
>>>   >> various APIs.
>>>   >
>>>   > It's definitely useful for types like struct siginfo_t and struct
>>>   > timeval, which aren't in C.
>>>
>>> Hi Jonathan,
>>>
>>> But then the line is a bit diffuse.
>>> Would you document 'ssize_t' and not 'size_t'?
>>
>> Yes. My documentation for ssize_t would mention size_t, refer to the C
>> standard, and not define it.
>>
>>> Would you not document intN_t types?
>>> Would you document stdint types, including 'intptr_t', and not 'void *'?
>>
>> I would document neither.
>>
>> I can see some small value in documenting size_t and the stdint types,
>> as they are technically defined by the libc headers. But documenting
>> void* seems very silly. It's one of the most fundamental built-in
>> parts of the C language, not an interface provided by the system.
>>
>>> I guess the basic types (int, long, ...) can be left out for now,
>>
>> I should hope so!
>>
>>> and apart from 'int' those rarely are the most appropriate types
>>> for most uses.
>>> But other than that, I would document all of the types.
>>> And even... when all of the other types are documented,
>>> it will be only a little extra effort to document those,
>>> so in the future I might consider that.
>>
>> [insert Jurassic Park meme "Your scientists were so preoccupied with
>> whether or not they could, they didn't stop to think if they should."
>> ]
>>
>> I don't see value in bloating the man-pages with information nobody
>> will ever use, and which doesn't (IMHO) belong there anyway. We seem
>> to fundamentally disagree about what the man pages are for. I don't
>> think they are supposed to teach C programming from scratch.
> 
> Agree in part.
> I'll try to think about it again.
> 
> In the meantime, I trust Michael to tell me when something is way off :)
> 
> Thanks, really!
> 
> Alex

So, I think a navigational correction is needed.

My vision was that system_data_types(7) would most usefully document 
the POSIX types, but by now there's too much of C creeping in. I have
been a little slow to react to that, and I apologize for that.
But I think we should not go in that direction.

I think it is worth having types like ssize_t and size_t in 
the page, simply because they turn up with so many of the POSIX
APIs, and people often don't understand some details of these
types (such as the necessary printf() specifiers). So, as long as
we're going to have a page about these types, it's fine by
me to include size_t and ssize_t.
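
As a quick illustration of those specifiers, here is a minimal sketch.
It assumes glibc, where ssize_t is the signed counterpart of size_t
(so that %zd matches it), and format_sizes() is a hypothetical helper
name used only for this example:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>      /* ssize_t */

/* Format a size_t and an ssize_t into buf using the 'z' length
   modifier.  Returns whatever snprintf() returns. */
int
format_sizes(char *buf, size_t buflen, size_t len, ssize_t nread)
{
    assert(buf != NULL);

    /* %zu is the specifier for size_t; %zd is for the corresponding
       signed type, which is assumed here to be ssize_t. */
    return snprintf(buf, buflen, "len=%zu nread=%zd", len, nread);
}
```

Calling format_sizes(buf, sizeof(buf), 42, -1) on a sufficiently large
buffer would leave "len=42 nread=-1" in buf.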

Types like [u]intN_t are definitely on the borderline for me. But,
they do appear in various APIs in the Linux interface (either
explicitly, or as the similar __u32, __u64, etc.). And again
many people don't understand some basic details, such as
the PRI and SCN constants, so I think it is useful to have them
briefly summarized in one place, and as long as they are already
in the page, then let's keep them.
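
A minimal sketch of how the PRI and SCN macros from <inttypes.h> are
typically used with the fixed-width types. The function name
roundtrip_fixed_width() is hypothetical and exists only for this
example:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Parse a decimal int64_t from str with an SCN macro, then format it
   together with a uint32_t using PRI macros.
   Returns the snprintf() result, or -1 if parsing fails. */
int
roundtrip_fixed_width(const char *str, uint32_t flags,
                      char *buf, size_t buflen)
{
    int64_t val;

    assert(str != NULL && buf != NULL);

    /* The macros expand to string literals that are concatenated
       with the surrounding format string. */
    if (sscanf(str, "%" SCNd64, &val) != 1)
        return -1;

    return snprintf(buf, buflen, "val=%" PRId64 " flags=0x%" PRIx32,
                    val, flags);
}
```

The point of the macros is that the format string stays correct whether
int64_t is 'long' or 'long long' on a given platform.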

I think __int128 etc definitely doesn't belong in this page.

And I'd like to back pedal a bit. I think we really shouldn't have
[u]int_fastN_t
[u]int_leastN_t
in the page. They are C details that have nothing to do with POSIX,
the kernel, or libc. Could you send me a patch to remove these
from the page? And again, my apologies for not being focused 
enough on the big picture sooner.

I don't think 'void' belongs in this page. Nor basic types
such as int, 

Re: [PATCH v4 1/2] system_data_types.7: Add 'void *'

2020-10-03 Thread Michael Kerrisk (man-pages)
On 10/2/20 10:27 PM, Alejandro Colomar wrote:
> Hi Paul,
> 
> On 2020-10-02 22:14, Paul Eggert wrote:
>  > On 10/2/20 11:38 AM, Alejandro Colomar wrote:
>  >
>  >> .I void *
>  >>
>  >> renders with a space in between.
>  >
>  > That's odd, as "man(7)" says "All of the arguments will be printed next
>  > to each other without intervening spaces". I'd play it safe and quote
>  > the arg anyway.
> 
> Oops, that's a bug in man(7).
> Don't worry about it.

I'm not sure where that text in man(7) comes from. However, for clarity
I would normally also use quotes in this case.

> Michael, you might want to have a look at it.
> 
> I'll also add Branden, who might have something to say about it.

Yes, maybe Branden can add some insight.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 1/2] system_data_types.7: Add 'void *'

2020-10-02 Thread Michael Kerrisk (man-pages)
Hi Alex,

On 10/2/20 10:48 AM, Alejandro Colomar wrote:
> Hi Michael,
> 
> On 2020-10-02 10:24, Alejandro Colomar wrote:
>> On 2020-10-01 19:32, Paul Eggert wrote:
>>  > For 'void *' you should also mention that one cannot use arithmetic on
>>  > void * pointers, so they're special in that way too.
>>
>> Good suggestion!
>>
>>  > Also, you should
>>  > warn that because one can convert from any pointer type to void * and
>>  > then to any other pointer type, it's a deliberate hole in C's
>>  > type-checking.
>>
>> Also good.  I'll talk about generic function parameters for this.
> I think the patch as is now is complete enough to be added.
> 
> So I won't rewrite it for now.
> Please review the patch as is,
> and I'll add more info to this type in the future.

Actually, I would prefer one patch series, rather than
patches on patches please. It also makes review of the overall
'void *' text easier if it's all one patch. So, if you could
squash the patches together and resubmit, that would be helpful.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 1/2] system_data_types.7: Add 'void *'

2020-10-02 Thread Michael Kerrisk (man-pages)
On Fri, 2 Oct 2020 at 12:49, Jonathan Wakely  wrote:
>
> On Fri, 2 Oct 2020 at 09:28, Alejandro Colomar via Gcc  
> wrote:
> > However, it might be good that someone starts a page called
> > 'type_qualifiers(7)' or something like that.
>
> Who is this for? Who is trying to learn C from man pages? Should
> somebody stop them?

Yes, I think so. To add context, Alex has been doing a lot of work to
build up the new system_data_types(7) page [1], which I think is
especially useful for the POSIX system data types that are used with
various APIs. With the addition of the integer types and 'void *'
things are straying somewhat from POSIX into C. I think there is value
in saying something about those types, but I'm somewhat neutral about
their inclusion in the page. But Alex has done the work, and I'm
willing to include those types in the page.

I do think that something like type_qualifiers(7) strays over the line
of what should be covered in Linux man-pages, which are primarily
about the kernel + libc APIs. [2]

Thanks,

Michael

[1] 
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man7/system_data_types.7
[2] Mind you, man-pages strayed over the line already very many years
ago with operators(7), because who ever remembers all of the C
operator precedences.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Michael Kerrisk (man-pages)
On 10/1/20 3:52 AM, Jann Horn wrote:

[...]

> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...

Thanks for saving me the trouble of saying that (again).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Michael Kerrisk (man-pages)
On 10/1/20 1:03 AM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Tycho,
>>
>> Thanks for taking time to look at the page!
>>
>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:

[...]

>>>>┌─┐
>>>>│FIXME│
>>>>├─┤
>>>>│Interestingly, after the event  had  been  received, │
>>>>│the  file descriptor indicates as writable (verified │
>>>>│from the source code and by experiment). How is this │
>>>>│useful?  │
>>>
>>> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
>>> reasonable.
>>
>> No, I'm saying something more fundamental: why is the FD indicating as
>> writable? Can you write something to it? If yes, what? If not, then
>> why do these APIs want to say that the FD is writable?
> 
> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
> NOTIFY_SEND are reading and writing events from the fd. I don't know
> that much about the poll interface though -- is it possible to
> indicate "here's a pseudo-read event"? It didn't look like it, so I
> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

I think the POLLIN thing is fine.

So, I think maybe I now understand what you intended with setting
POLLOUT: the notification has been received ("read") and now the
FD can be used to NOTIFY_SEND ("write") a response. Right?

If that's correct, I don't have a problem with it. I just wonder:
is it useful? IOW: are there situations where the process doing the
NOTIFY_SEND might want to test for POLLOUT because it doesn't
know whether a NOTIFY_RECV has occurred? 

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Michael Kerrisk (man-pages)
Hi Tycho,

Thanks for taking time to look at the page!

On 9/30/20 5:03 PM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>>2. In order that the supervisor process can obtain  notifications
>>   using  the  listening  file  descriptor, (a duplicate of) that
>>   file descriptor must be passed from the target process to  the
>>   supervisor process.  One way in which this could be done is by
>>   passing the file descriptor over a UNIX domain socket  connec‐
>>   tion between the two processes (using the SCM_RIGHTS ancillary
>>   message type described in unix(7)).   Another  possibility  is
>>   that  the  supervisor  might  inherit  the file descriptor via
>>   fork(2).
> 
> It is technically possible to inherit the fd via fork, but is it
> really that useful? The child process wouldn't be able to actually do
> the syscall in question, since it would have the same filter.

D'oh! Yes, of course.

I think I was reaching because in an earlier conversation
you replied:

[[
> 3. The "target process" passes the "listening file descriptor"
>to the "monitoring process" via the UNIX domain socket.

or some other means, it doesn't have to be with SCM_RIGHTS.
]]

So, what other means?

Anyway, I removed the sentence mentioning fork().
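
For reference, the SCM_RIGHTS route mentioned in the draft can be
sketched roughly as follows. This is only an outline under stated
assumptions: send_fd() and recv_fd() are hypothetical helper names, the
two processes are assumed to share an already connected UNIX domain
socket, and error handling is reduced to returning -1:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Ancillary data buffer, aligned as required for a struct cmsghdr. */
union fd_ctl {
    char           buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send file descriptor fd over the UNIX domain socket sockfd.
   Returns 0 on success, -1 on error. */
int
send_fd(int sockfd, int fd)
{
    struct msghdr msgh;
    struct iovec iov;
    struct cmsghdr *cmsgp;
    union fd_ctl ctl;
    char dummy = 0;

    assert(sockfd >= 0 && fd >= 0);
    memset(&msgh, 0, sizeof(msgh));
    memset(&ctl, 0, sizeof(ctl));

    /* Send one byte of ordinary data along with the descriptor. */
    iov.iov_base = &dummy;
    iov.iov_len = sizeof(dummy);
    msgh.msg_iov = &iov;
    msgh.msg_iovlen = 1;
    msgh.msg_control = ctl.buf;
    msgh.msg_controllen = sizeof(ctl.buf);

    cmsgp = CMSG_FIRSTHDR(&msgh);
    cmsgp->cmsg_level = SOL_SOCKET;
    cmsgp->cmsg_type = SCM_RIGHTS;
    cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

    return (sendmsg(sockfd, &msgh, 0) == -1) ? -1 : 0;
}

/* Receive a file descriptor sent by send_fd().
   Returns the new descriptor, or -1 on error. */
int
recv_fd(int sockfd)
{
    struct msghdr msgh;
    struct iovec iov;
    struct cmsghdr *cmsgp;
    union fd_ctl ctl;
    char dummy;
    int fd;

    memset(&msgh, 0, sizeof(msgh));
    iov.iov_base = &dummy;
    iov.iov_len = sizeof(dummy);
    msgh.msg_iov = &iov;
    msgh.msg_iovlen = 1;
    msgh.msg_control = ctl.buf;
    msgh.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sockfd, &msgh, 0) <= 0)
        return -1;

    /* Accept only a single SCM_RIGHTS message carrying one int. */
    cmsgp = CMSG_FIRSTHDR(&msgh);
    if (cmsgp == NULL || cmsgp->cmsg_level != SOL_SOCKET ||
            cmsgp->cmsg_type != SCM_RIGHTS ||
            cmsgp->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
    return fd;
}
```

In the seccomp case, the target would call send_fd() with the listening
file descriptor returned by seccomp(2), and the supervisor would call
recv_fd() on its end of the socket.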

>>   The  information  in  the notification can be used to discover
>>   the values of pointer arguments for the target process's  sys‐
>>   tem call.  (This is something that can't be done from within a
>>   seccomp filter.)  To do this (and  assuming  it  has  suitable
> 
> s/To do this/One way to accomplish this/ perhaps, since there are
> others.

Yes, thanks, done.

>>   permissions),   the   supervisor   opens   the   corresponding
>>   /proc/[pid]/mem file, seeks to the memory location that corre‐
>>   sponds to one of the pointer arguments whose value is supplied
>>   in the notification event, and reads bytes from that location.
>>   (The supervisor must be careful to avoid a race condition that
>>   can occur when doing this; see the  description  of  the  SEC‐
>>   COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>>   tion, the supervisor can access other system information  that
>>   is  visible  in  user space but which is not accessible from a
>>   seccomp filter.
>>
>>   ┌─┐
>>   │FIXME│
>>   ├─┤
>>   │Suppose we are reading a pathname from /proc/PID/mem │
>>   │for  a system call such as mkdir(). The pathname can │
>>   │be an arbitrary length. How do we know how much (how │
>>   │many pages) to read from /proc/PID/mem?  │
>>   └─┘
> 
> PATH_MAX, I suppose.

Yes, I misunderstood a fundamental detail here, as Jann 
also confirmed.
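
A rough sketch of the PATH_MAX approach, for illustration only. The
helper name read_target_path() is hypothetical, the address is assumed
to come from the args[] array of the notification, and the race check
via SECCOMP_IOCTL_NOTIF_ID_VALID that the draft page describes is
deliberately left out to keep the example short:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a null-terminated pathname located at address addr in the
   memory of process pid.  The path buffer must hold PATH_MAX bytes.
   Returns 0 on success, -1 on error. */
int
read_target_path(pid_t pid, unsigned long long addr, char *path)
{
    char memfile[64];
    ssize_t nread;
    int fd;

    assert(path != NULL);

    snprintf(memfile, sizeof(memfile), "/proc/%ld/mem", (long) pid);
    fd = open(memfile, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* Ask for at most PATH_MAX bytes.  Fewer bytes may come back if
       the string sits close to the end of a mapping. */
    nread = pread(fd, path, PATH_MAX, (off_t) addr);
    close(fd);
    if (nread <= 0)
        return -1;

    /* A valid pathname must be terminated within the bytes read. */
    if (memchr(path, '\0', (size_t) nread) == NULL)
        return -1;

    return 0;
}
```

The supervisor does not need to know the string length in advance: it
reads up to PATH_MAX bytes and then looks for the terminating null byte.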

>>┌─┐
>>│FIXME│
>>├─┤
>>│From my experiments,  it  appears  that  if  a  SEC‐ │
>>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>>│process terminates, then the ioctl()  simply  blocks │
>>│(rather than returning an error to indicate that the │
>>│target process no longer exists).│
> 
> Yeah, I think Christian wanted to fix this at some point,

Do you have a pointer to that discussion? I could not find it with a
quick search.

> but it's a
> bit sticky to do.

Can you say a few words about the nature of the problem?

In the meantime, I think this merits a note under BUGS, and
I've added one.

> Note that if you e.g. rely on fork() above, the
> filter is shared with your current process, and this notification
> would never be possible. Perhaps another reason to omit that from the
> man page.

(Yes, as noted above, I removed that sentence.)

>>SECCOMP_IOCTL_NOTIF_ID_VALID
>>   This operation can be used to check that a notification ID
>>   returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>>   is  still  valid  (i.e.,  that  the  target  process still
>>   exists).
>>
>>   The third ioctl(2) argument is a  pointer  to  the  cookie
>

For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Michael Kerrisk (man-pages)
Hi Tycho, Sargun (and all),

I knew it would be a big ask, but below is kind of the manual page
I was hoping you might write [1] for the seccomp user-space notification
mechanism. Since you didn't (and because 5.9 adds various new pieces 
such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
that also will need documenting [2]), I did :-). But of course I may 
have made mistakes...

I've shown the rendered version of the page below, and would love
to receive review comments from you and others, and acks, etc.

There are a few FIXMEs sprinkled into the page, including one
that relates to what appears to me to be a misdesign (possibly 
fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
operation. I would be especially interested in feedback on that
FIXME, and also of course the other FIXMEs.

The page includes an extensive (albeit slightly contrived)
example program, and I would be happy also to receive comments
on that program.

The page source currently sits in a branch (along with the text
that you sent me for the seccomp(2) page) at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

Thanks,

Michael

[1] 
https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
[2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

=

NAME
   seccomp_user_notif - Seccomp user-space notification mechanism

SYNOPSIS
   #include 
   #include 
   #include 

   int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION
   This  page  describes  the user-space notification mechanism pro‐
   vided by the Secure Computing (seccomp) facility.  As well as the
   use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
   COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
   operation  described  in  seccomp(2), this mechanism involves the
   use of a number of related ioctl(2) operations (described below).

   Overview
   In conventional usage of a seccomp filter, the decision about how
   to  treat  a particular system call is made by the filter itself.
   The user-space notification mechanism allows the handling of  the
   system  call  to  instead  be handed off to a user-space process.
   The advantages of doing this are that, by contrast with the  sec‐
   comp  filter,  which  is  running on a virtual machine inside the
   kernel, the user-space process has access to information that  is
   unavailable to the seccomp filter and it can perform actions that
   can't be performed from the seccomp filter.

   In the discussion that follows, the process  that  has  installed
   the  seccomp filter is referred to as the target, and the process
   that is notified by  the  user-space  notification  mechanism  is
   referred  to  as  the  supervisor.  An overview of the steps per‐
   formed by these two processes is as follows:

   1. The target process establishes a seccomp filter in  the  usual
      manner, but with two differences:

      · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
        TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
        the  (successful)  seccomp(2) call is a new "listening" file
        descriptor that can be used to receive notifications.

      · In cases where it is appropriate, the seccomp filter returns
        the  action value SECCOMP_RET_USER_NOTIF.  This return value
        will trigger a notification event.

   2. In order that the supervisor process can obtain  notifications
      using  the  listening  file  descriptor, (a duplicate of) that
      file descriptor must be passed from the target process to  the
      supervisor process.  One way in which this could be done is by
      passing the file descriptor over a UNIX domain socket  connec‐
      tion between the two processes (using the SCM_RIGHTS ancillary
      message type described in unix(7)).   Another  possibility  is
      that  the  supervisor  might  inherit  the file descriptor via
      fork(2).

   3. The supervisor process will receive notification events on the
      listening  file  descriptor.   These  events  are  returned as
      structures of type seccomp_notif.  Because this structure  and
      its  size may evolve over kernel versions, the supervisor must
      first determine the size of  this  structure  using  the  sec‐
      comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
      structure of type seccomp_notif_sizes.  The  supervisor  allo‐
      cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
      to receive notification events.   In  addition, the supervisor
      allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
  

Re: [PATCH 12/24] getgrent_r.3: Use sizeof() to get buffer size (instead of hardcoding macro name)

2020-09-29 Thread Michael Kerrisk (man-pages)
> 2.- Use sizeof() everywhere, and the macro for the initializer.
>
> pros:
> - It is valid as long as the buffer is an array.
> cons:
> - If the code gets into a function, and the buffer is then a pointer,
>it will definitively produce a silent bug.

Sigh! I just did exactly the last point in a test program I've been writing...
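
The pitfall is easy to reproduce. A minimal sketch, in which BUFLEN and
both function names are hypothetical and chosen only for this example:

```c
#include <assert.h>
#include <stddef.h>

#define BUFLEN 4096

/* In the scope where buf is declared as an array, sizeof(buf) is the
   size of the whole array. */
size_t
size_seen_with_array(void)
{
    char buf[BUFLEN];

    return sizeof(buf);         /* BUFLEN */
}

/* Once the buffer is handed to a function, buf is a pointer, and
   sizeof(buf) silently becomes the size of that pointer. */
size_t
size_seen_with_pointer(char *buf)
{
    assert(buf != NULL);

    return sizeof(buf);         /* sizeof(char *), not BUFLEN */
}
```

The second function compiles without complaint, which is what makes the
bug silent; passing the length as a separate argument avoids it.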

M


Re: [PATCH 22/24] membarrier.2: Note that glibc does not provide a wrapper

2020-09-29 Thread Michael Kerrisk (man-pages)
On 9/27/20 10:05 PM, Alejandro Colomar wrote:
> Hi Branden,
> 
> * G. Branden Robinson via linux-man:
> 
> 1)
> 
>  > .EX
>  > .B int fstat(int \c
>  > .IB fd , \~\c
>  > .B struct stat *\c
>  > .IB statbuf );
>  > .EE
> 
> 2)
> 
>  > .EX
>  > .BI "int fstat(int " fd ", struct stat *" statbuf );
>  > .EE
> 
> 3)
> 
>  > .EX
>  > .BI "int fstat(int\~" fd ", struct stat *" statbuf );
>  > .EE
> 
> I'd say number 2 is best.  Rationale: grep :)
> I agree it's visually somewhat harder, but grepping is way easier.

I'd say number 2 also. But, visually, it's the least difficult
for me.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 22/24] membarrier.2: Note that glibc does not provide a wrapper

2020-09-29 Thread Michael Kerrisk (man-pages)
Hi Branden,

On 9/27/20 7:46 AM, G. Branden Robinson wrote:
> At 2020-09-24T10:06:23+0200, Michael Kerrisk (man-pages) wrote:
>> Thanks for the interesting history, Branden!
> 
> Hi, Michael.  And you're welcome!  I often wonder if I test people's
> patience with my info dumps but I try to show my work when making
> claims.
> 
>> From time to time I wonder if the function prototypes in
>> the SYNOPSIS should also be inside .EX/.EE. Your thoughts?
> 
> I think there are trade-offs.
> 
> 1. If you want alignment, the monospaced font that .EX/.EE uses is the
>most portable way to get it.
> 2. For typeset output, you'll generally run out of line more quickly
>with a monospaced font than with the troff/man default (Times).
>_Any_ time filling is off, output should be checked to see if it
>overruns the right margin, but this point strengthens in monospace.

Yes, it's a good point. I think I'll leave this idea for now.

> Here's something that isn't a trade-off that might come as a surprise to
> some readers.
> 
> * You can still get bold and italics inside an .EX/.EE region, so you
>   can still use these to distinguish data types, variable names, and
>   what-have-you.
> 
> The idiom for achieving this is apparently not well-known among those
> who write man pages by hand, and tools that generate man(7) language
> from some other source often produce output that is so ugly as to be
> unintelligible to non-experts in *roff.
> 
> The key insights are that (A) you can still make macro calls inside an
> .EX/.EE region, and (B) you can use the \c escape to "interrupt" an
> input line and continue it on the next without introducing any
> whitespace.  For instance, looking at fstat() from your stat(2) page, I
> might write it using .EX and .EE as follows:
> 
> .EX
> .B int fstat(int \c
> .IB fd , \~\c
> .B struct stat *\c
> .IB statbuf );
> .EE
> 
> Normally in man pages, it is senseless to have any spaces before the \c
> escape, and \c is best omitted in that event.  However, when filling is
> disabled (as is the case in .EX/.EE regions), output lines break where
> the input lines do by default--\c overrides this, causing the lines to
> be joined.  (Regarding the \~, see below.)
> 
> If there is no use for roman in the line, then you could do the whole
> function signature with the .BI macro by quoting macro arguments that
> contain whitespace.

I was more or less aware of all of the above. (But the \c technique
is something that I see rarely enough that I often take a moment to
remember what it does.)
> 
> .EX
> .BI "int fstat(int " fd ", struct stat *" statbuf );
> .EE
> 
> As a matter of personal style, I find quoted space characters interior
> but adjacent to quotation marks visually confusing--it's slower for me
> to tell which parts of the line are "inside" the quotes and which
> outside--so I turn to groff's \~ non-breaking space escape (widely
> supported elsewhere) for these boundary spaces.
> 
> .EX
> .BI "int fstat(int\~" fd ", struct stat *" statbuf );
> .EE
> 
> Which of the above three models do you think would work best for the
> man-pages project?

I understand what you say about quoted interior spaces being 
a little hard to parse. But I find the \~ makes the source
less readable. Likewise, IMO, the \c technique renders the source
harder to read.

> Also, do you have any use for roman in function signatures?  I see it
> used for comments and feature test macro material, but not within
> function signatures proper.

I think you're correct. Roman only occurs in comments.

> 
> As an aside, I will admit to some unease with the heavy use of bold in
> synopses in section 2 and 3 man pages, 

It's been that way "forever" in the Linux man-pages.

> but I can marshal no historical
> argument against it.  In fact, a quick check of some Unix v7 section 2
> pages from 1979 that I have lying around (thanks to TUHS) reveals that
> Bell Labs used undifferentiated bold for the whole synopsis!
> 
> $ head -n 13 usr/man/man2/stat.2
> .TH STAT 2 
> .SH NAME
> stat, fstat \- get file status
> .SH SYNOPSIS
> .B #include 
> .br
> .B #include 
> .PP
> .B stat(name, buf)
> .br
> .B char *name;
> .br
> .B struct stat *buf;

As ever, thanks for the history!

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] man/statx: Add STATX_ATTR_DAX

2020-09-29 Thread Michael Kerrisk (man-pages)
Hello Ira,

On 9/28/20 6:42 PM, Ira Weiny wrote:
> On Mon, May 04, 2020 at 05:20:16PM -0700, 'Ira Weiny' wrote:
>> From: Ira Weiny 
>>
>> Linux 5.8 is slated to have STATX_ATTR_DAX support.
>>
>> https://lore.kernel.org/lkml/20200428002142.404144-4-ira.we...@intel.com/
>> https://lore.kernel.org/lkml/20200504161352.GA13783@magnolia/
>>
>> Add the text to the statx man page.
>>
>> Signed-off-by: Ira Weiny 
> 
> Have I sent this to the wrong list?  Or perhaps I have missed a reply.

No, it's just me being a bit slow, I'm sorry. Thank you for pinging.

> I don't see this applied to the man-pages project.[1]  But perhaps I am 
> looking
> at the wrong place?

Your patch is applied now, and pushed to kernel.org. Thanks!

Cheers,

Michael

> [1] git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
> 
>> ---
>>  man2/statx.2 | 24 
>>  1 file changed, 24 insertions(+)
>>
>> diff --git a/man2/statx.2 b/man2/statx.2
>> index 2e90f07dbdbc..14c4ab78e7bd 100644
>> --- a/man2/statx.2
>> +++ b/man2/statx.2
>> @@ -468,6 +468,30 @@ The file has fs-verity enabled.
>>  It cannot be written to, and all reads from it will be verified
>>  against a cryptographic hash that covers the
>>  entire file (e.g., via a Merkle tree).
>> +.TP
>> +.BR STATX_ATTR_DAX (since Linux 5.8)
>> +The file is in the DAX (cpu direct access) state.  DAX state attempts to
>> +minimize software cache effects for both I/O and memory mappings of this 
>> file.
>> +It requires a file system which has been configured to support DAX.
>> +.PP
>> +DAX generally assumes all accesses are via cpu load / store instructions 
>> which
>> +can minimize overhead for small accesses, but may adversely affect cpu
>> +utilization for large transfers.
>> +.PP
>> +File I/O is done directly to/from user-space buffers and memory mapped I/O 
>> may
>> +be performed with direct memory mappings that bypass kernel page cache.
>> +.PP
>> +While the DAX property tends to result in data being transferred 
>> synchronously,
>> +it does not give the same guarantees of O_SYNC where data and the necessary
>> +metadata are transferred together.
>> +.PP
>> +A DAX file may support being mapped with the MAP_SYNC flag, which enables a
>> +program to use CPU cache flush instructions to persist CPU store operations
>> +without an explicit
>> +.BR fsync(2).
>> +See
>> +.BR mmap(2)
>> +for more information.
>>  .SH RETURN VALUE
>>  On success, zero is returned.
>>  On error, \-1 is returned, and
>> -- 
>> 2.25.1
>>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 12/24] getgrent_r.3: Use sizeof() to get buffer size (instead of hardcoding macro name)

2020-09-24 Thread Michael Kerrisk (man-pages)
On 9/24/20 11:35 AM, Alejandro Colomar wrote:
> Hi,
> 
> On 2020-09-23 22:35, Michael Kerrisk (man-pages) wrote:
>> On 9/15/20 12:03 PM, Stefan Puiu wrote:
>>> Hi,
>>>
>>> On Fri, Sep 11, 2020 at 6:28 PM Alejandro Colomar
>>>  wrote:
>>>>
>>>> Hi Stefan,
>>>>
>>>> On 2020-09-11 16:35, Stefan Puiu wrote:
>>>>   > Hi,
>>>>   >
>>>>   > On Fri, Sep 11, 2020 at 12:15 AM Alejandro Colomar
>>>>   >  wrote:
>>>>   >>
>>>>   >> Signed-off-by: Alejandro Colomar 
>>>>   >> ---
>>>>   >>   man3/getgrent_r.3 | 2 +-
>>>>   >>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>   >>
>>>>   >> diff --git a/man3/getgrent_r.3 b/man3/getgrent_r.3
>>>>   >> index 81d81a851..76deec370 100644
>>>>   >> --- a/man3/getgrent_r.3
>>>>   >> +++ b/man3/getgrent_r.3
>>>>   >> @@ -186,7 +186,7 @@ main(void)
>>>>   >>
>>>>   >>   setgrent();
>>>>   >>   while (1) {
>>>>   >> -i = getgrent_r(, buf, BUFLEN, );
>>>>   >> +i = getgrent_r(, buf, sizeof(buf), );
>>>>   >
>>>>   > I'm worried that less attentive people might copy/paste parts of this
>>>>   > in their code, where maybe buf is just a pointer, and expect it to
>>>>   > work. Maybe leaving BUFLEN here is useful as a reminder that they need
>>>>   > to change something to adapt the code?
>>>>   >
>>>>   > Just my 2 cents,
>>>>   > Stefan.
>>>>   >
>>>> That's a very good point.
>>>>
>>>> So we have 3 options and I will propose now a 4th one.  Let's see all
>>>> of them and see which one is better for the man pages.
>>>>
>>>> 1.- Use the macro everywhere.
>>>>
>>>> pros:
>>>> - It is still valid when the buffer is a pointer and not an array.
>>>> cons:
>>>> - Hardcodes the initializer.  If the array is later initialized with a
>>>> different value, it may produce a silent bug, or a compilation break.
>>>>
>>>> 2.- Use sizeof() everywhere, and the macro for the initializer.
>>>>
>>>> pros:
>>>> - It is valid as long as the buffer is an array.
>>>> cons:
>>>> - If the code gets into a function, and the buffer is then a pointer,
>>>> it will definitively produce a silent bug.
>>>>
>>>> 3.- Use sizeof() everywhere, and a magic number for the initializer.
>>>>
>>>> The same as 2.
>>>>
>>>> 4.- Use ARRAY_BYTES() macro
>>>>
>>>> pros:
>>>> - It is always safe and when code changes, it may break compilation, but
>>>> never a silent bug.
>>>> cons:
>>>> - Add a few lines of code.  Maybe too much complexity for an example.
>>>> But I'd say that it is the only safe option, and in real code it
>>>> should probably be used more, so maybe it's good to show a good 
>>>> practice.
>>>
>>> If you ask me, I think examples should be simple and easy to
>>> understand, and easy to copy/paste in your code. I'd settle for easy
>>> enough, not perfect or completely foolproof. If you need to look up
>>> obscure gcc features to understand an example, that's not very
>>> helpful. So I'd be more inclined to prefer version 1 above. But let's
>>> see Michael's opinion on this.
>>>
>>> Just my 2c,
>>
>> So, the fundamental problem is that C is nearly 50 years old.
>> It's a great high-level assembly language, but when it comes
>> to nuances like this it gets pretty painful. One can do macro
>> magic of the kind you suggest, but I agree with Stefan that it
>> gets confusing and distracting for the reader. I think I also
>> lean to solution 1. Yes, it's not perfect, but it's easy to
>> understand, and I don't think we can or should try and solve
>> the broken-ness of C in the manual pages.
>>
>> Thanks,
>>
>> Michael
>>
>>
> 
> I was reverting the 3 patches I introduced (they changed from solution 1 
> to solution 2), and also was grepping for already existing solution 2 in 
> the pages (it seems that solution 2 was a bit more extended than 
> solution 1).
> 
> While doing that, I'

Re: [PATCH 12/24] getgrent_r.3: Use sizeof() to get buffer size (instead of hardcoding macro name)

2020-09-24 Thread Michael Kerrisk (man-pages)
Hi Alex,

[..]

> I was reverting the 3 patches I introduced (they changed from solution 1
> to solution 2), and also was grepping for already existing solution 2 in
> the pages (it seems that solution 2 was a bit more extended than
> solution 1).

Just so I can refresh my cache, which commits were those?

Thanks,

Michael


Re: [PATCH 22/24] membarrier.2: Note that glibc does not provide a wrapper

2020-09-24 Thread Michael Kerrisk (man-pages)
Hi Branden,

On 9/21/20 4:36 PM, G. Branden Robinson wrote:
> At 2020-09-11T12:58:08+, Walter Harms wrote:
>> the groff commands are documented in man 7 groff
>> .nf   No filling or adjusting of output-lines.
>> .fi   Fill output lines
>>
>> (for me) a typical use is like this:
>> .nf
>>
>> struct timeval {
>>     time_t      tv_sec;     /* seconds */
>>     suseconds_t tv_usec;    /* microseconds */
>> };
>> .fi
>>
>> In the top section you prevent indenting (if any).
> 
> The above will not work as desired for typesetter output, a.k.a., "troff
> devices", such as PostScript or PDF.  The initial code indent might work
> okay but the alignment of the field names will become
> ragged/mis-registered and the comments even more so.

Yes.

> This is because a proportional font is used by default for troff
> devices.  The classical man macros, going back to Version 7 Unix (1979)
> had no good solution for this problem and Unix room tradition at Murray
> Hill going all the way back to (what we now call) the First Edition
> manual in 1971 was to read the man pages on a typewriter--a Teletype
> Model 33 or Model 37.  Typewriters, of course, always[1] used monospaced
> fonts.
> 
> Version 9 Unix (1986) introduced .EX and .EE for setting material in a
> monospaced font even if the device used proportional type by default.
> (Plan 9 troff inherited them.)  GNU roff has supported .EX and .EE as
> well, for over 13 years, and its implementations are ultra-permissively
> licensed so other *roffs like Heirloom Doctools have picked them up.
> Therefore I recommend .EX and .EE for all code examples.
> 
> They are very simple to use.  In the above, simply replace ".nf" with
> ".EX" and ".fi" with ".EE".
> 
> Regards,
> Branden
> 
> [1] Not completely true; variable-pitch typewriters (such as 10/12 point
> selectable) were fairly common and some expensive models like the IBM
> Executive even featured true proportional type.

Thanks for the interesting history, Branden!

From time to time I wonder if the function prototypes in
the SYNOPSIS should also be inside .EX/.EE. Your thoughts?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 12/24] getgrent_r.3: Use sizeof() to get buffer size (instead of hardcoding macro name)

2020-09-23 Thread Michael Kerrisk (man-pages)
On 9/15/20 12:03 PM, Stefan Puiu wrote:
> Hi,
> 
> On Fri, Sep 11, 2020 at 6:28 PM Alejandro Colomar
>  wrote:
>>
>> Hi Stefan,
>>
>> On 2020-09-11 16:35, Stefan Puiu wrote:
>>  > Hi,
>>  >
>>  > On Fri, Sep 11, 2020 at 12:15 AM Alejandro Colomar
>>  >  wrote:
>>  >>
>>  >> Signed-off-by: Alejandro Colomar 
>>  >> ---
>>  >>   man3/getgrent_r.3 | 2 +-
>>  >>   1 file changed, 1 insertion(+), 1 deletion(-)
>>  >>
>>  >> diff --git a/man3/getgrent_r.3 b/man3/getgrent_r.3
>>  >> index 81d81a851..76deec370 100644
>>  >> --- a/man3/getgrent_r.3
>>  >> +++ b/man3/getgrent_r.3
>>  >> @@ -186,7 +186,7 @@ main(void)
>>  >>
>>  >>   setgrent();
>>  >>   while (1) {
>>  >> -i = getgrent_r(&grp, buf, BUFLEN, &grpp);
>>  >> +i = getgrent_r(&grp, buf, sizeof(buf), &grpp);
>>  >
>>  > I'm worried that less attentive people might copy/paste parts of this
>>  > in their code, where maybe buf is just a pointer, and expect it to
>>  > work. Maybe leaving BUFLEN here is useful as a reminder that they need
>>  > to change something to adapt the code?
>>  >
>>  > Just my 2 cents,
>>  > Stefan.
>>  >
>> That's a very good point.
>>
>> So we have 3 options and I will propose now a 4th one.  Let's see all
>> of them and see which one is better for the man pages.
>>
>> 1.- Use the macro everywhere.
>>
>> pros:
>> - It is still valid when the buffer is a pointer and not an array.
>> cons:
>> - Hardcodes the initializer.  If the array is later initialized with a
>>   different value, it may produce a silent bug, or a compilation break.
>>
>> 2.- Use sizeof() everywhere, and the macro for the initializer.
>>
>> pros:
>> - It is valid as long as the buffer is an array.
>> cons:
>> - If the code gets into a function, and the buffer is then a pointer,
>>   it will definitely produce a silent bug.
>>
>> 3.- Use sizeof() everywhere, and a magic number for the initializer.
>>
>> The same as 2.
>>
>> 4.- Use ARRAY_BYTES() macro
>>
>> pros:
>> - It is always safe: when the code changes, it may break compilation,
>>   but it never produces a silent bug.
>> cons:
>> - Adds a few lines of code.  Maybe too much complexity for an example.
>>   But I'd say that it is the only safe option, and in real code it
>>   should probably be used more, so maybe it's good to show a good practice.
> 
> If you ask me, I think examples should be simple and easy to
> understand, and easy to copy/paste in your code. I'd settle for easy
> enough, not perfect or completely foolproof. If you need to look up
> obscure gcc features to understand an example, that's not very
> helpful. So I'd be more inclined to prefer version 1 above. But let's
> see Michael's opinion on this.
> 
> Just my 2c,

So, the fundamental problem is that C is nearly 50 years old.
It's a great high-level assembly language, but when it comes
to nuances like this it gets pretty painful. One can do macro
magic of the kind you suggest, but I agree with Stefan that it
gets confusing and distracting for the reader. I think I also
lean to solution 1. Yes, it's not perfect, but it's easy to 
understand, and I don't think we can or should try and solve
the broken-ness of C in the manual pages.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
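
The trade-off discussed in this thread can be sketched in C. This is an
illustration only, not code from the patches: BUFLEN's value, the function
names, and this particular ARRAY_BYTES() definition are assumptions, and
the macro relies on GCC/Clang extensions plus C11 _Static_assert.

```c
#include <stddef.h>

#define BUFLEN 4096     /* assumed value, for illustration only */

/* Hypothetical ARRAY_BYTES(): same value as sizeof(a), but compilation
   fails if 'a' is a pointer.  Relies on GCC/Clang extensions and C11. */
#define IS_ARRAY(a) \
    (!__builtin_types_compatible_p(__typeof__(a), __typeof__(&(a)[0])))
#define ARRAY_BYTES(a) \
    (sizeof(a) + 0 * sizeof(struct { \
        _Static_assert(IS_ARRAY(a), "ARRAY_BYTES: not an array"); \
        char unused_; }))

/* Solution 2 in the scope that declares the array: sizeof() is correct. */
static size_t
size_in_declaring_scope(void)
{
    char buf[BUFLEN];

    return sizeof(buf);
}

/* Solution 2 after the code moves into a function: 'buf' is now a
   pointer, so sizeof() silently yields the pointer size.  Writing
   ARRAY_BYTES(buf) here would be rejected at compile time instead. */
static size_t
size_after_refactoring(char *buf)
{
    return sizeof(buf);
}

/* Solution 4: checked at compile time, equal to sizeof(buf) for arrays. */
static size_t
size_with_array_bytes(void)
{
    char buf[BUFLEN];

    return ARRAY_BYTES(buf);
}
```

With solution 1, passing BUFLEN stays correct in all three functions, which
is why it survives copy/paste into code where the buffer is only a pointer.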


Re: [PATCH v5 1/3] open: add close_range()

2020-09-17 Thread Michael Kerrisk (man-pages)
Hey Christian,

Could we please have a manual page for the close_range(2) syscall
that's about to land in 5.9?

Thanks,

Michael

On Wed, 3 Jun 2020 at 12:24, Michael Kerrisk (man-pages)
 wrote:
>
> Hi Christian,
>
> Could we have a manual page for this API (best before it's merged)?
>
> Thanks,
>
> Michael
>
> On Tue, 2 Jun 2020 at 22:44, Christian Brauner
>  wrote:
> >
> > > This adds the close_range() syscall. It allows efficiently closing a
> > > range of file descriptors, up to all file descriptors of a calling task.
> >
> > I've also coordinated with some FreeBSD developers who got in touch with
> > me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> > Quite a bunch of projects in userspace are waiting on this syscall
> > including Python and systemd.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> > /* that exec is sensitive */
> > unshare(CLONE_FILES);
> > /* we don't want anything past stderr here */
> > close_range(3, ~0U);
> > execve();
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > > descriptors is very helpful (e.g. daemons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> > task is multi-threaded and shares the file descriptor table with another
> > thread in which case two threads could race with one thread allocating file
> > descriptors and the other one closing them via close_range(). For the
> > general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >   - Python (cf. [2])
> >   - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > > static int close_all_fds(void)
> > > {
> > >         int dir_fd;
> > >         DIR *dir;
> > >         struct dirent *direntp;
> > >
> > >         dir = opendir("/proc/self/fd");
> > >         if (!dir)
> > >                 return -1;
> > >         dir_fd = dirfd(dir);
> > >         while ((direntp = readdir(dir))) {
> > >                 int fd;
> > >                 if (strcmp(direntp->d_name, ".") == 0)
> > >                         continue;
> > >                 if (strcmp(direntp->d_name, "..") == 0)
> > >                         continue;
> > >                 fd = atoi(direntp->d_name);
> > >                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> > >                         continue;
> > >                 close(fd);
> > >         }
> > >         closedir(dir);
> > >         return 0;
> > > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> >- close_all_fds(): ~280 us
> >- close_range():~24 us
> >
> > 2. closing 1000 open files:
> >- close_all_fd
