On Wed, Jul 01, 2026 at 11:29:01AM -0700, H. Peter Anvin wrote:
> On July 1, 2026 10:42:08 AM PDT, "Michal Suchánek" <[email protected]> wrote:
> >The return value of syscall_enter_from_user_mode is used both for the
> >adjusted syscall number and the indicator that a syscall should be
> >skipped.
> >
> >As seccomp can be invoked on any syscall, including invalid ones, this
> >somewhat undermines seccomp.
> >
> >While the seccomp variants that terminate the process do not need to
> >care about this, for the filter that sets the syscall return value this
> >distinction is required.
> >
> >Pass the syscall number as a pointer to the inline entry functions, and
> >use the return value exclusively for the indication that the syscall is
> >already handled.
> >
> >This should avoid the need for the s390 PIF_SYSCALL_RET_SET which is the
> >workaround for exactly this deficiency.
> >
> >If this is desirable the patch could be split into some series that
> >adjusts the code flow where needed so that the final change is mostly
> >mechanical.
> >
> >There is also another way to handle this problem.
> >
> >With x86 using bit 30 to denote compatibility syscall it sounds like
> >declaring syscall number a 30bit quantity would work.
> >
> >Then bit 31 could be used to denote an invalid syscall that can never be
> >executed, and the -1 returned from syscall_enter_from_user_mode would
> >then be inherently invalid.
> >
> >That is so long as no architectures use syscall numbers outside of this
> >range so far, and the limitation is considered fine.
> > 
> 
> Negative numbers must most definitely not be assigned as valid system calls,
> not now, not ever.

Negativity of a number is a matter of interpretation. Sometimes the
syscall number is declared as int, sometimes long, sometimes unsigned
long.

Passing -1 to strtoul generates some bit pattern that can then be
compared to another bit pattern inside a seccomp filter program, for
example.
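
A minimal userspace sketch of that point, with a helper name made up for
this example: strtoul accepts a leading minus sign and negates the result
in the unsigned type, so "-1" comes back as ULONG_MAX. Truncated to the 32
bits a seccomp filter loads for the syscall number, that is 0xffffffff, the
same bit pattern as (int)-1.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not taken from any existing tool: parse a syscall
 * number given as text and return its low 32 bits, which is the width a
 * seccomp BPF program compares against.
 */
static uint32_t parse_syscall_nr(const char *s)
{
	unsigned long v = strtoul(s, NULL, 0);	/* "-1" becomes ULONG_MAX */

	return (uint32_t)v;
}
```

Whether the result is read as -1 or as 4294967295 depends only on the type
the caller assigns it to.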

> Therein lies some serious madness.
> 
> I believe setting the syscall number to -1 to skip is an ABI already in e.g. 
> ptrace, so I doubt we can just get rid of it anyway. 

Yes, and seccomp can set the syscall number to -1, indicating it was
handled already, even if the number was -1 to start with. While -1 is not
a valid syscall number, it can still be filtered, at least on some
architectures.
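
The ambiguity can be modelled in a few lines. This is only a sketch with
hypothetical names (struct verdict, enter_combined, enter_split), not the
kernel's entry code: the first function folds "skip" into the returned
syscall number, the second passes the number by pointer and uses the return
value only to say that the syscall was already handled.

```c
#include <stdbool.h>

/* Hypothetical outcome of running seccomp/ptrace on syscall entry. */
struct verdict {
	bool handled;	/* return value already set, skip the syscall */
	bool rewrite;	/* the syscall number was changed */
	long new_nr;	/* new number, valid only if rewrite is set */
};

/* Combined style: -1 means "skip", so an incoming -1 looks the same. */
static long enter_combined(long nr, struct verdict v)
{
	if (v.handled)
		return -1L;
	return v.rewrite ? v.new_nr : nr;
}

/* Split style: number via pointer, return value only means "handled". */
static bool enter_split(long *nr, struct verdict v)
{
	if (v.rewrite)
		*nr = v.new_nr;
	return v.handled;
}
```

With an incoming number of -1, enter_combined returns -1 whether or not the
filter handled the call, while enter_split still tells the two cases apart.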

> I would say as follows:
> 
> Let's formally define that: 
> 
> - valid system call numbers are positive 32-bit numbers, using the 
> appropriate ABI convention for "int".
> 
> - bits [30:n] for some value of n are reserved for architecture-specific 
> flags/modes. MIPS uses an offset of 2000 decimal between its syscall ABIs, 
> which would imply n ~ 11, although I personally think that is too restrictive 
> (MIPS could in fact use such a flag to provide an escape into a larger number 
> space if we ever need more than 2000 system calls.)
> 
> I would suggest n = 24, at least for now. It is easier to give up additional 
> bits later than to claw them back when already used. 
> 
> Thus: 
> 
> 1. The type for a system call is int.
> 
> 2. A valid system call number is always going to be positive.
> 
> 3. Bits [30:24] are available for architecture ABI use. The "architecture 
> independent" part of the system call number is therefore 24 bits wide.

Will that also work correctly with seccomp?

As I understand it, the current situation on x86 is that the BPF code
passed to seccomp must itself filter on the compat syscall bit, and I do
not see how restricting the syscall value to 24 bits would happen without
changing the seccomp filter API.

See e.g. https://lore.kernel.org/linuxppc-dev/[email protected]/
for sample code.
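
For illustration only, and not the code from that link, a filter doing this
check by hand could look roughly like the sketch below. The macro name
SYSCALL_BIT30 and the choice of returning ENOSYS are assumptions of this
example; the BPF macros and struct seccomp_data come from the Linux UAPI
headers. Classic BPF comparisons are unsigned, so a number of -1
(0xffffffff) takes the same branch as a number with bit 30 set.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Name made up for this sketch; bit 30 as discussed above. */
#define SYSCALL_BIT30 0x40000000u

static struct sock_filter nr_filter[] = {
	/* Load the 32-bit syscall number from seccomp_data. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* Unsigned compare: nr >= bit 30 falls through, otherwise skip one. */
	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, SYSCALL_BIT30, 0, 1),
	/* Bit 30 or above (including -1): set the return value to -ENOSYS. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
	/* Everything else is allowed. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog nr_prog = {
	.len = sizeof(nr_filter) / sizeof(nr_filter[0]),
	.filter = nr_filter,
};
```

A real filter would normally also check seccomp_data.arch first; that is
left out here to keep the sketch short.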

> 
> 4. The exact ABI is platform-specific, obviously, but as a general guideline 
> (especially for new platforms/ABIs) should follow the rules for a platform 
> "int" if practical. Notably, when passing a value in a register larger than 
> 32 bits, which side of the calling interface is responsible for 
> sign-extending a value passed in a register. If caller side, the kernel 
> should validate, if callee side the kernel should ignore the additional bits 
> and do the extension.

Do we even want to play with sign-extend?

If the syscall number is >= 1<<n after masking off flags recognized by
the platform (if any), it's invalid.
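
As a sketch of that check, taking n = 24 from the proposal above; the names
SYSCALL_NR_BITS and syscall_nr_valid are made up for this example, and
arch_flags stands for whatever flag bits a given platform recognizes:

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSCALL_NR_BITS 24u	/* n = 24, as suggested in the quoted text */

/*
 * Treat the number as a plain 32-bit pattern: clear the flag bits this
 * platform recognizes, then require the rest to fit in n bits.
 * No sign extension is involved.
 */
static bool syscall_nr_valid(uint32_t raw, uint32_t arch_flags)
{
	uint32_t nr = raw & ~arch_flags;

	return nr < (UINT32_C(1) << SYSCALL_NR_BITS);
}
```

With this, -1 (0xffffffff) is rejected on every platform, and a number with
bit 30 set is accepted only where bit 30 is a recognized flag.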

> 5. A negative system call number is guaranteed to return -ENOSYS (unless 
> intercepted by seccomp, ptrace, or another mechanism under user space 
> control.)

Interception by seccomp is exactly the case that's wonky.

> 6. If the platform needs to algorithmically modify the system call number due 
> to platform-specific concerns (say, the platform uses a 16-bit special 
> purpose register for the syscall number, or it has multiple kernel entry 
> points with different behavior), it should if at all possible transcode the 
> system call number as necessary to match this convention in APIs that are 
> exposed to general kernel code. 
> 
> For example, in the future I could very much see the IA32 code in the x86 
> kernel using bit 29 internally to indicate an ia32 system call, simplifying 
> the is_compat implementation on x86. It should not mean that passing bit 29 
> to either the syscall instruction or int $0x80 will be accepted.

As I understand the code it uses bit 30 for that. Maybe I missed
something?

Thanks

Michal
