On July 2, 2026 2:49:56 PM PDT, Thomas Gleixner <[email protected]> wrote:
>On Wed, Jul 01 2026 at 11:29, H. Peter Anvin wrote:
>
>Can you please trim your replies? Scrolling through hundred lines of
>useless quoted text is just annoying.
>
>> On July 1, 2026 10:42:08 AM PDT, "Michal Suchánek" <[email protected]> wrote:
>>>-static __always_inline long syscall_enter_from_user_mode(struct pt_regs
>>>*regs, long syscall)
>>>+static __always_inline long syscall_enter_from_user_mode(struct pt_regs
>>>*regs, long *syscall)
>>> {
>>>         long ret;
>>>
>
>> 1. The type for a system call is int.
>
>That ship has sailed long ago. man syscall ...
>
>> 2. A valid system call number is always going to be positive.
>
>That's true today.
>
>> 3. Bits [30:24] are available for architecture ABI use. The
>> "architecture independent" part of the system call number is therefore
>> 24 bits wide.
>>
>> 4. The exact ABI is platform-specific, obviously, but as a general
>> guideline (especially for new platforms/ABIs) should follow the rules
>> for a platform "int" if practical. Notably, when passing a value in a
>> register larger than 32 bits, which side of the calling interface is
>> responsible for sign-extending a value passed in a register. If caller
>> side, the kernel should validate, if callee side the kernel should
>> ignore the additional bits and do the extension.
>
>The kernel sign expands today already, i.e. for compat syscalls.
>
>> 5. A negative system call number is guaranteed to return -ENOSYS
>> (unless intercepted by seccomp, ptrace, or another mechanism under
>> user space control.)
>
>That's true today.
>
>ASM entry:
>        regs->eax = -ENOSYS;
>
>C entry:
>        nr = syscall_enter_from_user_mode(regs, nr);
>
>        if ((unsigned)nr < SYSCALL_MAX)
>                regs->eax = handle_syscall();
>        else if (nr != -1)
>                regs->eax = -ENOSYS;
>
>        ....
>
>If seccomp overwrites regs->eax and aborts any syscall (including -1) by
>returning -1, then the value seccomp wrote into regs->eax is preserved
>and returned to user space.
>
>The same applies for syscall_user_dispatch() and ptrace...() if they
>decide to overwrite regs->eax _and_ abort the syscall by letting
>syscall_enter_from_user_mode() return -1.
>
>trace_syscall_enter() is not any different. If the magic BPF in there
>rewrites the syscall number to -1 then either the original -ENOSYS or
>the BPF induced overwrite is returned to user space.
>
>It's less than obvious and I have no objections to clean that up and
>make it more intuitive, but I still fail to see what Michal is actually
>trying to solve and what the magic flag is for. If s390 requires it,
>then that's an s390 problem, but definitely x86 does not.
>
>> 6. If the platform needs to algorithmically modify the system call
>> number due to platform-specific concerns (say, the platform uses a
>> 16-bit special purpose register for the syscall number, or it has
>> multiple kernel entry points with different behavior), it should if at
>> all possible transcode the system call number as necessary to match
>> this convention in APIs that are exposed to general kernel code.
>>
>> For example, in the future I could very much see the IA32 code in the
>> x86 kernel using bit 29 internally to indicate an ia32 system call,
>> simplifying the is_compat implementation on x86.
>
>I don't see how that makes it simpler. Those are two different entry
>code paths and magic bits wont make that go away.
>
>> It should not mean that passing bit 29 to either the syscall
>> instruction or int $0x80 will be accepted.
>
>Your proposal looks even more like a solution in search of a problem
>than the original one.
>
>Thanks,
>
>        tglx
>

The type in syscall(3) is irrelevant. The argument passed to the kernel
is treated as an int and sign-extended from 32 bits.

I'm explicitly not trying to invent things; I'm trying to document the
status quo to avoid further confusion and further mistakes. I'm sorry I
muddied the waters with what was intended to be a hypothetical example.
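
To make the sign-extension point concrete, here is a minimal user-space
sketch, not kernel code. The names sketch_syscall_nr(),
sketch_nr_is_valid() and SKETCH_NR_MAX are hypothetical and exist only
for illustration; the range check mirrors the "(unsigned)nr < SYSCALL_MAX"
test quoted above. It assumes a 32-bit int and the usual two's-complement
narrowing behaviour of gcc and clang.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: int is 32 bits wide, as on current Linux ABIs. */
static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");

/* Hypothetical table size for illustration, not a real kernel constant. */
#define SKETCH_NR_MAX 512

/*
 * Treat the syscall-number register as an int: drop the upper 32 bits
 * and sign-extend the low 32 bits. The uint32_t -> int32_t conversion
 * is implementation-defined before C23; gcc and clang wrap modulo 2^32.
 */
static long sketch_syscall_nr(uint64_t reg)
{
    return (long)(int32_t)(uint32_t)reg;
}

/*
 * One unsigned comparison rejects both negative numbers and numbers
 * past the end of the table, as in the quoted C entry code.
 */
static int sketch_nr_is_valid(long nr)
{
    return (unsigned int)nr < SKETCH_NR_MAX;
}
```

With this, a register value of 0xdeadbeef00000001 yields syscall number
1, and 0x00000000ffffffff yields -1, which fails the range check.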
