On Wed, Jul 01 2026 at 11:29, H. Peter Anvin wrote:

Can you please trim your replies? Scrolling through hundreds of lines of
useless quoted text is just annoying.

> On July 1, 2026 10:42:08 AM PDT, "Michal Suchánek" <[email protected]> wrote:
>>-static __always_inline long syscall_enter_from_user_mode(struct pt_regs
>>*regs, long syscall)
>>+static __always_inline long syscall_enter_from_user_mode(struct pt_regs
>>*regs, long *syscall)
>> {
>> 	long ret;
>>
> 1. The type for a system call is int.

That ship has sailed long ago.

    man syscall

...

> 2. A valid system call number is always going to be positive.

That's true today.

> 3. Bits [30:24] are available for architecture ABI use. The
> "architecture independent" part of the system call number is therefore
> 24 bits wide.
>
> 4. The exact ABI is platform-specific, obviously, but as a general
> guideline (especially for new platforms/ABIs) should follow the rules
> for a platform "int" if practical. Notably, when passing a value in a
> register larger than 32 bits, which side of the calling interface is
> responsible for sign-extending a value passed in a register. If caller
> side, the kernel should validate, if callee side the kernel should
> ignore the additional bits and do the extension.

The kernel sign-extends today already, i.e. for compat syscalls.

> 5. A negative system call number is guaranteed to return -ENOSYS
> (unless intercepted by seccomp, ptrace, or another mechanism under
> user space control.)

That's true today.

ASM entry:
	regs->eax = -ENOSYS;

C entry:
	nr = syscall_enter_from_user_mode(regs, nr);
	if ((unsigned)nr < SYSCALL_MAX)
		regs->eax = handle_syscall();
	else if (nr != -1)
		regs->eax = -ENOSYS;
	....

If seccomp overwrites regs->eax and aborts any syscall (including -1) by
returning -1, then the value seccomp wrote into regs->eax is preserved
and returned to user space.

The same applies for syscall_user_dispatch() and ptrace...() if they
decide to overwrite regs->eax _and_ abort the syscall by letting
syscall_enter_from_user_mode() return -1.

trace_syscall_enter() is not any different.
If the magic BPF in there rewrites the syscall number to -1 then either
the original -ENOSYS or the BPF-induced overwrite is returned to user
space.

It's less than obvious and I have no objections to clean that up and
make it more intuitive, but I still fail to see what Michal is actually
trying to solve and what the magic flag is for. If s390 requires it,
then that's an s390 problem, but definitely x86 does not.

> 6. If the platform needs to algorithmically modify the system call
> number due to platform-specific concerns (say, the platform uses a
> 16-bit special purpose register for the syscall number, or it has
> multiple kernel entry points with different behavior), it should if at
> all possible transcode the system call number as necessary to match
> this convention in APIs that are exposed to general kernel code.
>
> For example, in the future I could very much see the IA32 code in the
> x86 kernel using bit 29 internally to indicate an ia32 system call,
> simplifying the is_compat implementation on x86.

I don't see how that makes it simpler. Those are two different entry
code paths and magic bits won't make that go away.

> It should not mean that passing bit 29 to either the syscall
> instruction or int $0x80 will be accepted.

Your proposal looks even more like a solution in search of a problem
than the original one.

Thanks,

        tglx
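
For anyone who wants to poke at the return value rules described above,
here is a minimal user-space sketch of the "ASM entry" / "C entry" flow.
It is a toy model, not kernel code: struct model_regs, struct
hook_result, model_handle_syscall(), model_enter_from_user_mode(),
model_do_syscall() and the SYSCALL_MAX value are all made up for
illustration, and the single "hook" stands in for whatever seccomp,
ptrace or syscall_user_dispatch() might decide.

```c
#include <errno.h>
#include <stddef.h>	/* NULL, for callers that pass no hook */

/* Assumed table size for this model only; not the kernel's value. */
#define SYSCALL_MAX 4u

/* Models only the return-value register of struct pt_regs. */
struct model_regs {
	long eax;
};

/* Hypothetical description of what an entry hook decides to do. */
struct hook_result {
	int abort;	/* nonzero: abort the syscall, entry returns -1 */
	int overwrite;	/* nonzero: hook writes 'value' into eax */
	long value;
};

/* Stand-in for the real syscall table dispatch. */
static long model_handle_syscall(int nr)
{
	return 100 + nr;
}

/* Stand-in for syscall_enter_from_user_mode(): may rewrite eax and/or
 * abort by returning -1, otherwise passes the number through. */
static long model_enter_from_user_mode(struct model_regs *regs, int nr,
				       const struct hook_result *hook)
{
	if (hook && hook->overwrite)
		regs->eax = hook->value;
	if (hook && hook->abort)
		return -1;
	return nr;
}

long model_do_syscall(int nr, const struct hook_result *hook)
{
	struct model_regs regs;

	/* "ASM entry": preload the default error. */
	regs.eax = -ENOSYS;

	/* "C entry": same shape as the snippet in the mail. */
	nr = (int)model_enter_from_user_mode(&regs, nr, hook);
	if ((unsigned)nr < SYSCALL_MAX)
		regs.eax = model_handle_syscall(nr);
	else if (nr != -1)
		regs.eax = -ENOSYS;
	/* nr == -1: whatever is in eax at this point is returned. */

	return regs.eax;
}
```

Calling model_do_syscall(-1, NULL) returns the preloaded -ENOSYS, while
a hook with both abort and overwrite set gets its own value back to the
caller, which is the "less than obvious" part.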
