As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()"), system call trace events are allowed to fault in user space memory. Have some of the system call trace events take advantage of this.

Introduce a way to read from user space addresses from the syscall trace
event. This is accomplished by creating a per CPU temporary buffer that
is used to read unsafe user memory. When a syscall trace event needs to
read user memory, it reads the per CPU sched switch counter. It then
disables migration, enables preemption, copies the user space memory
into this buffer, then disables preemption again. If the counter is the
same as the original value, the buffer is valid. Otherwise it needs to
try again.

This is similar to how seqcount works, but uses the per CPU sched switch
counter as its sequence counter. If the counter is not the same, it
means another task scheduled in, and that task could have used the same
buffer and overwritten the data.

A new file is created in the tracefs directory (and also per instance)
that allows the user to shorten the amount copied from user space. It
can be completely disabled if set to zero (it will only display "" or
(, ...) but no copying from user space will be performed). The max size
to copy is hard coded to 128, which should be enough for this purpose.

This allows the output to look like this:

  sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
  sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)
  sys_write(fd: 1, buf: 0x56430f353be0 (2f:72:6f:6f:74:0a) "/root.", count: 6)
  sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)
  sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1)

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20250805192646.328291...@kernel.org/

- Removed the __rcu annotation from the fields that do not need RCU to
  protect them.

- Hide newsfstat around
  #if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64)
  as parisc failed to build without it. (kernel test robot)

- Fixed allocation of sinfo which used sizeof(sinfo) and not
  sizeof(*sinfo) (kernel test robot)

- Instead of incrementing a counter via the sched_switch tracepoint, use
  the nr_context_switches() API. (Mathieu Desnoyers)

- Use the length saved in the meta data of the event to limit the size
  of the string printed: "%.*s", len, str.

- Add comment describing that the method to read the memory from user
  space is similar to how seqcount works.

- Hide kexec_file_load around
  #if defined(__ARCH_WANT_TIME32_SYSCALLS) || __BITS_PER_LONG != 32
  to not break the i386 build.

- Added __user annotation to variable copying from user
  (kernel test robot)

- Change default to 63 (127 seemed too much)

- Change the max to 165 to fill in the extra data.

- Use the size macros of the max size and max args to calculate the size
  of the buffer to save the values in.

- Added new patch to show printable characters of binary arrays that are
  displayed.

Steven Rostedt (8):
      tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
      tracing: Have syscall trace events show "0x" for values greater than 10
      tracing: Have syscall trace events read user space string
      tracing: Have system call events record user array data
      tracing: Display some syscall arrays as strings
      tracing: Allow syscall trace events to read more than one user parameter
      tracing: Add syscall_user_buf_size to limit amount written
      tracing: Show printable characters in syscall arrays

----
 Documentation/trace/ftrace.rst |   8 +
 include/trace/syscall.h        |   8 +-
 kernel/trace/Kconfig           |  13 +
 kernel/trace/trace.c           |  52 +++
 kernel/trace/trace.h           |   7 +-
 kernel/trace/trace_syscalls.c  | 700 +++++++++++++++++++++++++++++++++++++++--
 6 files changed, 756 insertions(+), 32 deletions(-)
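
For readers who want to see the shape of the retry scheme described in this cover letter, here is a rough user space sketch. It is not the code from the patches: every sketch_* name is hypothetical, a plain global stands in for the per CPU sched switch counter, and sketch_pending_switches is a knob that simulates another task scheduling in during the copy and overwriting the shared buffer. The migration and preemption steps are only marked as comments, since they have no user space equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 128	/* hypothetical size of the temporary buffer */

/* Hypothetical stand-ins for per CPU state; these are not kernel symbols. */
unsigned long sketch_switch_count;	/* context switches seen on "this CPU" */
char sketch_buf[SKETCH_BUF_SIZE];	/* shared temporary buffer */
int sketch_pending_switches;		/* how many copies get interrupted */

/*
 * Simulated copy from user space. While the copy runs, another task may
 * schedule in, bump the switch counter, and reuse (clobber) the buffer.
 */
void sketch_copy(const char *src, size_t len)
{
	memcpy(sketch_buf, src, len);
	if (sketch_pending_switches > 0) {
		sketch_pending_switches--;
		sketch_switch_count++;
		memset(sketch_buf, 'X', len);	/* other task overwrote the data */
	}
}

/*
 * Copy len bytes into sketch_buf, retrying until no context switch was
 * observed across the copy. Returns the number of attempts it took.
 */
int sketch_read_user(const char *src, size_t len)
{
	unsigned long before;
	int attempts = 0;

	assert(len <= SKETCH_BUF_SIZE);

	do {
		before = sketch_switch_count;	/* sample the "sequence" */
		/* kernel: disable migration, enable preemption here */
		sketch_copy(src, len);
		/* kernel: disable preemption again here */
		attempts++;
	} while (before != sketch_switch_count);	/* changed: buffer is suspect */

	return attempts;
}
```

With sketch_pending_switches set to 2, the first two copies are treated as invalid and the third one is kept. The counter comparison plays the role that the sequence check plays in a seqcount reader.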