On Thu, 7 Nov 2024 at 16:54, Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
>
> On 2024-11-07 10:46, Marco Elver wrote:
> > On Thu, 7 Nov 2024 at 16:45, Mathieu Desnoyers
> > <mathieu.desnoy...@efficios.com> wrote:
> >>
> >> On 2024-11-07 07:25, Marco Elver wrote:
> >>> prctl() is a complex syscall which multiplexes its functionality based
> >>> on a large set of PR_* options. Currently we count 64 such options. The
> >>> return value of unknown options is -EINVAL, and doesn't distinguish from
> >>> known options that were passed invalid args that also return -EINVAL.
> >>>
> >>> To understand if programs are attempting to use prctl() options not yet
> >>> available on the running kernel, provide the task_prctl_unknown
> >>> tracepoint.
> >>>
> >>> Note, this tracepoint is in an unlikely cold path, and would therefore
> >>> be suitable for continuous monitoring (e.g. via perf_event_open).
> >>>
> >>> While the above is likely the simplest usecase, additionally this
> >>> tracepoint can help unlock some testing scenarios (where probing
> >>> sys_enter or sys_exit causes undesirable performance overheads):
> >>>
> >>> a. unprivileged triggering of a test module: test modules may register a
> >>>    probe to be called back on task_prctl_unknown, and pick a very large
> >>>    unknown prctl() option upon which they perform a test function for an
> >>>    unprivileged user;
> >>>
> >>> b. unprivileged triggering of an eBPF program function: similar
> >>>    as idea (a).
> >>>
> >>> Example trace_pipe output:
> >>>
> >>>   test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test
> >>>   option=1234 arg2=101 arg3=102 arg4=103 arg5=104
> >>>
> >>
> >> My concern is that we start adding tons of special-case
> >> tracepoints to the implementation of system calls which
> >> are redundant with the sys_enter/exit tracepoints.
> >>
> >> Why favor this approach rather than hooking on sys_enter/exit ?
> >
> > It's __extremely__ expensive when deployed at scale. See note in
> > commit description above.
>
> I suspect you base the overhead analysis on the x86-64 implementation
> of sys_enter/exit tracepoint and especially the overhead caused by
> the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct ?
>
> If that is causing a too large overhead, we should investigate if
> those can be improved instead of adding tracepoints in the
> implementation of system calls.

Doing that may be generally useful, but even if you improve it somehow,
there's always some additional work needed on sys_enter/exit as soon as
a tracepoint is attached. Even if that's just a few cycles, it's too
much (for me at least).

Also: if you just hook sys_enter/exit, you can't tell from the return
code (-EINVAL) whether the prctl() was handled or not. I want the kernel
to tell me if it handled the prctl() or not, and I also think it's very
bad design to copy-paste the prctl() option checking of the running
kernel into a sys_enter/exit hook. This doesn't scale in terms of
performance or maintainability.
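
To illustrate the -EINVAL ambiguity from user space, here is a minimal
sketch (Linux only). It is an illustration under stated assumptions, not
part of the patch: the helper name prctl_errno() is hypothetical, 1234 is
simply the option value from the example trace output and is assumed to
be unknown to the running kernel, and 99 is an arbitrary invalid argument
for the documented PR_SET_DUMPABLE option.

```python
import ctypes
import errno

# PR_SET_DUMPABLE is a documented option (value 4 in <linux/prctl.h>).
PR_SET_DUMPABLE = 4
# Assumption: 1234 (taken from the example trace output) is not a prctl()
# option known to the running kernel.
UNKNOWN_OPTION = 1234

# Load the C library of the current process; use_errno=True lets ctypes
# capture errno after each call.
libc = ctypes.CDLL(None, use_errno=True)
libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
libc.prctl.restype = ctypes.c_int


def prctl_errno(option, arg2=0):
    """Hypothetical helper: call prctl(option, arg2, 0, 0, 0).

    Returns 0 if the call succeeded, otherwise the errno value it set.
    """
    ctypes.set_errno(0)
    ret = libc.prctl(option, arg2, 0, 0, 0)
    return 0 if ret >= 0 else ctypes.get_errno()


if __name__ == "__main__":
    # Case 1: an option the kernel does not know at all.
    unknown = prctl_errno(UNKNOWN_OPTION)
    # Case 2: a known option with an argument it rejects.
    bad_arg = prctl_errno(PR_SET_DUMPABLE, 99)

    for label, err in (("unknown option", unknown),
                       ("known option, bad arg", bad_arg)):
        print(f"{label}: {errno.errorcode.get(err, err)}")
```

Both calls fail with the same error code, so a sys_exit hook that only
sees the return value cannot tell the two cases apart without
duplicating the kernel's own option checking.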