On Thu, Dec 21, 2023 at 2:05 PM Haakon Bugge <[email protected]> wrote:
> > On 18 Dec 2023, at 23:09, Paul Moore <[email protected]> wrote:
> > On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <[email protected]> 
> > wrote:
> >>> On 14 Dec 2023, at 00:45, Paul Moore <[email protected]> wrote:
> >>> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <[email protected]> 
> >>> wrote:
> >>>>
> >>>> For the most time-consuming function, when running a syscall benchmark
> >>>> with STIG compliant audit rules:
> >>>>
> >>>> Overhead  Command       Shared Object      Symbol
> >>>> .........  ............  .................  ........................
> >>>>
> >>>>   27.62%  syscall_lat   [kernel.kallsyms]  [k] __audit_filter_op
> >>>>
> >>>> we apply codegen optimizations, which speeds up the syscall
> >>>> performance by around 17% on an Intel Cascade Lake system.
> >>>>
> >>>> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C
> >>>> application that measures average syscall latency from getpid()
> >>>> running 100 million rounds.
> >>>>
> >>>> Between each perf run, we reboot the system and wait until the
> >>>> one-minute load average is less than 1.0.
> >>>>
> >>>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to
> >>>> amplify the changes in the audit system.
> >>>>
> >>>> Let the base kernel be v6.6-rc4, booted with "audit=1" and
> >>>> "mitigations=off" and with the commit "audit: Vary struct audit_entry
> >>>> alignment" applied, on an Intel Cascade Lake system. The following three
> >>>> metrics are reported: nanoseconds per syscall, L1D misses per syscall,
> >>>> and finally Instructions Per Cycle, ipc.
> >>>>
> >>>> Base vs. base + this commit gives:
> >>>>
> >>>> ns per call:
> >>>>    min    avg    max   pstdev
> >>>> -  203    203    209   0.954149
> >>>> +  173    173    178   0.884534
> >>>>
> >>>> L1d misses per syscall:
> >>>>    min    avg    max   pstdev
> >>>> -  0.012  0.103  0.817 0.238352
> >>>> +  0.010  0.209  1.235 0.399416
> >>>>
> >>>> ipc:
> >>>>   min    avg    max   pstdev
> >>>> - 2.320  2.329  2.330 0.003000
> >>>> + 2.430  2.436  2.440 0.004899
> >>>>
> >>>> Signed-off-by: Håkon Bugge <[email protected]>
> >>>> ---
> >>>> kernel/auditsc.c | 2 ++
> >>>> 1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >>>> index 6f0d6fb6523fa..84d0dfe75a4ac 100644
> >>>> --- a/kernel/auditsc.c
> >>>> +++ b/kernel/auditsc.c
> >>>> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> >>>> * parameter can be NULL, but all others must be specified.
> >>>> * Returns 1/true if the filter finds a match, 0/false if none are found.
> >>>> */
> >>>> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16")
> >>>
> >>> The kernel doesn't really make use of #pragma optimization statements
> >>> like this, at least not in any of the core areas, and I'm not
> >>> interested in being the first to do so.  I appreciate the time and
> >>> effort that you have spent profiling the audit subsystem, but this
> >>> isn't a patch I can accept at this point in time, I'm sorry.
> >>
> >> Fair enough. Will a function attribute aka:
> >>
> >> __attribute__((optimize("foo=bar")))
> >>
> >> be acceptable for you?
> >
> > Unless you can show me widespread acceptance of these types of
> > optimizations in core kernel code, my answer is no.  I'm sorry.
>
> Hi Paul,
>
> Well, I guess we both know that I cannot show you any "widespread acceptance" 
> of the sort. Here is what I see:
>
> arch/arm/lib/xor-neon.c:#pragma GCC optimize "tree-vectorize"
> (not what you describe as core kernel)
>
> lib/zstd/common/compiler.h:#    define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
> (core, but not widespread)
>
> Then of course, there are more examples of specific optimizations, or the
> disabling thereof, in Makefiles:
>
> kernel/debug/kdb/Makefile:clean-files := gen-kdb_cmds.c
> kernel/kcsan/Makefile:CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \
> kernel/kcsan/Makefile: -fno-stack-protector -DDISABLE_BRANCH_PROFILING
> kernel/kcsan/Makefile:CFLAGS_kcsan_test.o := $(CFLAGS_KCSAN) -fno-omit-frame-pointer
> kernel/Makefile:CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector
> kernel/Makefile:clean-files := kheaders_data.tar.xz kheaders.md5
> kernel/sched/Makefile:# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
> kernel/sched/Makefile:CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
> kernel/bpf/Makefile:cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
> kernel/entry/Makefile:CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong
> kernel/entry/Makefile:CFLAGS_common.o += -fno-stack-protector
> kernel/rcu/Makefile:KBUILD_CFLAGS += -g -fno-omit-frame-pointer
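
Following the per-object-flags convention shown in the list above, an alternative to the pragma that keeps C sources free of compiler directives would be a Makefile entry. This is a hypothetical sketch, not part of the posted patch; the flag names mirror the ones proposed, wrapped in cc-option so non-GCC builds degrade gracefully:

```make
# Hypothetical kernel/Makefile fragment: per-object codegen flags for
# auditsc.o, using the mechanism the kernel already uses elsewhere.
CFLAGS_auditsc.o := $(call cc-option,-funswitch-loops) \
                    $(call cc-option,-falign-loops=16) \
                    $(call cc-option,-falign-jumps=16)
```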
>
> My view is:
>
> Coding time-critical functions that access structs allocated by k{m,z}alloc
> and family in close temporal proximity is, with all respect, not good
> programming practice, as depicted by the example in this patch series.
>
> I have to accept that you will not accept this patch series or a v2 thereof.
>
> But this leaves (210 - 173) = 37 nanoseconds of unnecessary syscall
> overhead on the floor.
>
> It should be unnecessary to quote the number of Cascade Lake systems
> deployed, or how many of them run with audit enabled and STIG compliant
> rules, to point out the significance of those nanoseconds.
>
> I would be very appreciative if you could come up with a better, more
> acceptable solution to shave off these nanoseconds.

I appreciate and understand your frustration, but sometimes the
subjective quality and maintainability of the code needs to take
precedence, and in my mind the tradeoff for a 37ns win is not
desirable.  Thank you again for your time and effort.

-- 
paul-moore.com
