Steve Grubb <sgr...@redhat.com> writes:
> Hello, > > Thanks for the detailed notes on this investigation. It really is a lot of > good information backing this up. However, there will come a day when someone > sees this "major = ctx->major" and they will send a patch to "fix" this > unnecessary assignment. If you are sending a V2 of this set, I would suggest > adding some comment in the code that this is for a performance improvement > and to see the commit message for additional info. Thanks for the comment. Just sent out v2 of the last patch which addresses this tangentially -- it adds a separate function parameter for ctx->major/uring_op, so this shouldn't be an issue anymore. Thanks Ankur > > Thanks, > -Steve > > On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote: >> ctx->major contains the current syscall number. This is, of course, a >> constant for the duration of the syscall. Unfortunately, GCC's alias >> analysis cannot prove that it is not modified via a pointer in the >> audit_filter_syscall() loop, and so always loads it from memory. >> >> In and of itself the load isn't very expensive (ops dependent on the >> ctx->major load are only used to determine the direction of control flow >> and have short dependence chains and, in any case the related branches >> get predicted perfectly in the fastpath) but still cache ctx->major >> in a local for two reasons: >> >> * ctx->major is in the first cacheline of struct audit_context and has >> similar alignment as audit_entry::list audit_entry. For cases >> with a lot of audit rules, doing this reduces one source of contention >> from a potentially busy cache-set. >> >> * audit_in_mask() (called in the hot loop in audit_filter_syscall()) >> does cast manipulation and error checking on ctx->major: >> >> audit_in_mask(const struct audit_krule *rule, unsigned long val): >> if (val > 0xffffffff) >> return false; >> >> word = AUDIT_WORD(val); >> if (word >= AUDIT_BITMASK_SIZE) >> return false; >> >> bit = AUDIT_BIT(val); >> >> return rule->mask[word] & bit; >> >> The clauses related to the rule need to be evaluated in the loop, but >> the rest is unnecessarily re-evaluated for every loop iteration. >> (Note, however, that most of these are cheap ALU ops and the branches >> are perfectly predicted. However, see discussion on cycles >> improvement below for more on why it is still worth hoisting.) >> >> On a Skylakex system change in getpid() latency (aggregated over >> 12 boot cycles): >> >> Min Mean Median Max pstdev >> (ns) (ns) (ns) (ns) >> >> - 201.30 216.14 216.22 228.46 (+- 1.45%) >> + 196.63 207.86 206.60 230.98 (+- 3.92%) >> >> Performance counter stats for 'bin/getpid' (3 runs) go from: >> cycles 836.89 ( +- .80% ) >> instructions 2000.19 ( +- .03% ) >> IPC 2.39 ( +- .83% ) >> branches 430.14 ( +- .03% ) >> branch-misses 1.48 ( +- 3.37% ) >> L1-dcache-loads 471.11 ( +- .05% ) >> L1-dcache-load-misses 7.62 ( +- 46.98% ) >> >> to: >> cycles 805.58 ( +- 4.11% ) >> instructions 1654.11 ( +- .05% ) >> IPC 2.06 ( +- 3.39% ) >> branches 430.02 ( +- .05% ) >> branch-misses 1.55 ( +- 7.09% ) >> L1-dcache-loads 440.01 ( +- .09% ) >> L1-dcache-load-misses 9.05 ( +- 74.03% ) >> >> (Both aggregated over 12 boot cycles.) >> >> instructions: we reduce around 8 instructions/iteration because some of >> the computation is now hoisted out of the loop (branch count does not >> change because GCC, for reasons unclear, only hoists the computations >> while keeping the basic-blocks.) >> >> cycles: improve by about 5% (in aggregate and looking at individual run >> numbers.) This is likely because we now waste fewer pipeline resources >> on unnecessary instructions which allows the control flow to >> speculatively execute further ahead shortening the execution of the loop >> a little. The final gating factor on the performance of this loop >> remains the long dependence chain due to the linked-list load. >> >> Signed-off-by: Ankur Arora <ankur.a.ar...@oracle.com> >> --- >> kernel/auditsc.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/auditsc.c b/kernel/auditsc.c >> index 79a5da1bc5bb..533b087c3c02 100644 >> --- a/kernel/auditsc.c >> +++ b/kernel/auditsc.c >> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct >> *tsk, { >> struct audit_entry *e; >> enum audit_state state; >> + unsigned long major = ctx->major; >> >> if (auditd_test_task(tsk)) >> return; >> >> rcu_read_lock(); >> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], > list) { >> - if (audit_in_mask(&e->rule, ctx->major) && >> + if (audit_in_mask(&e->rule, major) && >> audit_filter_rules(tsk, &e->rule, ctx, NULL, >> &state, false)) { >> rcu_read_unlock(); -- ankur -- Linux-audit mailing list Linux-audit@redhat.com https://listman.redhat.com/mailman/listinfo/linux-audit