Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-28 Thread Christian Brauner
On Mon, Nov 27, 2023 at 09:10:54AM -0800, Linus Torvalds wrote:
> On Mon, 27 Nov 2023 at 02:27, Christian Brauner  wrote:
> >
> > So I've picked up your patch (vfs.misc). It's clever alright, so thanks
> > for the comments in there; otherwise I would've stared at this for far
> > too long.
> 
> Note that I should probably have commented on one other thing: that
> whole "just load from fd[0] is always safe, because the fd[] array
> always exists".

I added a comment to that effect in the code.

> 
> IOW, that whole "load and mask" thing only works when you know the
> array exists at all.
> 
> Doing that "just mask the index" wouldn't be valid if "size = 0" is an
> option and might mean that we don't have an array at all (ie if "->fd"
> itself could be NULL).
> 
> But we never have a completely empty file descriptor array, and
> fdt->fd is never NULL.  At a minimum 'max_fds' is NR_OPEN_DEFAULT.
> 
> (The whole 'tsk->files' could be NULL, but only for kernel threads or
> when exiting, so fget_task() will check for *that*, but it's a
> separate thing)

Yep.

> 
> So that's why it's safe to *entirely* remove the whole
> 
> if (unlikely(fd >= fdt->max_fds))
> 
> test, and do it *all* with just "mask the index, and mask the resulting load".

Yep.


Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-27 Thread Linus Torvalds
On Mon, 27 Nov 2023 at 02:27, Christian Brauner  wrote:
>
> So I've picked up your patch (vfs.misc). It's clever alright, so thanks
> for the comments in there; otherwise I would've stared at this for far
> too long.

Note that I should probably have commented on one other thing: that
whole "just load from fd[0] is always safe, because the fd[] array
always exists".

IOW, that whole "load and mask" thing only works when you know the
array exists at all.

Doing that "just mask the index" wouldn't be valid if "size = 0" is an
option and might mean that we don't have an array at all (ie if "->fd"
itself could be NULL).

But we never have a completely empty file descriptor array, and
fdt->fd is never NULL.  At a minimum 'max_fds' is NR_OPEN_DEFAULT.

(The whole 'tsk->files' could be NULL, but only for kernel threads or
when exiting, so fget_task() will check for *that*, but it's a
separate thing)

So that's why it's safe to *entirely* remove the whole

if (unlikely(fd >= fdt->max_fds))

test, and do it *all* with just "mask the index, and mask the resulting load".

Because we can *always* do that load at "fdt->fd[0]", and we want to
check the result for NULL anyway, so the "mask at the end and check
for NULL" is both natural and generates very good code.

Anyway, not a big deal, but it might be worth noting before somebody
tries the same trick on some other array that *could* be zero-sized
and with a NULL base pointer, and where that 'array[0]' access isn't
necessarily guaranteed to be ok.
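
For reference (abridged from include/linux/fdtable.h, trimmed to the
fields that matter here): every files_struct embeds a default fdtable
plus an NR_OPEN_DEFAULT-sized array, and fdt->fd points at that embedded
array until the table is first expanded. That is what makes the fd[0]
load unconditionally safe:

	struct fdtable {
		unsigned int max_fds;
		struct file __rcu **fd;		/* current fd array */
		/* ... open/close-on-exec bitmaps, rcu head ... */
	};

	struct files_struct {
		atomic_t count;
		/* ... */
		struct fdtable __rcu *fdt;	/* points at fdtab until expanded */
		struct fdtable fdtab;
		/* ... */
		struct file __rcu *fd_array[NR_OPEN_DEFAULT];
	};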

> It's a little unpleasant because of the cast-orama going on before we
> check the file pointer but I don't see that it's in any way wrong.

In my cleanup phase - which was a bit messy - I did wonder if I should
have some helper for it, since it shows up in both __fget_files_rcu()
and in files_lookup_fd_raw().

So I *could* have tried to add something like a
"masked_rcu_dereference()" that took the base pointer, the index, and
the mask, and did that whole dance.

Or I could have had just a "mask_pointer()" function, which we do
occasionally do in other places too (ie we hide data in low bits, and
then we mask them away when the pointer is used as a pointer).
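
For concreteness, a hypothetical masked_rcu_dereference() along those
lines (never merged, purely illustrative) might have looked roughly like:

	/*
	 * Hypothetical helper, not in the tree: bundles "mask the index,
	 * do the RCU load, mask the result". 'mask' is the 0 / ~0UL value
	 * from array_index_mask_nospec(), so an invalid fd reads slot 0
	 * (which always exists) and then masks the result down to NULL.
	 */
	static inline struct file *masked_rcu_dereference(struct file __rcu **base,
							  unsigned int idx,
							  unsigned long mask)
	{
		struct file *file = rcu_dereference_raw(base[idx & mask]);

		return (struct file *)(mask & (unsigned long)file);
	}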

But with only two users, it seemed to add more conceptual complexity
than it's worth, and I was not convinced that we'd want to expose that
pattern and have others use it.

So having a helper might clarify things, but it might also encourage
wrong users. I dunno.

I suspect the only real use for this ends up being this very special
"access the fdt->fd[] array using a file descriptor".

Anyway, that's why I largely just did it with comments, and commented
both places - and just kept the cast there in the open.

 Linus


Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-27 Thread Mateusz Guzik
On Mon, Nov 20, 2023 at 03:11:31PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops 
> on:
> 
> 
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to 
> SLAB_TYPESAFE_BY_RCU")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 93faf426e3cc000c 0ede61d8589cc2d93aa78230d74
> ---------------- ---------------------------
>      %stddev        %change %stddev
>          \             |       \
[snip]
>  30.90 ± 4%  -20.6  10.35 ± 2%   perf-profile.self.cycles-pp.__fget_light
>   0.00       +26.5  26.48        perf-profile.self.cycles-pp.__get_file_rcu
[snip]

So __fget_light now got a func call.

I don't know if this is worth patching (and benchmarking after), but if
sorting this out is of interest, the triviality below is probably the
easiest way out:

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..d8d3e18800c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -856,14 +856,14 @@ void do_close_on_exec(struct files_struct *files)
	spin_unlock(&files->file_lock);
 }
 
-static struct file *__get_file_rcu(struct file __rcu **f)
+static __always_inline struct file *__get_file_rcu(struct file __rcu **f)
 {
struct file __rcu *file;
struct file __rcu *file_reloaded;
struct file __rcu *file_reloaded_cmp;
 
file = rcu_dereference_raw(*f);
-   if (!file)
+   if (unlikely(!file))
return NULL;
 
	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
@@ -891,7 +891,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 * If the pointers don't match the file has been reallocated by
 * SLAB_TYPESAFE_BY_RCU.
 */
-   if (file == file_reloaded_cmp)
+   if (likely(file == file_reloaded_cmp))
return file_reloaded;
 
fput(file);


Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-27 Thread Christian Brauner
> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

So I've picked up your patch (vfs.misc). It's clever alright so thanks
for the comments in there otherwise I would've stared at this for far
too long.

It's a little unpleasant because of the cast-orama going on before we
check the file pointer, but I don't see that it's in any way wrong. And
given how focused people are on __fget_* performance I think it might
even be the right thing to do.

But the cleverness means we have the same logic slightly differently
twice. Not too bad ofc, but not too nice either, especially because that
rcu lookup is pretty complicated already.

A few days ago I did just write a long explanatory off-list email to
someone who had questions about this and who is fairly experienced, so
we're not making it easy on people. But performance or simplicity: one
can't necessarily always have both.


Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-27 Thread Christian Brauner
> I took a look at the code generation, and honestly, I think we're
> better off just making __fget_files_rcu() have special logic for this
> all, and not use __get_file_rcu().

My initial massaging of the patch did that btw. Then I sat there
wondering whether it would matter if we just made it possible to reuse
that code, and I went through a bunch of iterations. Oh well, it seems
to matter.

> Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that

Concept looks sane to me.

> __get_file_rcu() does, and I don't get it. Both things come from
> volatile accesses, I don't see the point of those games, but I also
> didn't care, since it's no longer in a critical code path.
> 
> Christian?

Puts on his completely imagined "I understand RCU" head.
SLAB_TYPESAFE_BY_RCU makes the RCU consume memory ordering (which the
compiler doesn't officially support, afaik) a bit wonky.

So the thinking was that we could have code patterns where you could
free the object and reallocate it while legitimately passing the pointer
recheck. In that case there is no memory ordering between the allocation
and the pointer recheck because the last (re)allocation could have been
after the rcu_dereference().

To combat that all future loads were made to have a dependency on the
first load using the hidevar trick.
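
The trick in question, roughly as it reads in __get_file_rcu() after
0ede61d858:

	/* Load the pointer again for the recheck ... */
	file_reloaded = rcu_dereference_raw(*f);

	/*
	 * ... and launder the copy through an empty asm so the compiler
	 * cannot prove it equals an earlier value: every later use keeps
	 * a dependency on this load, and the comparison against the
	 * first load cannot be reordered or elided.
	 */
	file_reloaded_cmp = file_reloaded;
	OPTIMIZER_HIDE_VAR(file_reloaded_cmp);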

I guess that might only be theoretically possible but not in practice?
But then I liked that we explicitly commented on it as a reminder.


Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-26 Thread Oliver Sang
hi, Linus,

On Sun, Nov 26, 2023 at 03:20:58PM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
>  wrote:
> >
> > IOW, I might have messed up some "trivial cleanup" when prepping for
> > sending it out...
> 
> Bah. Famous last words. One of the "trivial cleanups" made the code
> more "obvious" by renaming the nospec mask as just "mask".
> 
> And that trivial rename broke that patch *entirely*, because now that
> name shadowed the "fmode_t" mask argument.
> 
> Don't even ask how long it took me to go from "I *tested* this,
> dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
> 
> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

we applied the new patch upon 0ede61d858, and confirmed the regression is
gone; it's even 3.4% better than 93faf426e3 now.

Tested-by: kernel test robot 

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
  93faf426e3 ("vfs: shave work on failed file open")
  0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")
  c712b4365b ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 c712b4365b5b4dbe1d1380edd37
---------------- --------------------------- ---------------------------
     %stddev        %change %stddev        %change %stddev
         \             |       \              |       \
    228481 ± 4%      -4.6% 217900 ± 6%     -11.7% 201857 ± 5%    meminfo.DirectMap4k
     89056           -2.0%  87309           -1.6%  87606         proc-vmstat.nr_slab_unreclaimable
     16.28           -0.7%  16.16           -1.0%  16.12         turbostat.RAMWatt
      0.01 ± 9%  +58125.6%   4.17 ±175%  +23253.5%   1.67 ±222%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    781.67 ± 10%     +6.5% 832.50 ± 19%    -14.3% 670.17 ± 4%    perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     97958 ± 7%      -9.7%  88449 ± 4%      -0.6%  97399 ± 4%    sched_debug.cpu.avg_idle.stddev
      0.00 ± 12%    +24.2%   0.00 ± 17%     -5.2%   0.00 ± 7%    sched_debug.cpu.next_balance.stddev
   6391048           -2.9% 6208584          +3.4% 6605584        will-it-scale.16.threads
    399440           -2.9%  388036          +3.4%  412848        will-it-scale.per_thread_ops
   6391048           -2.9% 6208584          +3.4% 6605584        will-it-scale.workload
     19.99 ± 4%       -2.2   17.74           +1.2   21.18 ± 2%   perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      1.27 ± 5%       +0.8    2.11 ± 3%     +31.1   32.36 ± 2%   perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     32.69 ± 4%       +5.0   37.70          -32.7    0.00        perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.00           +27.9   27.85           +0.0    0.00        perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
     20.00 ± 4%       -2.3   17.75           +0.4   20.43 ± 2%   perf-profile.children.cycles-pp.fput
      0.24 ± 10%      -0.1    0.18 ± 2%      -0.1    0.18 ± 10%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.48 ± 5%       +0.5    1.98 ± 3%     +30.8   32.32 ± 2%   perf-profile.children.cycles-pp.__fdget
     31.85 ± 4%       +6.0   37.86          -31.8    0.00        perf-profile.children.cycles-pp.__fget_light
      0.00           +27.7   27.67           +0.0    0.00        perf-profile.children.cycles-pp.__get_file_rcu
     30.90 ± 4%      -20.6   10.35 ± 2%     -30.9    0.00        perf-profile.self.cycles-pp.__fget_light
     19.94 ± 4%       -2.4   17.53           -0.3   19.62 ± 2%   perf-profile.self.cycles-pp.fput
      9.81 ± 4%       -2.4    7.42 ± 2%      +1.7   11.51 ± 4%   perf-profile.self.cycles-pp.do_poll
      0.23 ± 11%      -0.1    0.17 ± 4%      -0.1    0.18 ± 11%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.44 ± 7%       +0.0    0.45 ± 5%      +0.1    0.52 ± 4%   perf-profile.self.cycles-pp.__poll
      0.85 ± 4%       +0.1    0.92 ± 3%     +30.3   31.17 ± 2%   perf-profile.self.cycles-pp.__fdget
      0.00           +26.5   26.48           +0.0    0.00        perf-profile.self.cycles-pp.__get_file_rcu
 2.146e+10 ± 2%      +8.5% 2.329e+10 ± 2%   -2.1% 2.101e+10      perf-stat.i.branch-instructions

Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-26 Thread Linus Torvalds
On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
 wrote:
>
> IOW, I might have messed up some "trivial cleanup" when prepping for
> sending it out...

Bah. Famous last words. One of the "trivial cleanups" made the code
more "obvious" by renaming the nospec mask as just "mask".

And that trivial rename broke that patch *entirely*, because now that
name shadowed the "fmode_t" mask argument.
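
A hedged reconstruction of the breakage (not the actual interim patch):
__fget_files_rcu() already takes an fmode_t parameter named 'mask', so
the renamed local shadows it for the rest of the loop body:

	static inline struct file *__fget_files_rcu(struct files_struct *files,
			unsigned int fd, fmode_t mask)
	{
		for (;;) {
			struct file *file;
			struct fdtable *fdt = rcu_dereference_raw(files->fdt);
			/* The fatal rename: this local now shadows the
			 * fmode_t argument for the rest of the loop body. */
			unsigned long mask = array_index_mask_nospec(fd, fdt->max_fds);

			/* ... masked load and refcount dance as in the patch ... */

			/* Compiles fine, but 'mask' here is ~0UL for every
			 * valid fd, so this rejects every file: "now it
			 * doesn't work at all". */
			if (unlikely(file->f_mode & mask)) {
				fput(file);
				return NULL;
			}
			return file;
		}
	}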

Don't even ask how long it took me to go from "I *tested* this,
dammit, now it doesn't work at all" to "Oh God, I'm so stupid".

So that nobody else would waste any time on this, attached is a new
attempt. This time actually tested *after* the changes.

  Linus
From 45f70b5413a654d20ead410c533ec1d604bdb1e2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Sun, 26 Nov 2023 12:24:38 -0800
Subject: [PATCH] Improve __fget_files_rcu() code generation (and thus __fget_light())

Commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU") caused a
performance regression as reported by the kernel test robot.

The __fget_light() function is one of those critical ones for some
loads, and the code generation was unnecessarily impacted.  Let's just
write that function to be better.

Reported-by: kernel test robot 
Cc: Christian Brauner 
Cc: Jann Horn 
Cc: Mateusz Guzik 
Closes: https://lore.kernel.org/oe-lkp/202311201406.2022ca3f-oliver.s...@intel.com
Signed-off-by: Linus Torvalds 
---
 fs/file.c               | 48 ++++++++++++++++++++++++++----------------
 include/linux/fdtable.h | 15 ++++++++++-----
 2 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..608b4802c214 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -959,31 +959,42 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		struct file *file;
 		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
 		struct file __rcu **fdentry;
+		unsigned long nospec;
 
-		if (unlikely(fd >= fdt->max_fds))
+		/* Mask is a 0 for invalid fd's, ~0 for valid ones */
+		nospec = array_index_mask_nospec(fd, fdt->max_fds);
+
+		/* fdentry points to the 'fd' offset, or fdt->fd[0] */
+		fdentry = fdt->fd + (fd & nospec);
+
+		/* Do the load, then mask any invalid result */
+		file = rcu_dereference_raw(*fdentry);
+		file = (void *)(nospec & (unsigned long)file);
+
+		if (unlikely(!file))
 			return NULL;
 
-		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
+		/*
+		 * Ok, we have a file pointer that was valid at
+		 * some point, but it might have become stale since.
+		 *
+		 * We need to confirm it by incrementing the refcount
+		 * and then check the lookup again.
+		 *
+		 * atomic_long_inc_not_zero() gives us a full memory
+		 * barrier. We only really need an 'acquire' one to
+		 * protect the loads below, but we don't have that.
+		 */
+		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+			continue;
 
 		/*
-		 * Ok, we have a file pointer. However, because we do
-		 * this all locklessly under RCU, we may be racing with
-		 * that file being closed.
-		 *
 		 * Such a race can take two forms:
 		 *
 		 *  (a) the file ref already went down to zero and the
 		 *  file hasn't been reused yet or the file count
 		 *  isn't zero but the file has already been reused.
-		 */
-		file = __get_file_rcu(fdentry);
-		if (unlikely(!file))
-			return NULL;
-
-		if (unlikely(IS_ERR(file)))
-			continue;
-
-		/*
+		 *
 		 *  (b) the file table entry has changed under us.
 		 *   Note that we don't need to re-check the 'fdt->fd'
 		 *   pointer having changed, because it always goes
@@ -991,7 +1002,8 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * If so, we need to put our ref and try again.
 		 */
-		if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
+		if (unlikely(file != rcu_dereference_raw(*fdentry)) ||
+		unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
 			fput(file);
 			continue;
 		}
@@ -1128,13 +1140,13 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 	 * atomic_read_acquire() pairs with atomic_dec_and_test() in
 	 * put_files_struct().
 	 */
-	if (atomic_read_acquire(&files->count) == 1) {
+	if (likely(atomic_read_acquire(&files->count) == 1)) {
 		file = files_lookup_fd_raw(files, fd);
 		if (!file || unlikely(file->f_mode & mask))
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget_files(files, fd, mask);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index bc4c3287a65e..80bd7789bab1 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,12 +83,17 @@ struct dentry;
 static inline struct file *files_lookup_fd_raw(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+	unsigned long mask = array_index_mask_nospec(fd, fdt->max_fds);
+	struct file *needs_masking;
 
-	if (fd < fdt->max_fds) {
-		fd = array_index_nospec(fd, 

Re: [Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-26 Thread Linus Torvalds
On Sun, 19 Nov 2023 at 23:11, kernel test robot  wrote:
>
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops 
> on:
>
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to 
> SLAB_TYPESAFE_BY_RCU")

Ok, so __fget_light() is one of our more performance-critical things,
and that commit ends up making it call a rather more expensive version
in __get_file_rcu(), so we have:

>  30.90 ± 4%  -20.6  10.35 ± 2%   perf-profile.self.cycles-pp.__fget_light
>   0.00       +26.5  26.48        perf-profile.self.cycles-pp.__get_file_rcu

and that "20% decrease balanced by 26% increase elsewhere" then
directly causes the ~3% regression.

I took a look at the code generation, and honestly, I think we're
better off just making __fget_files_rcu() have special logic for this
all, and not use __get_file_rcu().

The 'fd' case really is special because we need to do that
non-speculative pointer access.

Because it turns out that we already have to use array_index_nospec()
to safely generate that pointer to the fd entry, and once you have to
do that "use non-speculative accesses to generate a safe pointer", you
might as well just go whole hog.

So this takes a different approach, and uses the
array_index_mask_nospec() thing that we have exactly to generate that
non-speculative mask for these things:

/* Mask is a 0 for invalid fd's, ~0 for valid ones */
mask = array_index_mask_nospec(fd, fdt->max_fds);

and then it does something you can consider either horribly clever, or
horribly ugly (this first part is basically just
array_index_nospec()):

/* fdentry points to the 'fd' offset, or fdt->fd[0] */
fdentry = fdt->fd + (fd & mask);

and then we can just *unconditionally* do the load - but since we
might be loading fd[0] for an invalid case, we need to mask the result
too:

/* Do the load, then mask any invalid result */
file = rcu_dereference_raw(*fdentry);
file = (void *)(mask & (unsigned long)file);

but now we have done everything without any conditionals, and the only
conditional is now "did we load NULL" - which includes that "we masked
the bad value".
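
For reference, the generic fallback for array_index_mask_nospec() in
include/linux/nospec.h computes that 0/~0 mask branchlessly, roughly
like so (x86 overrides it with a two-instruction cmp/sbb sequence):

	/*
	 * Roughly the generic version: returns ~0UL when index < size and
	 * 0 otherwise, computed without a conditional branch that the CPU
	 * could mispredict and speculate past.
	 */
	static inline unsigned long array_index_mask_nospec(unsigned long index,
							    unsigned long size)
	{
		/*
		 * index < size: both terms of the OR have a clear sign bit,
		 * so the complement arithmetic-shifted right by
		 * BITS_PER_LONG-1 is all ones.  index >= size:
		 * (size - 1 - index) underflows and sets the sign bit, and
		 * the result is 0.
		 */
		return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
	}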

Then we just do that atomic_long_inc_not_zero() on the file, and
re-check the pointer chain we used.

I made files_lookup_fd_raw() do the same thing.

The end result is much nicer code generation at least from my quick
check. And I assume the regression will be gone, and hopefully even
turned into an improvement since this is so hot.

Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that
__get_file_rcu() does, and I don't get it. Both things come from
volatile accesses, I don't see the point of those games, but I also
didn't care, since it's no longer in a critical code path.

Christian?

NOTE! This patch is not well tested. I verified an earlier version of
this, but have been playing with it since, so caveat emptor.

IOW, I might have messed up some "trivial cleanup" when prepping for
sending it out...

  Linus
 fs/file.c               | 48 ++++++++++++++++++++++++++----------------
 include/linux/fdtable.h | 15 ++++++++++-----
 2 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..c74a6e8687d9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -959,31 +959,42 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		struct file *file;
 		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
 		struct file __rcu **fdentry;
+		unsigned long mask;
 
-		if (unlikely(fd >= fdt->max_fds))
+		/* Mask is a 0 for invalid fd's, ~0 for valid ones */
+		mask = array_index_mask_nospec(fd, fdt->max_fds);
+
+		/* fdentry points to the 'fd' offset, or fdt->fd[0] */
+		fdentry = fdt->fd + (fd & mask);
+
+		/* Do the load, then mask any invalid result */
+		file = rcu_dereference_raw(*fdentry);
+		file = (void *)(mask & (unsigned long)file);
+
+		if (unlikely(!file))
 			return NULL;
 
-		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
+		/*
+		 * Ok, we have a file pointer that was valid at
+		 * some point, but it might have become stale since.
+		 *
+		 * We need to confirm it by incrementing the refcount
+		 * and then check the lookup again.
+		 *
+		 * atomic_long_inc_not_zero() gives us a full memory
+		 * barrier. We only really need an 'acquire' one to
+		 * protect the loads below, but we don't have that.
+		 */
+		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+			continue;
 
 		/*
-		 * Ok, we have a file pointer. However, because we do
-		 * this all locklessly under RCU, we may be racing with
-		 * that file being closed.
-		 *
 		 * Such a race can take two forms:
 		 *
 		 *  (a) the file ref already went down to zero and the
 		 *  file hasn't been reused yet or the file count
 		 *  isn't zero but the file has already been reused.
-		 */
-		file = __get_file_rcu(fdentry);
-		if (unlikely(!file))
-			return NULL;
-
-		if (unlikely(IS_ERR(file)))
-		

[Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-19 Thread kernel test robot



Hello,

kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:


commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to 
SLAB_TYPESAFE_BY_RCU")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

testcase: will-it-scale
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 
2.90GHz (Cooper Lake) with 192G memory
parameters:

nr_task: 16
mode: thread
test: poll2
cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-lkp/202311201406.2022ca3f-oliver.s...@intel.com


Details are as below:
-------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231120/202311201406.2022ca3f-oliver.s...@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit: 
  93faf426e3 ("vfs: shave work on failed file open")
  0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74
---------------- ---------------------------
     %stddev        %change %stddev
         \             |       \
      0.01 ± 9%  +58125.6%    4.17 ±175%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     89056           -2.0%   87309        proc-vmstat.nr_slab_unreclaimable
     97958 ± 7%      -9.7%   88449 ± 4%   sched_debug.cpu.avg_idle.stddev
      0.00 ± 12%    +24.2%    0.00 ± 17%  sched_debug.cpu.next_balance.stddev
   6391048           -2.9% 6208584        will-it-scale.16.threads
    399440           -2.9%  388036        will-it-scale.per_thread_ops
   6391048           -2.9% 6208584        will-it-scale.workload
     19.99 ± 4%       -2.2   17.74        perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      1.27 ± 5%       +0.8    2.11 ± 3%   perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     32.69 ± 4%       +5.0   37.70        perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.00           +27.9   27.85        perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
     20.00 ± 4%       -2.3   17.75        perf-profile.children.cycles-pp.fput
      0.24 ± 10%      -0.1    0.18 ± 2%   perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.48 ± 5%       +0.5    1.98 ± 3%   perf-profile.children.cycles-pp.__fdget
     31.85 ± 4%       +6.0   37.86        perf-profile.children.cycles-pp.__fget_light
      0.00           +27.7   27.67        perf-profile.children.cycles-pp.__get_file_rcu
     30.90 ± 4%      -20.6   10.35 ± 2%   perf-profile.self.cycles-pp.__fget_light
     19.94 ± 4%       -2.4   17.53        perf-profile.self.cycles-pp.fput
      9.81 ± 4%       -2.4    7.42 ± 2%   perf-profile.self.cycles-pp.do_poll
      0.23 ± 11%      -0.1    0.17 ± 4%   perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.00           +26.5   26.48        perf-profile.self.cycles-pp.__get_file_rcu
 2.146e+10 ± 2%      +8.5% 2.329e+10 ± 2%  perf-stat.i.branch-instructions
      0.22 ± 14%      -0.0    0.19 ± 14%   perf-stat.i.branch-miss-rate%
 1.404e+10 ± 2%      +8.7% 1.526e+10 ± 2%  perf-stat.i.dTLB-stores
     70.87            -2.3   68.59         perf-stat.i.iTLB-load-miss-rate%
   5267608           -5.5% 4979133 ± 2%    perf-stat.i.iTLB-load-misses
   2102507           +5.4% 2215725         perf-stat.i.iTLB-loads
     18791 ± 3%     +10.5%   20757 ± 2%    perf-stat.i.instructions-per-iTLB-miss
    266.67 ± 2%      +6.8%  284.75 ± 2%    perf-stat.i.metric.M/sec
      0.01 ± 10%    -10.5%    0.01 ± 5%    perf-stat.overall.MPKI
      0.19            -0.0    0.17         perf-stat.overall.branch-miss-rate%
      0.65           -3.1%    0.63         perf-stat.overall.cpi
      0.00 ± 4%       -0.0    0.00 ± 4%    perf-stat.overall.dTLB-store-miss-rate%
     71.48            -2.3   69.21         perf-stat.overall.iTLB-load-miss-rate%
     18757          +10.0%   20629         perf-stat.overall.instructions-per-iTLB-miss
      1.54           +3.2%    1.59         perf-stat.overall.ipc
   4795147           +6.4% 5100406         perf-stat.overall.path-length
  2.14e+10 ± 2%      +8.5% 2.322e+10 ± 2%  perf-stat.ps.branch-instructions
   1.4e+10