Re: [PATCH] powerpc: Fix data corruption on IPI

Timothy Pearson Thu, 16 Nov 2023 23:40:06 -0800

----- Original Message -----
> From: "Michael Ellerman" <m...@ellerman.id.au>
> To: "Timothy Pearson" <tpear...@raptorengineering.com>, "linuxppc-dev" 
> <linuxppc-dev@lists.ozlabs.org>
> Cc: "Jens Axboe" <ax...@kernel.dk>
> Sent: Tuesday, November 14, 2023 6:14:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> Hi Timothy,
> 
> Thanks for debugging this, but I'm unclear why this is helping because
> we should already have a full barrier (hwsync) on both the sending and
> receiving side.
> 
> More below.

I've spent another few days poking at this, and think I might finally have 
something more solid in terms of what exactly is happening, but would like some 
feedback on the concept / how best to fix the potential problem.

As background, there are several worker threads both in userspace and in kernel 
mode.  Crucially, the main MariaDB data processing thread (the one that handles 
tasks like flushing dirty pages to disk) always runs on the same core as the 
io_uring kernel thread that picks up I/O worker creation requests and handles 
them via create_worker_cb().

Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring 
currently sets up the new process with create_io_thread(), and immediately uses 
an IPI to forcibly schedule the new process.  Because of the way the two 
threads interact, the new process ends up grabbing the CPU from the running 
MariaDB user thread; I've never seen it schedule on a different core.  If the 
timing is right in this process, things get trampled on in userspace and the 
database server either crashes or throws a corruption fault.

Through extensive debugging, I've narrowed this down to invalid state in the 
VSX registers on return to the MariaDB user thread from the new kernel thread.  
For some reason, it seems we don't restore FP state on return from the 
PF_IO_WORKER thread, and something in the kernel was busy writing new data to 
them.

A direct example I was able to observe is as follows:

xxspltd vs0,vs0,0      <-- vs0 now zeroed out
xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here 
due to the IPI
slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the 
value 0x820040000000000082004000
xxswapd vs8,vs0        <-- vs8 now has the wrong value
stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
stw     r9,116(r3)
stxvd2x vs8,r3,r0
...
CRASH

This is a very difficult race to hit, but MariaDB naturally does repetitive 
operations with VSX registers so it does eventually fail.  I ended up with a 
tight loop around glibc operations that use VSX to trigger the failure reliably 
enough to even understand what was going on.

As I am not as familiar with this part of the Linux kernel as with most other 
areas, what is the expected save/restore path for the FP/VSX registers around 
an IPI and associated forced thread switch?  If restore_math() is in the return 
path, note that MSR_FP is set in regs->msr.

Second question: should we even be using the VSX registers at all in kernel 
space?  Is this a side effect of io_uring interacting so closely with userspace 
threads, or something else entirely?

If I can get pointed somewhat in the right direction I'm ready to develop the 
rest of the fix for this issue...just trying to avoid another several days of 
slogging through the source to see what it's supposed to be doing in the first 
place. :)

Thanks!
Re: [PATCH] powerpc: Fix data corruption on IPI

Reply via email to