On Mon, 5 Jan 2026, Liam R. Howlett wrote:

> I may be missing context because I didn't get 1/3 of this patch set,
> nor can I find it in my ML searching.  Nor did I get the cover letter,
> or find it.  Is this series threaded?

You can ignore the patch 1/3, it just changes memory management test 
suite.

> What you are doing is changing a really horrible loop across all VMAs,
> that happens 4 times, to a less interruptible method.
> 
> I'm not sure I'm okay with this.  Everyone else does seem fine with it,
> because userspace basically never checks error codes for a retry (or
> really anything, and sometimes not even for an error at all).

Everyone else does seem fine because they don't use periodic signals :-)

OpenCL is not the first thing that got broken by periodic signals. In the 
past, I found bugs in Linux/alpha, Linux/hppa, Cygwin, Intel Software 
Developer's Emulator regarding periodic signals. fork() was also buggy and 
it was fixed by c3ad2c3b02e953ead2b8d52a0c9e70312930c3d0.

> But this could potentially have larger consequences for those
> applications that register signal handlers, couldn't it?  That is, they
> expect to get a return based on some existing code but now it won't
> return and the application is forced to wait for all locks to be taken
> regardless of how long it takes - or to force kill the application?

Do you have any userspace application that expects open("/dev/kfd") to be 
interrupted with -EINTR and breaks when it isn't?

> We regularly get people requesting the default number of vmas be
> increased.  This means that processes that approach max_map_count will
> wait until 4 loops through the vmas before they can be interrupted, or
> they have to kill the process.
> 
> > 
> > For example, this bug happens when using OpenCL on AMDGPU. Sometimes,
> > probing the OpenCL device fails (strace shows that open("/dev/kfd")
> > failed with -EINTR). Sometimes we get the message "amdgpu:
> > init_user_pages: Failed to register MMU notifier: -4" in the syslog.
> 
> If you only get the error message sometimes, does that mean there is
> another signal check that isn't covered by this change - or another call
> path?

This call path is also triggered by -EINTR from mm_take_all_locks: 
"init_user_pages -> amdgpu_hmm_register -> mmu_interval_notifier_insert -> 
mmu_notifier_register -> __mmu_notifier_register -> mm_take_all_locks -> 
return -EINTR". I am not expert in the GPU code, so I don't know how much 
serious it is.

> > The bug can be reproduced with the following program.
> > 
> > To run this program, you need AMD graphics card and the package
> > "rocm-opencl" installed. You must not have the package "mesa-opencl-icd"
> > installed, because it redirects the default OpenCL implementation to
> > itself.
> 
> I'm not saying it's wrong to change the signal handling, but this is
> very much working around a bug in userspace constantly hammering a task
> with signals and then is surprised there is a response that the kernel
> was interrupted.
> 
> This seems to imply that all signal handling should only happen on fatal
> signals?

No - the kernel should do what applications expect. open is (according to 
the man page) supposed to be interrupted when opening slow devices (for 
example fifo). I'm wondering whether /dev/kfd should be considered a slow 
device or not.

> ...
> > 
> > I'm submitting this patch for the stable kernels, because this bug may
> > cause random failures in any code that calls mm_take_all_locks.
> 
> They aren't random failures, they are a response to a signal sent to the
> process that may be taking a very long time to do something.
> 
> I really don't see how continuously sending signals at a short interval
> interrupting system calls can be considered random failures, especially
> when the return is -EINTR which literally means "Interrupted system
> call".  How else would you interrupt a system call, if not a signal?

The AMDGPU OpenCL implementation attempts to open /dev/kfd and if it gets 
-EINTR, it behaves as if OpenCL were unavailable - it won't report itself 
in clGetPlatformIDs and it will make clGetDeviceIDs fail.

So, we are dealing with random failures - any single signal received at 
wrong time can make OpenCL fail.

Even if I disabled the periodic timer, the failure could be triggered by 
other signals, for example SIGWINCH when the user resizes the terminal, or 
SIGCHLD when a subprocess exits.

> I feel like we are making the code less versatile to work around the
> fact that userspace didn't realise that system calls could be
> interrupted without a fatal signal.
> 
> And from that view, I consider this a functional change and not a bug
> fix.
> 
> Thanks,
> Liam

In practice, I use 10ms timer and I get occasional OpenCL failures. In the 
test example, I used 50us timer, so that it is reproduced reliably - but 
decreasing the timer frequency doesn't fix the failure, it just makes it 
happen less often.

Mikulas

Reply via email to