On Tue, 6 Jan 2026, Liam R. Howlett wrote:

> * Mikulas Patocka <[email protected]> [260105 15:08]:
> > 
> > > If you only get the error message sometimes, does that mean there is
> > > another signal check that isn't covered by this change - or another call
> > > path?
> > 
> > This call path is also triggered by -EINTR from mm_take_all_locks: 
> > "init_user_pages -> amdgpu_hmm_register -> mmu_interval_notifier_insert -> 
> > mmu_notifier_register -> __mmu_notifier_register -> mm_take_all_locks -> 
> > return -EINTR". I am not expert in the GPU code, so I don't know how much 
> > serious it is.
> 
> Okay, so the other call paths also end up getting the -EINTR from this
> function?  Can you please add that detail to the commit message?

Yes. I'd like to ask the GPU people to look at it and say how much damage 
this -EINTR could do. I don't know - I just saw the messages "Failed to 
register MMU notifier: -4" in the syslog.

> This means that -EINTR can no longer be returned from open(), right?
> Otherwise you are just reducing a race condition between open() and a
> signal entering from your timer.

EINTR can still be returned from open() in the cases where it has 
historically behaved this way - such as opening a fifo when no other 
process has the other end open.

But I think that opening /dev/kfd doesn't fall into this category.

NFS has an "intr" flag that makes the filesystem syscalls interruptible by 
signals. It is off by default, because many programs don't expect EINTR 
when opening, reading or writing plain files on a filesystem.

> Any other -EINTR system call will also cause you problems since you
> continuously send signals to your process, so we'll have to change them
> all for this to work?

I use SA_RESTART for the signals. And I retry all the syscalls on EINTR 
just in case SA_RESTART didn't work. So, I don't experience random 
failures in my code due to the periodic signal.
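
A minimal sketch of that pattern, for illustration only (this is not the 
actual code from the program discussed here; the names on_tick, 
install_restarting_handler and open_retry are hypothetical):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical handler for the periodic signal; does nothing here. */
static void on_tick(int sig)
{
	(void)sig;
}

/* Install the handler with SA_RESTART so that interrupted syscalls
 * are restarted by the kernel where the syscall supports it. */
static int install_restarting_handler(int signo)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_tick;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	return sigaction(signo, &sa, NULL);
}

/* Retry open() on EINTR in case the restart did not happen. */
static int open_retry(const char *path, int flags)
{
	int fd;

	do {
		fd = open(path, flags);
	} while (fd == -1 && errno == EINTR);
	return fd;
}
```

This only protects syscalls issued by the caller's own code; a library 
that treats EINTR as a fatal error is not helped by it.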

But there is code that I have no control over - such as the OpenCL shared 
library.

> This is the userspace ignoring what the error code means and just
> aborting on any error.  This is a change in behaviour on the kernel side
> to work around what they are doing.
> 
> It also sounds like it can be avoided by userspace not sending signals
> during the open process, or even to

So far, I worked around this issue by blocking all signals around 
clGetPlatformIDs and clGetDeviceIDs - but this is a hack.

> retry at a higher level if a recoverable error occurs.

If clGetDeviceIDs fails and I call clGetDeviceIDs again, it doesn't even 
attempt to open /dev/kfd again and fails right away. So, I can't work 
around it by retrying it.

> > Even if I disabled the periodic timer, the failure could be triggered by 
> > other signals, for example SIGWINCH when the user resizes the terminal, or 
> > SIGCHLD when a subprocess exits.
> 
> Those are also not random, they are expected signals caused by events.

From the process's point of view, they are random - the process doesn't 
know when the user will drag the corner of the terminal window and resize 
it. If the process spawns a subprocess, it cannot predict when the 
subprocess will exit and SIGCHLD will be delivered.

If we don't change it, we end up with an unreliable software stack that 
can fail during rare events, such as dragging the corner of the window.

> I'm trying to say this git commit message is wrong and misleading.

OK, so I'll try to rewrite the commit message and submit version 4 of the 
patch.

Mikulas
