On Mon, Jan 05, 2026 at 11:42:26AM +0100, Vlastimil Babka wrote:
> On 1/4/26 22:17, Mikulas Patocka wrote:
> > If a process sets up a timer that periodically sends a signal in short
> > intervals and if it executes some kernel code that calls
> > mm_take_all_locks, we get random -EINTR failures.
> >
> > The function mm_take_all_locks fails with -EINTR if there is a pending
> > signal. The -EINTR is propagated up the call stack to userspace, and
> > userspace fails when it receives this error.
> >
> > In order to fix these failures, this commit changes
> > signal_pending(current) to fatal_signal_pending(current) in
> > mm_take_all_locks, so that it is interrupted only if the signal is
> > actually killing the process.
> >
> > For example, this bug happens when using OpenCL on AMDGPU. Sometimes,
> > probing the OpenCL device fails (strace shows that open("/dev/kfd")
> > failed with -EINTR). Sometimes we get the message "amdgpu:
> > init_user_pages: Failed to register MMU notifier: -4" in the syslog.
> >
> > The bug can be reproduced with the following program.
> >
> > To run this program, you need AMD graphics card and the package
> > "rocm-opencl" installed. You must not have the package "mesa-opencl-icd"
> > installed, because it redirects the default OpenCL implementation to
> > itself.
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <signal.h>
> > #include <sys/time.h>
> >
> > #define CL_TARGET_OPENCL_VERSION     300
> > #include <CL/opencl.h>
> >
> > static void fn(void)
> > {
> >     while (1) {
> >             int32_t err;
> >             cl_device_id device;
> >             err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
> >             if (err != CL_SUCCESS) {
> >                     fprintf(stderr, "clGetDeviceIDs failed: %d\n", err);
> >                     exit(1);
> >             }
> >             write(2, "-", 1);
> >     }
> > }
> >
> > static void alrm(int sig)
> > {
> >     write(2, ".", 1);
> > }
> >
> > int main(void)
> > {
> >     struct itimerval it;
> >     struct sigaction sa;
> >     memset(&sa, 0, sizeof sa);
> >     sa.sa_handler = alrm;
> >     sa.sa_flags = SA_RESTART;
> >     sigaction(SIGALRM, &sa, NULL);
> >     it.it_interval.tv_sec = 0;
> >     it.it_interval.tv_usec = 50;
> >     it.it_value.tv_sec = 0;
> >     it.it_value.tv_usec = 50;
> >     setitimer(ITIMER_REAL, &it, NULL);
> >     fn();
> >     return 1;
> > }
> >
> > I'm submitting this patch for the stable kernels, because this bug may
> > cause random failures in any code that calls mm_take_all_locks.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> > Link: https://lists.freedesktop.org/archives/amd-gfx/2025-November/133141.html
> > Link: https://yhbt.net/lore/linux-mm/[email protected]/T/#u
> > Cc: [email protected]
> > Fixes: 7906d00cd1f6 ("mmu-notifiers: add mm_take_all_locks() operation")
>
> Acked-by: Vlastimil Babka <[email protected]>

Also feel free to add:

Reviewed-by: Lorenzo Stoakes <[email protected]>

On the assumption that nobody is explicitly relying on the bizarre-o 'interrupt
this if _any_ signal arises' behaviour. But since it's making _actual workloads_
break I don't see how this can be wrong.

>
> This makes sense to me as a backportable bugfix. But I wonder if going
> forward we should rather make all that locking killable instead of the
> hopeful checks between individual lock attempts.

Agreed. But anything like that should be a follow-up; let's get this
backported first.
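
For the follow-up, I imagine something along these lines (an untested,
pseudocode-level sketch; mmap_write_lock_killable() is real, but
vma_start_write_killable() is hypothetical here unless the recent killable
variant you mention already covers it):

```c
/*
 * Untested sketch: instead of sprinkling signal_pending() checks between
 * individual lock attempts, make each blocking acquisition itself killable,
 * so the loop only bails out (with -EINTR) when a fatal signal arrives
 * while actually waiting on a lock.
 */
int mm_take_all_locks_killable(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	mmap_assert_write_locked(mm);

	for_each_vma(vmi, vma) {
		/* hypothetical killable variant of vma_start_write() */
		if (vma_start_write_killable(vma))
			goto out_unlock;
	}

	/* ... same treatment for the i_mmap_rwsem and anon_vma loops ... */

	return 0;

out_unlock:
	mm_drop_all_locks(mm);
	return -EINTR;
}
```

That would also drop the current behaviour where a signal arriving between
two uncontended lock attempts aborts the whole operation.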

>
> >
> > ---
> >  mm/vma.c |    8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > Index: mm/mm/vma.c
> > ===================================================================
> > --- mm.orig/mm/vma.c        2026-01-04 21:19:13.000000000 +0100
> > +++ mm/mm/vma.c     2026-01-04 21:19:13.000000000 +0100
> > @@ -2166,14 +2166,14 @@ int mm_take_all_locks(struct mm_struct *
> >      * is reached.
> >      */
> >     for_each_vma(vmi, vma) {
> > -           if (signal_pending(current))
> > +           if (fatal_signal_pending(current))
> >                     goto out_unlock;
> >             vma_start_write(vma);
>
> E.g. here I think we already added a killable variant recently?

Definitely unbackportable ;)

>
> >     }
> >
> >     vma_iter_init(&vmi, mm, 0);
> >     for_each_vma(vmi, vma) {
> > -           if (signal_pending(current))
> > +           if (fatal_signal_pending(current))
> >                     goto out_unlock;
> >             if (vma->vm_file && vma->vm_file->f_mapping &&
> >                             is_vm_hugetlb_page(vma))
> > @@ -2182,7 +2182,7 @@ int mm_take_all_locks(struct mm_struct *
> >
> >     vma_iter_init(&vmi, mm, 0);
> >     for_each_vma(vmi, vma) {
> > -           if (signal_pending(current))
> > +           if (fatal_signal_pending(current))
> >                     goto out_unlock;
> >             if (vma->vm_file && vma->vm_file->f_mapping &&
> >                             !is_vm_hugetlb_page(vma))
> > @@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
> >
> >     vma_iter_init(&vmi, mm, 0);
> >     for_each_vma(vmi, vma) {
> > -           if (signal_pending(current))
> > +           if (fatal_signal_pending(current))
> >                     goto out_unlock;
> >             if (vma->anon_vma)
> >                     list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> >
>
>
