On Tue, Sep 23, 2025 at 01:52:09PM +0200, Sumanth Korikkar wrote:
> Hi Lorenzo,
>
> The following test causes the kernel to enter a blocked state,
> suggesting an issue related to locking order. I was able to reproduce
> this behavior in certain test runs.
>
> Test case:
> git clone https://github.com/libhugetlbfs/libhugetlbfs.git
> cd libhugetlbfs ; ./configure
> make -j32
> cd tests
> echo 100 > /proc/sys/vm/nr_hugepages
> mkdir -p /test-hugepages && mount -t hugetlbfs nodev /test-hugepages
> ./run_tests.py <in a loop>
> ...
> shm-fork 10 100 (1024K: 64):    PASS
> set shmmax limit to 104857600
> shm-getraw 100 /dev/full (1024K: 32):
> shm-getraw 100 /dev/full (1024K: 64):   PASS
> fallocate_stress.sh (1024K: 64):  <blocked>
>
> Blocked task state below:
>
> task:fallocate_stres state:D stack:0     pid:5106  tgid:5106  ppid:5103
> task_flags:0x400000 flags:0x00000001
> Call Trace:
>  [<00000255adc646f0>] __schedule+0x370/0x7f0
>  [<00000255adc64bb0>] schedule+0x40/0xc0
>  [<00000255adc64d32>] schedule_preempt_disabled+0x22/0x30
>  [<00000255adc68492>] rwsem_down_write_slowpath+0x232/0x610
>  [<00000255adc68922>] down_write_killable+0x52/0x80
>  [<00000255ad12c980>] vm_mmap_pgoff+0xc0/0x1f0
>  [<00000255ad164bbe>] ksys_mmap_pgoff+0x17e/0x220
>  [<00000255ad164d3c>] __s390x_sys_old_mmap+0x7c/0xa0
>  [<00000255adc60e4e>] __do_syscall+0x12e/0x350
>  [<00000255adc6cfee>] system_call+0x6e/0x90
> task:fallocate_stres state:D stack:0     pid:5109  tgid:5106  ppid:5103
> task_flags:0x400040 flags:0x00000001
> Call Trace:
>  [<00000255adc646f0>] __schedule+0x370/0x7f0
>  [<00000255adc64bb0>] schedule+0x40/0xc0
>  [<00000255adc64d32>] schedule_preempt_disabled+0x22/0x30
>  [<00000255adc68492>] rwsem_down_write_slowpath+0x232/0x610
>  [<00000255adc688be>] down_write+0x4e/0x60
>  [<00000255ad1c11ec>] __hugetlb_zap_begin+0x3c/0x70
>  [<00000255ad158b9c>] unmap_vmas+0x10c/0x1a0
>  [<00000255ad180844>] vms_complete_munmap_vmas+0x134/0x2e0
>  [<00000255ad1811be>] do_vmi_align_munmap+0x13e/0x170
>  [<00000255ad1812ae>] do_vmi_munmap+0xbe/0x140
>  [<00000255ad183f86>] __vm_munmap+0xe6/0x190
>  [<00000255ad166832>] __s390x_sys_munmap+0x32/0x40
>  [<00000255adc60e4e>] __do_syscall+0x12e/0x350
>  [<00000255adc6cfee>] system_call+0x6e/0x90
>
>
> Thanks,
> Sumanth

(been on holiday for a couple weeks and last week was a catch-up! :)

So having looked into this, the issue is that hugetlbfs exposes a per-VMA
hugetlbfs lock which can be taken via the rmap.

So, while faults are disallowed until the VMA is fully set up, rmap access is
not, and therefore there's a race between setting up the hugetlbfs lock and the
rmap trying to take/release it.

It's a real edge case as it's kind of unusual to have this requirement during
initial custom mmap, but to account for this and for any other users which might
require it, I have resolved this by introducing the ability to hold on to the
rmap lock until the VMA is fully set up.

The window is very small, but obviously it's one we have to account for :)

This is the most correct solution I think, as it prevents any confusion as to
the state of the lock: rmap users simply cannot access the VMA until it is
established.
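
To illustrate the idea, here is a rough user-space sketch using pthreads. It is
only an analogy and not the kernel change itself: every name in it (toy_vma,
toy_mapping, toy_rmap_walk, toy_mmap_setup) is made up for the example, and a
plain mutex stands in for both the rmap lock and the per-VMA hugetlbfs lock. The
point it tries to show is that the VMA is linked and its private lock is
initialised within one critical section, so a walker never sees a
half-initialised VMA.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
struct toy_vma {
	pthread_mutex_t *private_lock;	/* models the per-VMA hugetlbfs lock */
	struct toy_vma *next;
};

struct toy_mapping {
	pthread_mutex_t rmap_lock;	/* models the lock rmap walkers take */
	struct toy_vma *vmas;		/* VMAs reachable via the rmap */
};

/*
 * Models an rmap walker. Returns false if it encountered a VMA whose
 * private lock had not been set up yet, i.e. the race described above.
 */
static bool toy_rmap_walk(struct toy_mapping *m)
{
	bool ok = true;

	pthread_mutex_lock(&m->rmap_lock);
	for (struct toy_vma *v = m->vmas; v; v = v->next) {
		if (!v->private_lock) {
			ok = false;	/* half-initialised VMA was visible */
			continue;
		}
		pthread_mutex_lock(v->private_lock);
		pthread_mutex_unlock(v->private_lock);
	}
	pthread_mutex_unlock(&m->rmap_lock);

	return ok;
}

/*
 * Models mmap setup with the fix applied: the rmap lock is held from the
 * moment the VMA is linked until its private lock is fully initialised.
 */
static void toy_mmap_setup(struct toy_mapping *m, struct toy_vma *v,
			   pthread_mutex_t *lock)
{
	assert(lock != NULL);

	pthread_mutex_lock(&m->rmap_lock);

	v->private_lock = NULL;
	v->next = m->vmas;
	m->vmas = v;			/* linked, but walkers are still blocked */

	pthread_mutex_init(lock, NULL);
	v->private_lock = lock;		/* the "custom mmap" setup step */

	pthread_mutex_unlock(&m->rmap_lock);	/* only now can walkers proceed */
}
```

If the unlock in toy_mmap_setup() were instead moved to just after the VMA is
linked, a concurrent toy_rmap_walk() could observe private_lock == NULL, which
is the toy equivalent of the window described above.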

I am putting the finishing touches on a respin with this fix included, and will
cc you on it.

Cheers, Lorenzo
