Hi Jonathan,

On 11/5/25 7:02 PM, Jonathan Cameron wrote:

[...]


I already had a prototype of a per-vCPU error source, which works fine for
the 64KB-host-4KB-guest case. However, it doesn't work for huge pages. For
example, a problematic 512MB huge page can cause a heavy memory error storm
in QEMU which we absolutely can't handle.

1. Start the VM with hugetlb pages

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64                        \
-accel kvm -machine virt,gic-version=host,nvdimm=on,ras=on                     \
-cpu host -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1         \
-m 4096M,slots=16,maxmem=128G                                                  \
-object memory-backend-file,id=mem0,prealloc=on,mem-path=/dev/hugepages-524288kB,size=4096M \
-numa node,nodeid=0,cpus=0-7,memdev=mem0                                       \

2. Run 'victim -d' on guest

guest$ ./victim -d
physical address of (0xffff889d6000) = 0x11a7da000
Hit any key to trigger error:

3. Inject error from host

host$ errinjct 0x11a7da000

4. QEMU crashes with the error message "Bus error (core dumped)", triggered
by the following path.

sigbus_handler
    kvm_on_sigbus_vcpu           // have_sigbus_pending = 1
    sigbus_reraise

To me this sounds like something that should not be happening on the host
unless a real memory error is detected that blows away the whole of / most of
a huge page. I'm not sure we care about surviving that case if it isn't
mapped using hugetlb/DAX or similar in the guest (so contiguous in both with
contained impact in both).

I assume the issue is backing with hugetlbfs, which doesn't have a sub-huge-page
granularity for poison tracking.  I vaguely recall an effort to solve that:
https://lore.kernel.org/linux-mm/[email protected]/
was the first thing google threw me. Looks like it got to v2:
https://lore.kernel.org/linux-mm/[email protected]/

+CC James.


For this particular case the guest memory is backed by 512MB hugetlb pages,
so there are 8 hugetlb pages since the guest has 4GB of memory. I agree it's
impossible to recover from this extreme situation, for a couple of reasons:
(1) a problematic huge page is likely to be shared by multiple vCPUs, so
multiple SIGBUS signals can be raised at once, which we're unable to handle;
(2) the instructions (TEXT section) of the guest's applications or kernel can
reside in the problematic huge page, so any instruction fetch just leads to a
SIGBUS signal, meaning the vCPUs can't continue their execution.

I'm summarizing my findings for the above case, to make this thread complete
at least.

Only one pending SIGBUS signal is allowed by QEMU in the current
implementation. Otherwise, it crashes in sigbus_handler() via the SIGBUS
signal sent from sigbus_reraise().

  qemu
  ====
  sigbus_handler
    kvm_on_sigbus_vcpu
      have_sigbus_pending = true;
      qatomic_set(&cpu->exit_request, true)
           :
  kvm_cpu_exec
    kvm_cpu_kick_self
      kvm_cpu_kick
        qatomic_set(&cpu->kvm_run->immediate_exit, 1);
    kvm_vcpu_ioctl                                       // Return immediately
    kvm_arch_on_sigbus_vcpu
    have_sigbus_pending = false;

There are two SIGBUS signals raised by the host before the target vCPU can be
stopped. The first one is raised by the host when the memory error is handled.

  host
  ====
  memory_failure
    try_memory_failure_hugetlb
      get_huge_page_for_hwpoison
        __get_huge_page_for_hwpoison
          folio_set_hugetlb_hwpoison
    hwpoison_user_mappings
      collect_procs                                     // Collect tasks using the folio
      unmap_poisoned_folio
        try_to_unmap                                    // TTU_HWPOISON
          try_to_unmap_one
            mmu_notifier_invalidate_range_start
            swp_entry_to_pte(make_hwpoison_entry(subpage))
            set_huge_pte_at                             // Poisoned PMD
            mmu_notifier_invalidate_range_end
      kill_procs                                        // Raise SIGBUS
    identify_page_state


The second one is raised by the stage-2 page fault handler due to the
poisoned PMD.

  kvm_handle_guest_abort
    user_mem_abort
      __kvm_faultin_pfn
        kvm_follow_pfn
          hva_to_pfn
            hva_to_pfn_fast
            hva_to_pfn_slow
              get_user_pages_unlocked
                __get_user_pages_locked
                  __get_user_pages
                    follow_page_mask                      // No PMD mapping
                    faultin_page
                      handle_mm_fault
                        hugetlb_fault
                          is_hugetlb_entry_hwpoisoned     // Return VM_FAULT_HWPOISON_LARGE
      kvm_send_hwpoison_signal                            // Raise SIGBUS

Thanks,
Gavin

