Hi Gavin,

On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <[email protected]> wrote:
>
> Hi Salil,
>
> On 10/1/25 11:01 AM, [email protected] wrote:
> > From: Salil Mehta <[email protected]>
> >
> > ARM CPU architecture does not allow CPUs to be plugged after system has
> > initialized. This is a constraint. Hence, the Kernel must know all the CPUs
> > being booted during its initialization. This applies to the Guest Kernel as
> > well and therefore, the number of KVM vCPU descriptors in the host must be
> > fixed at VM initialization time.
> >
> > Also, the GIC must know all the CPUs it is connected to during its
> > initialization, and this cannot change afterward. This must also be ensured
> > during the initialization of the VGIC in KVM. This is necessary because:
> >
> > 1. The association between GICR and MPIDR must be fixed at VM initialization
> >     time. This is represented by the register
> >     `GICR_TYPER(mp_affinity, proc_num)`.
> > 2. Memory regions associated with GICR, etc., cannot be changed (added,
> >     deleted, or modified) after the VM has been initialized. This is not an
> >     ARM architectural constraint but rather invites a difficult and messy
> >     change in VGIC data structures.
> >
> > To enable a hot-add–like model while preserving these constraints, the virt
> > machine may enumerate more CPUs than are enabled at boot using
> > `-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
> > administratively disabled at init). The topology remains fixed at VM
> > creation time; only the online/offline status may change later.
> >
> > Administratively disabled vCPUs are not realized in QOM until first enabled,
> > avoiding creation of unnecessary vCPU threads at boot. On large systems, this
> > reduces startup time proportionally to the number of disabled vCPUs. Once a
> > QOM vCPU is realized and its thread created, subsequent enable/disable actions
> > do not unrealize it. This behaviour was adopted following review feedback and
> > differs from earlier RFC versions.
> >
> > Co-developed-by: Keqian Zhu <[email protected]>
> > Signed-off-by: Keqian Zhu <[email protected]>
> > Signed-off-by: Salil Mehta <[email protected]>
> > ---
> >   accel/kvm/kvm-all.c    |  2 +-
> >   hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
> >   hw/core/qdev.c         | 17 ++++++++++
> >   include/hw/qdev-core.h | 19 +++++++++++
> >   include/system/kvm.h   |  8 +++++
> >   target/arm/cpu.c       |  2 ++
> >   target/arm/kvm.c       | 40 +++++++++++++++++++++-
> >   target/arm/kvm_arm.h   | 11 ++++++
> >   8 files changed, 168 insertions(+), 8 deletions(-)
> >
[...]
> >
> > +static void
> > +virt_setup_lazy_vcpu_realization(Object *cpuobj, VirtMachineState *vms)
> > +{
> > +    /*
> > +     * Present & administratively disabled vCPUs:
> > +     *
> > +     * These CPUs are marked offline at init via '-smp disabledcpus=N'. We
> > +     * intentionally do not realize them during the first boot, since it is
> > +     * not known if or when they will ever be enabled. The decision to enable
> > +     * such CPUs depends on policy (e.g. guided by SLAs or other deployment
> > +     * requirements).
> > +     *
> > +     * Realizing all disabled vCPUs up front would make boot time proportional
> > +     * to 'maxcpus', even if policy permits only a small subset to be enabled.
> > +     * This can lead to unacceptable boot delays in some scenarios.
> > +     *
> > +     * Instead, these CPUs remain administratively disabled and unrealized at
> > +     * boot, to be instantiated and brought online only if policy later allows
> > +     * it.
> > +     */
> > +
> > +    /* set this vCPU to be administratively 'disabled' in QOM */
> > +    qdev_disable(DEVICE(cpuobj), NULL, &error_fatal);
> > +
> > +    if (vms->psci_conduit != QEMU_PSCI_CONDUIT_DISABLED) {
> > +        object_property_set_int(cpuobj, "psci-conduit", vms->psci_conduit,
> > +                                NULL);
> > +    }
> > +
> > +    /*
> > +     * [!] Constraint: The ARM CPU architecture does not permit new CPUs
> > +     * to be added after system initialization.
> > +     *
> > +     * Workaround: Pre-create KVM vCPUs even for those that are not yet
> > +     * online i.e. powered-off, keeping them `parked` and in an
> > +     * `unrealized (at-least during boot time)` state within QEMU until
> > +     * they are powered-on and made online.
> > +     */
> > +    if (kvm_enabled()) {
> > +        kvm_arm_create_host_vcpu(ARM_CPU(cpuobj));
> > +    }
> > +}
> > +
> >   static void machvirt_init(MachineState *machine)
> >   {
> >       VirtMachineState *vms = VIRT_MACHINE(machine);
> > @@ -2319,10 +2362,6 @@ static void machvirt_init(MachineState *machine)
> >           Object *cpuobj;
> >           CPUState *cs;
> >
> > -        if (n >= smp_cpus) {
> > -            break;
> > -        }
> > -
> >           cpuobj = object_new(possible_cpus->cpus[n].type);
> >           object_property_set_int(cpuobj, "mp-affinity",
> >                                   possible_cpus->cpus[n].arch_id, NULL);
> > @@ -2427,8 +2466,34 @@ static void machvirt_init(MachineState *machine)
> >               }
> >           }
> >
> > -        qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
> > -        object_unref(cpuobj);
> > +        /* start secondary vCPUs in a powered-down state */
> > +        if(n && mc->has_online_capable_cpus) {
> > +            object_property_set_bool(cpuobj, "start-powered-off", true, NULL);
> > +        }
> > +
> > +        if (n < smp_cpus) {
> > +            /* 'Present' & 'Enabled' vCPUs */
> > +            qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
> > +            object_unref(cpuobj);
> > +        } else {
> > +            /* 'Present' & 'Disabled' vCPUs */
> > +            virt_setup_lazy_vcpu_realization(cpuobj, vms);
> > +        }
> > +
> > +        /*
> > +         * All possible vCPUs should have QOM vCPU Object pointer & arch-id.
> > +         * 'cpus_queue' (accessed via qemu_get_cpu()) contains only realized and
> > +         * enabled vCPUs. Hence, we must now populate the 'possible_cpus' list.
> > +         */
> > +        if (kvm_enabled()) {
> > +            /*
> > +             * Override the default architecture ID with the one retrieved
> > +             * from KVM, as they currently differ.
> > +             */
> > +            machine->possible_cpus->cpus[n].arch_id =
> > +                arm_cpu_mp_affinity(ARM_CPU(cs));
> > +        }
> > +        machine->possible_cpus->cpus[n].cpu = cs;
> >       }
> >
> >       /* Now we've created the CPUs we can see if they have the hypvirt timer */
> > diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> > index 8502d6216f..5816abae39 100644
> > --- a/hw/core/qdev.c
> > +++ b/hw/core/qdev.c
> > @@ -309,6 +309,23 @@ void qdev_assert_realized_properly(void)
> >                                      qdev_assert_realized_properly_cb, NULL);
> >   }
> >

[...]

> > +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> > +{
> > +    CPUState *cs = CPU(cpu);
> > +    unsigned long vcpu_id = cs->cpu_index;
> > +    int ret;
> > +
> > +    ret = kvm_create_vcpu(cs);
> > +    if (ret < 0) {
> > +        error_report("Failed to create host vcpu %ld", vcpu_id);
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * Initialize the vCPU in the host. This will reset the sys regs
> > +     * for this vCPU and related registers like MPIDR_EL1 etc. also
> > +     * get programmed during this call to host. These are referenced
> > +     * later while setting device attributes of the GICR during GICv3
> > +     * reset.
> > +     */
> > +    ret = kvm_arch_init_vcpu(cs);
> > +    if (ret < 0) {
> > +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * park the created vCPU. shall be used during kvm_get_vcpu() when
> > +     * threads are created during realization of ARM vCPUs.
> > +     */
> > +    kvm_park_vcpu(cs);
> > +}
> > +
>
> I don't think we're able to simply call kvm_arch_init_vcpu() in the lazily realized
> path. Otherwise, it can trigger a crash dump on my Nvidia's grace-hopper machine where
> SVE is supported by default.

Thanks for reporting this. That is not true, though. As long as we initialize
KVM correctly and finalize features like SVE beforehand, we should be fine. In
fact, this is precisely what we are trying to do right now.

To understand the crash, I need a bit more info.

1#  Is it happening because KVM_ARM_VCPU_INIT is failing? If yes, then can you
    check within KVM whether it is failing because
     a. the features specified by QEMU do not match the defaults within KVM
        (hint: check kvm_vcpu_init_check_features()), or
     b. KVM is complaining about an init feature change (see
        kvm_vcpu_init_changed())?
2#  Or is it happening while setting the vector length or finalizing the
    features?

int kvm_arch_init_vcpu(CPUState *cs)
{
    [...]
    /* Do KVM_ARM_VCPU_INIT ioctl */
    ret = kvm_arm_vcpu_init(cpu);   ---->[1]
    if (ret) {
        return ret;
    }
    if (cpu_isar_feature(aa64_sve, cpu)) {
        ret = kvm_arm_sve_set_vls(cpu); ---->[2]
        if (ret) {
            return ret;
        }
        ret = kvm_arm_vcpu_finalize(cpu, KVM_ARM_VCPU_SVE); ---->[3]
        if (ret) {
            return ret;
        }
    }
    [...]
}

I think it's happening because the vector length is left uninitialized. That
initialization happens in arm_cpu_finalize_features(), which I forgot to call
before the KVM finalize step.
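
To illustrate the ordering I suspect matters here, below is a toy model in
plain C. It is not QEMU code: ToyCPU and the toy_* helpers are made up for
illustration, the vector length value is arbitrary, and the -EINVAL return is
only my assumption about how an unset vector length would surface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the vCPU state; NOT the real ARMCPU structure. */
typedef struct ToyCPU {
    bool has_sve;          /* CPU advertises SVE */
    unsigned sve_max_vq;   /* 0 means "vector length not resolved yet" */
} ToyCPU;

/* Models the role of arm_cpu_finalize_features(): resolve the SVE length. */
static void toy_finalize_features(ToyCPU *cpu)
{
    assert(cpu != NULL);
    if (cpu->has_sve && cpu->sve_max_vq == 0) {
        cpu->sve_max_vq = 4;   /* arbitrary illustrative value */
    }
}

/* Models the SVE part of vCPU init: refuses an unresolved vector length. */
static int toy_kvm_init_vcpu(const ToyCPU *cpu)
{
    assert(cpu != NULL);
    if (cpu->has_sve && cpu->sve_max_vq == 0) {
        return -EINVAL;        /* assumed failure mode, not confirmed */
    }
    return 0;
}

/* Pre-create path: optionally finalize features before the init step. */
static int toy_create_host_vcpu(ToyCPU *cpu, bool finalize_first)
{
    if (finalize_first) {
        toy_finalize_features(cpu);
    }
    return toy_kvm_init_vcpu(cpu);
}
```

In this model, calling toy_create_host_vcpu() with finalize_first set to false
fails on an SVE-capable CPU, while finalizing first lets the init step succeed.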

>
> kvm_arch_init_vcpu() is supposed to be called in the realization path in current
> implementation (without this series) because the parameters (features) to KVM_ARM_VCPU_INIT
> is populated at vCPU realization time.

Not necessarily. It is just meant to initialize KVM. If we take care of the
KVM requirements in the same way the realize path does, we should be fine.
Can you try adding the patch below to your code and test whether it works?

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index c4b68a0b17..1091593478 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1068,6 +1068,9 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
         abort();
     }

+    /* finalize the features like SVE, SME etc. */
+    arm_cpu_finalize_features(cpu, &error_abort);
+
     /*
      * Initialize the vCPU in the host. This will reset the sys regs
      * for this vCPU and related registers like MPIDR_EL1 etc. also




>
> $ home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
>    --enable-kvm -machine virt,gic-version=3 -cpu host               \
>    -smp cpus=4,disabledcpus=2 -m 1024M                              \
>    -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
>    -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
> qemu-system-aarch64: Failed to initialize host vcpu 4
> Aborted (core dumped)
>
> Backtrace
> =========
> (gdb) bt
> #0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
> #1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
> #2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
> #3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
>      at ../target/arm/kvm.c:1081
> #4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
>      at ../hw/arm/virt.c:2483
> #5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
> #6  0x0000aaaab160f220 in machine_run_board_init
>      (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
> #7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
> #8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
>      at ../system/vl.c:2821
> #9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
> #10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71


Thank you for this. Please let me know whether the above fix works, and share
the return values if you still encounter errors.
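
If it helps with collecting the return values, something along these lines
could be used to include them in the failure message. This is only a sketch:
format_vcpu_init_error() is a hypothetical helper, not an existing QEMU
function, and it assumes the failing call returns a negative errno-style value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of QEMU): build a failure message that
 * carries both the errno description and the raw negative return value.
 * Returns whatever snprintf() returns.
 */
static int format_vcpu_init_error(char *buf, size_t len, long vcpu_id, int ret)
{
    assert(buf != NULL && ret < 0);
    return snprintf(buf, len,
                    "Failed to initialize host vcpu %ld: %s (ret=%d)",
                    vcpu_id, strerror(-ret), ret);
}
```

The resulting string could then be passed to the existing error reporting
call before aborting, so that the log shows which step failed and why.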

Many thanks!
Salil.


>
> Thanks,
> Gavin
>
> >   /*
> >    * Update KVM's MP_STATE based on what QEMU thinks it is
> >    */
> > @@ -1876,7 +1908,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >           return -EINVAL;
> >       }
> >
> > -    qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
> > +    /*
> > +     * Install VM change handler only when vCPU thread has been spawned
> > +     * i.e. vCPU is being realized
> > +     */
> > +    if (cs->thread_id) {
> > +        qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
> > +    }
> >
> >       /* Determine init features for this CPU */
> >       memset(cpu->kvm_init_features, 0, sizeof(cpu->kvm_init_features));
> > diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
> > index 6a9b6374a6..ec9dc95ee8 100644
> > --- a/target/arm/kvm_arm.h
> > +++ b/target/arm/kvm_arm.h
> > @@ -98,6 +98,17 @@ bool kvm_arm_cpu_post_load(ARMCPU *cpu);
> >   void kvm_arm_reset_vcpu(ARMCPU *cpu);
> >
> >   struct kvm_vcpu_init;
> > +
> > +/**
> > + * kvm_arm_create_host_vcpu:
> > + * @cpu: ARMCPU
> > + *
> > + * Called to pre-create possible KVM vCPU within the host during the
> > + * `virt_machine` initialization phase. This pre-created vCPU will be parked and
> > + * will be reused when ARM QOM vCPU is actually hotplugged.
> > + */
> > +void kvm_arm_create_host_vcpu(ARMCPU *cpu);
> > +
> >   /**
> >    * kvm_arm_create_scratch_host_vcpu:
> >    * @fdarray: filled in with kvmfd, vmfd, cpufd file descriptors in that order
>
