Thiago Jung Bauermann <bauer...@linux.ibm.com> writes:
> Alex Bennée <alex.ben...@linaro.org> writes: > >> Thiago Jung Bauermann <bauer...@linux.ibm.com> writes: >> >>> Eduardo Habkost <ehabk...@redhat.com> writes: >>> >>>> On Wed, Jul 08, 2020 at 09:11:55PM +0100, Peter Maydell wrote: >>>>> On Wed, 8 Jul 2020 at 18:36, Eduardo Habkost <ehabk...@redhat.com> wrote: >>>>> > >>>>> > On Wed, Jul 08, 2020 at 06:09:49PM +0100, Peter Maydell wrote: >>>>> > > Exactly. It appears that there's a bug in our mechanisms, >>>>> > > which is why I'm suggesting that the right thing is >>>>> > > to fix that bug rather than marking the CPU as halted >>>>> > > earlier in the reset process so that the KVM_RUN happens >>>>> > > to do nothing... >>>>> > >>>>> > I agree this is necessary, but it doesn't seem sufficient. >>>>> > >>>>> > Having cpu_reset() set halted=0 on spapr (and probably other >>>>> > machines) is also a bug, as it could still trigger unwanted >>>>> > KVM_RUN when cpu_reset() returns (and before machine code sets >>>>> > halted=1). >>>>> >>>>> The Arm handling of starting-halted sets halted=1 within cpu_reset, >>>>> based on whether the CPU object was created with a >>>>> "start-powered-off" property. >>>> >>>> Making this mechanism generic sounds like a good idea. >>> >>> I'll take a stab at doing that and using it for the spapr machine. >>> >>>>> I'm not sure in practice that anything can get in asynchronously >>>>> and cause a KVM_RUN in between spapr_reset_vcpu() calling >>>>> cpu_reset() and it setting cs->halted (and the other stuff), >>>>> though. This function ought to be called with the iothread >>>>> lock held, so KVM_RUN will only happen if it calls some >>>>> other function which incorrectly lets the CPU run. >>>> >>>> Yeah, maybe it won't happen in practice. It just seems fragile. >>>> The same way ppc_cpu_reset() kicked the CPU by accident, code >>>> outside cpu_reset() might one day kick the CPU by accident before >>>> setting halted=1. >>> >>> I'm seeing the vcpu being KVM_RUN'd too early twice during hotplug. >>> Both of them are before cpu_reset() and ppc_cpu_reset(). >>> >>> Here's the backtrace for the first of them (redacted for clarity): >>> >>> #0 in cpu_resume () >>> #1 in cpu_common_realizefn () >>> #2 in ppc_cpu_realize () >>> #3 in device_set_realized () >>> #4 in property_set_bool () >>> #5 in object_property_set () >>> #6 in object_property_set_qobject () >>> #7 in object_property_set_bool () >>> #8 in qdev_realize () >> <snip> >>> #18 in qmp_device_add () >> >> Is this a hotplug event? > > Yes, the way I reproduce the problem is starting a pseries guest with > `-smp 2,maxcpus=32,sockets=1,cores=16,threads=2` and then use qmp-shell to > send the command: > > device_add id=device-2 driver=host-spapr-cpu-core core-id=2 node-id=0 > >>> Here's the second: >>> >>> #0 in qemu_cpu_kick_thread () >>> #1 in qemu_cpu_kick () >>> #2 in queue_work_on_cpu () >>> #3 in async_run_on_cpu () >>> #4 in tlb_flush_by_mmuidx () >>> #5 in tlb_flush () >>> #6 in ppc_tlb_invalidate_all () >> >> FWIW tcg_flush_softmmu_tlb handles a tlb_flush in the common reset code. > > Ok, maybe KVM should be doing that too? Or maybe it does but pseries > isn't relying on it. I'll dig further. No tlb flush is a softmmu only thing. -- Alex Bennée