On 7/17/23 10:32, Claudio Fontana wrote:
> Hello Igor,
> 
> thanks for getting back to me on this,
> 
> On 7/14/23 11:51, Igor Mammedov wrote:
>> On Wed, 5 Jul 2023 10:12:40 +0200
>> Claudio Fontana <cfont...@suse.de> wrote:
>>
>>> Hi all, partially resurrecting an old thread.
>>>
>>> I've seen how, for EPYC, something special was done in the past in
>>> terms of APIC ID assignments based on topology, which was apparently
>>> then reverted, but I wonder if something more general would be useful
>>> to everyone?
>>>
>>> First of all, the QEMU APIC ID assignments do not seem to match what
>>> happens on real hardware.
>>
>> QEMU typically does generate valid APIC IDs; however, it doesn't do a
>> good job with an odd number of cores and/or in NUMA-enabled cases.
> 
> 
> Right, this is what I meant: the QEMU assignment is generally a valid
> choice, it just seems to differ from what (some) hardware/firmware does.
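> 
> To make the odd-cores case concrete, here is a minimal user-space
> sketch (my simplification, not the actual code in
> include/hw/i386/topology.h) of the power-of-two field encoding:
> 
>     #include <stdio.h>
> 
>     /* bits needed to represent values 0..count-1 */
>     static unsigned bits_for(unsigned count)
>     {
>         unsigned bits = 0;
> 
>         while ((1u << bits) < count)
>             bits++;
>         return bits;
>     }
> 
>     int main(void)
>     {
>         unsigned sockets = 2, cores = 3, threads = 1; /* odd cores */
>         unsigned smt_bits = bits_for(threads);
>         unsigned core_bits = bits_for(cores);
> 
>         for (unsigned s = 0; s < sockets; s++)
>             for (unsigned c = 0; c < cores; c++)
>                 for (unsigned t = 0; t < threads; t++)
>                     printf("socket %u core %u thread %u -> APIC ID %u\n",
>                            s, c, t,
>                            (s << (core_bits + smt_bits)) |
>                            (c << smt_bits) | t);
>         return 0;
>     }
> 
> Because each field is rounded up to a power of two, socket 0 gets APIC
> IDs 0,1,2 while socket 1 starts at 4, so ID 3 is never used; real
> firmware is free to number the gaps differently.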
> 
>> (That is what Babu had attempted to fix, but eventually that was
>> dropped for the reasons described in the quoted cover letter.)
>>
>>> Functionally things are OK, but when trying to investigate issues,
>>> specifically in the guest kernel KVM PV code (arch/x86/kernel/kvm.c),
>>> in some cases the actual APIC ID values in relation to the topology
>>> do matter,
>>
>> Care to point out specific places you are referring to?
> 
> 
> What we wanted to do was reproduce an issue that happened only when
> booting our distro in the cloud, but did not appear when booting
> locally (neither on bare metal nor under QEMU/KVM).
> 
> In the end, after a lot of slow-turnaround research, the issue we
> encountered turned out to be one that was already fixed:
>  
> commit c15e0ae42c8e5a61e9aca8aac920517cf7b3e94e
> Author: Li RongQing <lirongq...@baidu.com>
> Date:   Wed Mar 9 16:35:44 2022 +0800
> 
>     KVM: x86: fix sending PV IPI
>     
>     If apic_id is less than min, and (max - apic_id) is greater than
>     KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
>     the new apic_id does not fit the bitmask.  In this case __send_ipi_mask
>     should send the IPI.
>     
>     This is mostly theoretical, but it can happen if the apic_ids on three
>     iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.
>     
>     Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest")
>     Signed-off-by: Li RongQing <lirongq...@baidu.com>
>     Message-Id: <1646814944-51801-1-git-send-email-lirongq...@baidu.com>
>     Cc: sta...@vger.kernel.org
>     Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> 
> 
> But this took a very long time to investigate, because with the old,
> unpatched algorithm the KVM PV code only misbehaves during boot if it
> encounters a specific sequence of APIC IDs; contrary to the comment in
> the commit, the issue can become very practical depending on the APIC
> ID assignments as seen by the guest KVM PV code.
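> 
> To illustrate, here is a stand-alone sketch of the pre-fix clustering
> logic (paraphrased from __send_ipi_mask(), not the verbatim kernel
> code), fed with the exact APIC ID visiting order named in the commit
> message:
> 
>     #include <stdio.h>
> 
>     #define KVM_IPI_CLUSTER_SIZE 128 /* 2 * BITS_PER_LONG on x86_64 */
> 
>     int main(void)
>     {
>         unsigned apic_ids[] = { 1, KVM_IPI_CLUSTER_SIZE, 0 };
>         unsigned min = 0, max = 0;
>         int have_bitmap = 0;
> 
>         for (int i = 0; i < 3; i++) {
>             unsigned apic_id = apic_ids[i];
> 
>             if (!have_bitmap) {
>                 have_bitmap = 1;
>                 min = max = apic_id;
>             } else if (apic_id < min &&
>                        max - apic_id < KVM_IPI_CLUSTER_SIZE) {
>                 min = apic_id;        /* bitmap is shifted up */
>             } else if (apic_id < min + KVM_IPI_CLUSTER_SIZE) {
>                 /* pre-fix check: also true when apic_id < min! */
>                 max = apic_id < max ? max : apic_id;
>             } else {
>                 min = max = apic_id;  /* flush hypercall, restart */
>             }
> 
>             /* the kernel does __set_bit(apic_id - min, ...) here */
>             printf("apic_id=%3u -> bit index %d\n",
>                    apic_id, (int)apic_id - (int)min);
>         }
>         return 0;
>     }
> 
> On the third iteration the old condition still matches (0 < 1 +
> KVM_IPI_CLUSTER_SIZE), so the bit index underflows to -1 instead of
> the accumulated bitmap being flushed; the fix adds the missing
> apic_id > min guard.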
> 
>>
>> KVM is not the only place where it might matter; it affects topo/NUMA
>> code on the guest side as well.
>>
>>> and currently there is no way (that I know of) of supplying our own
>>> APIC ID assignment, more closely matching what happens on hardware.
>>>
>>> This has been an issue when debugging guest images in the cloud,
>>> where being able to reproduce issues locally would be very beneficial
>>> as opposed to using cloud images as the feedback loop, but
>>> unfortunately QEMU currently cannot create the right APIC ID values
>>> to associate with the CPUs.
>>
>> Indeed, the EPYC APIC ID encoding mess increases the support case
>> load downstream, but as long as one has access to similar host
>> hardware, one should be able to reproduce the issue locally.
> 
> Unfortunately this does not always seem to be the case, the case in
> point being the KVM PV code, but I suspect other buggy guest code
> whose behaviour depends on APIC IDs and APIC ID sequences must exist
> in other areas of the kernel.
> 
> In order to properly reproduce these kinds of issues locally, being
> able to assign the desired APIC IDs to CPUs in VMs would come in very
> handy.
> 
> 
>> However, I would expect the end result of such a support case to be
>> advice to change the topology or use another CPU model.
>>
>> (What we lack is documentation of what works and what doesn't;
>> perhaps writing guidelines would be sufficient to steer users to the
>> usable EPYC configurations.)
> 
> 
> In our case we encountered issues on Intel too.
> 
> 
>>
>>> Do I understand the issue correctly, comments, ideas?
>>> How receptive would the project be to changes aimed at providing a
>>> custom assignment of APIC IDs to CPUs, regardless of Intel or AMD?
>>
>> It's not that simple to just set a custom APIC ID in a register and
>> be done with it,
> 
> 
> Right, I am under no illusion that this is going to be easy.
> 
> 
>> you'll likely break things (off the top of my head: some CPUID leaves
>> might depend on it, ACPI tables, NUMA mapping, KVM's vcpu_id).
>>
>> The current topo code aims to work on information based on
>> '-smp'/'-numa' throughout the QEMU codebase.
> 
> 
> Just a thought: that "-smp"/"-numa" information could optionally be
> enriched with additional info on the APIC ID assignment.
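> 
> Purely as a strawman (the -apic-id-map option below does not exist;
> the name and syntax are made up for illustration), the configuration
> could look something like:
> 
>     qemu-system-x86_64 \
>         -smp 8,sockets=2,cores=4,threads=1 \
>         -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7 \
>         -apic-id-map 0=0,1=1,2=2,3=3,4=16,5=17,6=18,7=19
> 
> i.e. the usual -smp/-numa description, plus an explicit cpu-index ->
> APIC ID map matching what the cloud host exposes to its guests.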
> 
> 
>> If however we were to let the user set the APIC ID (one which is
>> somehow correct), we would need to take reverse steps to decode it
>> (in a vendor-specific way) and incorporate the resulting topo into
>> the other code that uses topology info.
>> That makes it quite messy, not to mention it's x86 (AMD) specific and
>> doesn't fit well with generalizing topo handling.
>> So I don't really like this route.
> 
> 
> I don't think I am suggesting something like what is described in the
> preceding paragraph; rather, I would think that with the user
> providing the full APIC ID assignment map (in addition to / as part
> of the -smp, -numa options), all the other pieces would be derived
> from that (I suppose the ACPI tables, CPUID leaves, and everything
> else the guest can see, plus the internal QEMU conversion functions
> between APIC IDs and CPU index in topology.h). A sketch of that
> decoding direction follows.
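> 
> This sketch uses assumed, fixed field widths; a real implementation
> would have to be vendor-aware, as you say, since the widths come from
> CPUID leaves (0xB/0x1F on Intel, 0x8000001E on AMD):
> 
>     #include <stdio.h>
> 
>     /* split an APIC ID into topology fields, given the bit widths */
>     static void decode_apic_id(unsigned apic_id,
>                                unsigned smt_bits, unsigned core_bits)
>     {
>         unsigned smt  = apic_id & ((1u << smt_bits) - 1);
>         unsigned core = (apic_id >> smt_bits) & ((1u << core_bits) - 1);
>         unsigned pkg  = apic_id >> (smt_bits + core_bits);
> 
>         printf("APIC ID %3u -> pkg %u core %u smt %u\n",
>                apic_id, pkg, core, smt);
>     }
> 
>     int main(void)
>     {
>         /* e.g. the cloud-like map from above: 1 thread per core,
>          * up to 16 cores per package */
>         unsigned map[] = { 0, 1, 2, 3, 16, 17, 18, 19 };
> 
>         for (int i = 0; i < 8; i++)
>             decode_apic_id(map[i], 0, 4);
>         return 0;
>     }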
> 
> 
>>
>> (x86 CPUs have an apic_id property, so theoretically you can set it
>> and, with some minimal hacking, launch a guest, but then expect the
>> guest to be unhappy when the APIC ID goes out of sync with everything
>> else. I would do that only for the sake of an experiment and wouldn't
>> try to upstream it.)
> 
> Right, it would all need to be consistent.
> 
>>
>> What I wouldn't mind is taking a second stab at what Babu had tried
>> to do, provided it manages to encode the APIC ID for EPYC correctly
>> and won't complicate the code much (while still using -smp/-numa as
>> the root source for the topo configuration).
> 
> 
> For the specific use case I am thinking of (debugging with a
> guest-visible topology that resembles a cloud one), I don't think the
> EPYC-specific work would be sufficient; it would in any case need to
> be complemented with Intel work.
> 
> But I suppose that a more general solution, with the user providing
> all the mappings, would be the best and easiest one for this
> debugging scenario.
> 
> Thanks for your thoughts,
> 
> Claudio

As a PS: I had a lot of typos where I wrote ACPI ID instead of APIC ID;
I hope it does not cause too much confusion.

Ciao,

C
