On 7/17/23 12:37, Igor Mammedov wrote:
> On Mon, 17 Jul 2023 10:32:33 +0200
> Claudio Fontana <cfont...@suse.de> wrote:
> 
>> Hello Igor,
>>
>> thanks for getting back to me on this,
>>
>> On 7/14/23 11:51, Igor Mammedov wrote:
>>> On Wed, 5 Jul 2023 10:12:40 +0200
>>> Claudio Fontana <cfont...@suse.de> wrote:
>>>   
>>>> Hi all, partially resurrecting an old thread.
>>>>
>>>> I've seen how for Epyc something special was done in the past in terms of 
>>>> apicid assignments based on topology, which was then reverted apparently,
>>>> but I wonder if something more general would be useful to all?
>>>>
>>>> First of all, the QEMU apicid assignments do not seem to match what is
>>>> happening on real hardware.
>>>
>>> QEMU typically does generate valid APIC IDs;
>>> however, it doesn't do a good job with an odd number of cores and/or in
>>> NUMA-enabled cases.
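>>>
>>> For reference, a simplified sketch of the encoding scheme used in
>>> include/hw/i386/topology.h (names are illustrative, not the exact QEMU
>>> identifiers; dies are omitted):
>>>
>>>     /* bits needed to address 'count' items */
>>>     static unsigned bitwidth_for_count(unsigned count)
>>>     {
>>>         return count > 1 ? 32 - __builtin_clz(count - 1) : 0;
>>>     }
>>>
>>>     /* APIC ID = pkg_id | core_id | smt_id packed into bit fields */
>>>     static unsigned apicid_from_topo(unsigned nr_cores, unsigned nr_threads,
>>>                                      unsigned pkg_id, unsigned core_id,
>>>                                      unsigned smt_id)
>>>     {
>>>         unsigned smt_width  = bitwidth_for_count(nr_threads);
>>>         unsigned core_width = bitwidth_for_count(nr_cores);
>>>
>>>         return (pkg_id << (core_width + smt_width)) |
>>>                (core_id << smt_width) | smt_id;
>>>     }
>>>
>>> Field widths are rounded up to whole bits, so e.g. with cores=3,threads=1
>>> the core field takes 2 bits and socket 1 starts at APIC ID 4, leaving
>>> ID 3 unused; that is where the odd-core-count surprises come from.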
>>
>>
>> Right, this is what I meant: the QEMU assignment is generally a valid
>> choice; it just seems to differ from what (some) hardware/firmware does.
>>
>>> (That is what Babu attempted to fix, but it was eventually dropped for
>>> the reasons described in the quoted cover letter.)
>>>   
>>>> Functionally things are OK, but when trying to investigate issues,
>>>> specifically in the guest kernel KVM PV code (arch/x86/kernel/kvm.c),
>>>> in some cases the actual apicid values in relation to the topology do
>>>> matter,
>>>
>>> Care to point out specific places you are referring to?  
>>
>>
>> What we wanted to do was to reproduce an issue that only happened when
>> booting our distro in the cloud,
>> but did not appear when booting locally (neither on bare metal nor
>> under QEMU/KVM).
>>
>> In the end, after a lot of slow-turnaround research, the issue we
>> encountered turned out to be one that was already fixed:
>>  
>> commit c15e0ae42c8e5a61e9aca8aac920517cf7b3e94e
>> Author: Li RongQing <lirongq...@baidu.com>
>> Date:   Wed Mar 9 16:35:44 2022 +0800
>>
>>     KVM: x86: fix sending PV IPI
>>     
>>     If apic_id is less than min, and (max - apic_id) is greater than
>>     KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
>>     the new apic_id does not fit the bitmask.  In this case __send_ipi_mask
>>     should send the IPI.
>>     
>>     This is mostly theoretical, but it can happen if the apic_ids on three
>>     iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.
>>     
>>     Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest")
>>     Signed-off-by: Li RongQing <lirongq...@baidu.com>
>>     Message-Id: <1646814944-51801-1-git-send-email-lirongq...@baidu.com>
>>     Cc: sta...@vger.kernel.org
>>     Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>>
>>
>> But this took a very long time to investigate, because the KVM PV code with
>> the old unpatched algorithm only misbehaves during boot
>> if it encounters a specific sequence of APIC IDs,
>> where, contrary to the comment in the commit, the issue can become very
>> practical depending on the APIC ID assignments as seen by the guest KVM PV
>> code.
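>>
>> For illustration, here is a trimmed-down sketch of the pre-fix window
>> logic in __send_ipi_mask() (arch/x86/kernel/kvm.c), reduced to the
>> condition in question:
>>
>>     for_each_cpu(cpu, mask) {
>>         apic_id = per_cpu(x86_cpu_to_apicid, cpu);
>>         if (!ipi_bitmap) {
>>             min = max = apic_id;
>>         } else if (apic_id < min && max - apic_id < KVM_IPI_CLUSTER_SIZE) {
>>             ipi_bitmap <<= min - apic_id;    /* grow the window downwards */
>>             min = apic_id;
>>         } else if (apic_id < min + KVM_IPI_CLUSTER_SIZE) {
>>             /* pre-fix: also taken when apic_id < min but the window
>>              * cannot be grown, so apic_id - min underflows below;
>>              * the fix adds an 'apic_id > min' check here */
>>             max = apic_id < max ? max : apic_id;
>>         } else {
>>             /* flush the accumulated bitmap via hypercall, start over */
>>             ...
>>         }
>>         __set_bit(apic_id - min, (unsigned long *)&ipi_bitmap);
>>     }
>>
>> With the APIC ID sequence 1, KVM_IPI_CLUSTER_SIZE, 0 the third iteration
>> hits exactly that hole: 0 < min and max - 0 == KVM_IPI_CLUSTER_SIZE, so
>> the second branch is skipped, the third one is taken, and __set_bit()
>> runs with a negative index.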
>>
>>>
>>> KVM is not the only place where it might matter; it affects topo/numa code
>>> on the guest side as well.
>>>   
>>>> and currently there is no way (that I know of) of supplying our own apicid
>>>> assignment that more closely matches what happens on hardware.
>>>>
>>>> This has been an issue when debugging guest images in the cloud, where
>>>> being able to reproduce issues locally would be very beneficial as opposed
>>>> to using cloud images as the feedback loop;
>>>> unfortunately QEMU cannot currently create the right apicid values to
>>>> associate with the cpus.
>>>
>>> Indeed, the EPYC APIC encoding mess increases the support case load
>>> downstream, but as long as one has access to similar host hw, one should
>>> be able to reproduce the issue locally.
>>
>> Unfortunately this does not always seem to be the case, the case in
>> point being the kvm pv code, but I suspect other buggy guest code whose
>> behaviour depends on APIC IDs and APIC ID sequences must exist in more
>> areas of the kernel.
>>
>> In order to properly reproduce these kinds of issues locally, being able to
>> assign desired APIC IDs to cpus in VMs would come in very handy.
>>
>>
>>> However, I would expect the end result of such a support case to be
>>> advice to change the topology or use another CPU model.
>>>
>>> (What we lack is documentation of what works and what doesn't;
>>> perhaps writing guidelines would be sufficient to steer users
>>> towards the usable EPYC configurations.)
>>
>>
>> In our case we encountered issues on Intel too.
> 
> on Intel I've seen issues only in cases where the specified topo wasn't
> really supported by the given CPU model (Intel CPUs are far from perfect,
> and the kernel has a few quirks when it comes to handling topo,
> the same as it has quirks for AMD cpus).
> At the QEMU level we haven't bothered implementing topo quirks
> (modulo the EPYC attempt, which we later decided wasn't worth the
> complexity of supporting).
> 
> 
>>>   
>>>> Do I understand the issue correctly? Comments, ideas?
>>>> How receptive would the project be to changes aimed at providing a custom
>>>> assignment of apicids to cpus, regardless of Intel or AMD?
>>>
>>> It's not as simple as just setting a custom APIC ID in a register and
>>> being done with it;
>>
>>
>> right, I am under no illusion that this is going to be easy.
>>
>>
>>> you'll likely break things (off the top of my head: some CPUID leaves
>>> might depend on it, ACPI tables, NUMA mapping, KVM's vcpu_id).
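>>>
>>> For example, the initial APIC ID is guest-visible in at least two CPUID
>>> leaves, which would both have to follow any custom assignment; a minimal
>>> probe using GCC's cpuid.h:
>>>
>>>     #include <cpuid.h>
>>>
>>>     unsigned eax, ebx, ecx, edx;
>>>
>>>     __cpuid(1, eax, ebx, ecx, edx);
>>>     unsigned initial_apic_id = ebx >> 24;   /* leaf 01h, EBX[31:24] */
>>>
>>>     __cpuid_count(0x0b, 0, eax, ebx, ecx, edx);
>>>     unsigned x2apic_id = edx;               /* leaf 0Bh, EDX */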
>>>
>>> The current topo code aims to work on information based on '-smp'/'-numa',
>>> all throughout the QEMU codebase.
>>
>>
>> Just a thought: that "-smp, -numa" information could optionally be
>> enriched with additional info on apicid assignment.
>>
>>
>>> If however we were to let the user set the APIC ID (assuming it is
>>> somehow correct), we would need to take reverse steps to decode it
>>> (in a vendor-specific way) and incorporate the resulting topo into other
>>> code that uses topology info.
>>> That makes it quite messy, not to mention it's x86 (AMD) specific and
>>> doesn't fit well with generalizing topo handling.
>>> So I don't really like this route.
>>
>>
>> I don't think I am suggesting something like what is described in the
>> preceding paragraph;
>> rather, I would think that with the user providing the full apicid
>> assignment map (in addition to / as part of the -smp, -numa options),
>> all other pieces would be derived from that (I suppose the ACPI tables,
>> cpuid leaves, and everything that the guest could see, plus the internal
>> QEMU conversion functions between apicids and cpu index in topology.h).
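>>
>> Purely as an illustration, the kind of interface I have in mind would look
>> something like this (the -apicid-map option and its syntax are
>> hypothetical, nothing like it exists today):
>>
>>     -smp 4,sockets=2,cores=2,threads=1 \
>>     -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
>>     -apicid-map 0=0,1=1,2=16,3=17      (hypothetical: cpu_index=apicid)
>>
>> with everything derived from the APIC ID then generated consistently from
>> that map.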
> 
> As I read it, you described the same approach as me, just in different words.
> 
> The APIC ID is a function of -smp/-numa (and sometimes -cpu, which we ignore
> atm).
> Pushing the APIC ID up to a user-visible interface duplicates that
> information. That would be my main objection to the approach.


The user would have to provide at least that specific "function", i.e. how
those -smp and -numa settings map to apicids.


> 
> Also my educated guess tells me that it would further complicate the
> already hard-to-deal-with topo code.
>
> If you really wish QEMU to emulate the FOO hypervisor, create a dedicated
> board for it that behaves as FOO and that could generate the same
> topology based on -smp/-numa. (That approach might be acceptable,
> assuming it is self-contained and doesn't complicate common code a lot.)

Hmm, interesting.

> 
>>> (x86 cpus have an apic_id property, so theoretically you can set it
>>> and, with some minimal hacking, launch a guest, but then
>>> expect the guest to be unhappy when the APIC ID goes out of sync with
>>> everything else. I would do that only for the sake of an experiment
>>> and wouldn't try to upstream it.)
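>>>
>>> (A sketch of such an experiment; the "apic-id" property is real, but the
>>> hook location and the table are made up:
>>>
>>>     /* somewhere in the board's CPU creation path, before realize */
>>>     static const uint32_t my_custom_apicid[] = { 0, 1, 16, 17 };
>>>
>>>     object_property_set_int(OBJECT(cpu), "apic-id",
>>>                             my_custom_apicid[cpu_index], &error_fatal);
>>>
>>> expect breakage exactly in the places listed above.)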
>>
>> Right, it would all need to be consistent.
>>
>>>
>>> What I wouldn't mind is taking a 2nd stab at what Babu had tried to
>>> do, provided it manages to encode the APIC ID for EPYC correctly and
>>> won't complicate the code much (still using -smp/-numa as the root
>>> source for topo configuration).
>>
>>
>> For the specific use case I am thinking of (debugging with a guest-visible
>> topology that resembles a cloud one),
>> I don't think that the EPYC-specific work would be sufficient; it would in
>> any case need to be complemented with Intel work.
>>
>> But I suppose that a more general solution, with the user providing all the
>> mappings, would be the best and easiest one for this debugging scenario.
> 
> All of that for the sake of debugging some third-party hypervisor
> issues (which in my opinion doesn't benefit QEMU* at all).
> So do we really need this in the QEMU code base?
>
> I think of this use case as a one-off thingy, where I can quickly hack
> QEMU by hardcoding the desired specifics to get where I want, but not
> something that I would try to upstream, as it's not useful for the
> project in general.

I don't think debugging cloud issues will be a one-off; I expect this problem
to come up more and more.

in any case thanks,

Claudio

> 
> * by QEMU here I mean all the folks who have to read/touch/maintain
> this code (not to mention the poor users who decided to use this feature).
> 
>> Thanks for your thoughts,
>>
>> Claudio
>>>>
>>>>
>>>> On 9/1/20 17:57, Babu Moger wrote:  
>>>>> To support some of the complex topologies, we introduced the EPYC mode
>>>>> apicid decode.
>>>>> But the EPYC mode decode is running into problems, and it can become quite
>>>>> a maintenance problem in the future. So it was decided to remove that code
>>>>> and use the generic decode, which works for the majority of topologies.
>>>>> Most SPECed configurations would work just fine. With some non-SPECed user
>>>>> inputs, it will create a sub-optimal configuration.
>>>>>
>>>>> Here is the discussion thread.
>>>>> https://lore.kernel.org/qemu-devel/c0bcc1a6-1d84-a6e7-e468-d5b437c1b...@amd.com/
>>>>> https://lore.kernel.org/qemu-devel/20200826143849.59f69...@redhat.com/
>>>>>
>>>>> This series removes all the EPYC mode specific apicid changes and uses
>>>>> the generic apicid decode.
>>>>> ---
>>>>> v7:
>>>>>  Eduardo has already queued 1-8 from v6. Sending the rest of the patches.
>>>>>  Fixed CPUID 8000_001d based on Igor's comment, plus a few text changes.
>>>>>  
>>>>> v6:
>>>>>  
>>>>> https://lore.kernel.org/qemu-devel/159889924378.21294.16494070903874534542.st...@naples-babu.amd.com/
>>>>>  Found out that numa configuration is not mandatory for all the EPYC
>>>>>  model topologies.
>>>>>  We can use the generic decode, which works pretty well. Also noticed that
>>>>>  cpuid does not change when the numa nodes change (NPS - Nodes Per Socket).
>>>>>  Took care of a couple of comments from Igor and Eduardo.
>>>>>  Thank you Igor, Daniel, David, Eduardo for your feedback.
>>>>>
>>>>> v5:
>>>>>  
>>>>> https://lore.kernel.org/qemu-devel/159804762216.39954.15502128500494116468.st...@naples-babu.amd.com/
>>>>>  Revert EPYC specific decode.
>>>>>  Simplify CPUID_8000_001E
>>>>>
>>>>> v4:
>>>>>   
>>>>> https://lore.kernel.org/qemu-devel/159744083536.39197.13827776633866601278.st...@naples-babu.amd.com/
>>>>>   Not much of a change; just a few text changes.
>>>>>   Error out on the configuration instead of warning if dies are not
>>>>>   configured in EPYC.
>>>>>   A few other text changes to clarify the removal of node_id, nr_nodes and
>>>>>   nodes_per_pkg.
>>>>>
>>>>> v3:
>>>>>   
>>>>> https://lore.kernel.org/qemu-devel/159681772267.9679.1334429994189974662.st...@naples-babu.amd.com/#r
>>>>>   Added a new check to pass the dies for the EPYC numa configuration.
>>>>>   Added the "Simplify CPUID_8000_001E" patch with some changes suggested
>>>>>   by Igor.
>>>>>   Dropped the patch to build the topology from CpuInstanceProperties.
>>>>>   TODO: Not sure if we still need the Autonuma changes Igor mentioned.
>>>>>   Needs more clarity on that.
>>>>>
>>>>> v2:
>>>>>   
>>>>> https://lore.kernel.org/qemu-devel/159362436285.36204.986406297373871949.st...@naples-babu.amd.com/
>>>>>   Used the numa information from CpuInstanceProperties for building
>>>>>   the apic_id, as suggested by Igor.
>>>>>   Also did some minor code re-arrangement to take care of the changes.
>>>>>   Dropped the patch "Simplify CPUID_8000_001E" from v1. Will send
>>>>>   it later.
>>>>>
>>>>> v1:
>>>>>  
>>>>> https://lore.kernel.org/qemu-devel/159164739269.20543.3074052993891532749.st...@naples-babu.amd.com
>>>>>
>>>>> Babu Moger (2):
>>>>>       i386: Simplify CPUID_8000_001d for AMD
>>>>>       i386: Simplify CPUID_8000_001E for AMD
>>>>>
>>>>>
>>>>>  target/i386/cpu.c |  226 ++++++++++++++---------------------------------
>>>>>  1 file changed, 61 insertions(+), 165 deletions(-)
>>>>>
>>>>> --
>>>>>     
>>>>  
>>>   
>>
> 

