On 10/30/15 14:39, Laszlo Ersek wrote:
> On 10/30/15 14:04, Janusz Mocek wrote:
>> On 30.10.2015 at 13:26, Laszlo Ersek wrote:
>>> CC'ing Xiao and Alex again.
>>>
>>> On 10/29/15 19:39, Jordan Justen wrote:
>>>> On 2015-10-29 04:45:37, Laszlo Ersek wrote:
>>>>> On 10/29/15 02:32, Jordan Justen wrote:
>>>>>> + ASSERT (MaxProcessors > 0);
>>>>>> + PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
>>>>> I think that when this branch is active, then
>>>>> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
>>>>> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
>>>>> this hint is available from QEMU, then we should practically disable
>>>>> the timeout option in CpuDxe's AP counting.
>>>> I think this is a good idea, but I don't think 71 minutes is useful.
>>>> Perhaps 30 seconds? This seems more than adequate for hundreds of
>>>> processors to start up. Or perhaps some timeout based on the number of
>>>> processors?
>>>>
>>>> Janusz and I were discussing
>>>> https://github.com/tianocore/edk2/issues/21 on irc. We increased the
>>>> timeout to 10 seconds, and with only 8 processors it was still timing
>>>> out.
>>>>
>>>> Obviously we are somehow failing to start the processors correctly, or
>>>> QEMU/KVM is doing something wrong.
>>>>
>>>> Have you been able to reproduce this issue? It seems like we need to
>>>> set the timeout to 71 minutes, and then debug QEMU/KVM to see what
>>>> state the APs are in...
>>>>
>>>> Unfortunately I haven't yet been able to reproduce the bug on my
>>>> system. :(
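For reference, the "~71 minutes" figure follows from the width of the PCD: a 32-bit count of microseconds tops out at 2^32 - 1 us. Below is a minimal sketch of that arithmetic in plain C. The helper names are hypothetical (they are not edk2 APIs), and UINT32_MAX stands in for edk2's MAX_UINT32.

```c
#include <stdint.h>

/* Hypothetical helper, not an edk2 API: convert a timeout given in
 * microseconds (the unit of PcdCpuApInitTimeOutInMicroSeconds) to
 * whole minutes, rounding down. */
uint32_t timeout_us_to_minutes(uint32_t timeout_us)
{
    return timeout_us / (60u * 1000u * 1000u);
}

/* Hypothetical helper: convert whole seconds to a 32-bit microsecond
 * timeout, saturating at UINT32_MAX instead of wrapping around. */
uint32_t seconds_to_timeout_us(uint32_t seconds)
{
    uint64_t us = (uint64_t)seconds * 1000000u;

    return (us > UINT32_MAX) ? UINT32_MAX : (uint32_t)us;
}
```

With these helpers, timeout_us_to_minutes(UINT32_MAX) evaluates to 71, and the 30 second proposal corresponds to seconds_to_timeout_us(30), that is, 30000000 us.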
>>> I've been staring at the following things for a few tens of minutes now:
>>>
>>> (1) Kernel commit b18d5431acc7. Note that the commit changes the return
>>> value of the vmx_get_mt_mask() function *exactly* in the following
>>> case:
>>>
>>> kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
>>> (kvm_read_cr0(vcpu) & X86_CR0_CD)
>>>
>>> The first sub-condition is satisfied by GPU passthrough / device
>>> assignment, I think; the second part depends on the VCPU having
>>> turned on (or having *left* on) CR0.CD.
>>>
>>> (2) Consult the vmx_vcpu_reset() function in "arch/x86/kvm/vmx.c"
>>> (current upstream). You will find:
>>>
>>> cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
>>> vmx_set_cr0(vcpu, cr0); /* enter rmode */
>>>
>>> Meaning a VCPU will start with CD and NW set, in real mode, after
>>> reset.
>>>
>>> This setting dates back to the birth of KVM:
>>>
>>> commit 6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7
>>> Author: Avi Kivity
>>> Date: Sun Dec 10 02:21:36 2006 -0800
>>>
>>> [PATCH] kvm: userspace interface
>>>
>>> Search that commit for "0x60000010" (the second hit, although the
>>> comment that contains the first hit is quite telling as well).
>>>
>>> (3) Consult the Intel SDM, Table 11-5. "Cache Operating Modes".
>>>
>>> The (CD, NW) == (1, 1) setting in CR0 is documented as:
>>> - "Memory coherency is not maintained."
>>> - "(P6 family and Pentium processors.) State of the processor after
>>> a power up or reset."
>>> - [in footnote 2] "The Pentium 4 and more recent processor families
>>> do not support this mode; setting the CD and NW bits to 1 selects
>>> the no-fill cache mode."
>>>
>>> In other words, the settings implemented by vmx_vcpu_reset()
>>> actually invoke the behavior of the "no-fill cache mode" (which is
>>> (CD, NW) == (1, 0)) for all practical purposes.
>>>
>>> (4) Same reference.
>>>
>>> The (CD, NW) == (1, 0) setting in CR0 is documented as:
>>> - "No-fill Cache Mode. Memory coherency is maintained."
>>> - "(Pentium 4 and later processor families.) State of processor
>>> after a power up or reset."
>>>
>>> (5) The AsmEnableCache() function in
>>> "MdePkg/Library/BaseLib/Ia32/EnableCache.c". It clears both CD and
>>> NW in CR0.
>>>
>>> (6) This setting ((CD, NW) == (0, 0)) is documented in the Intel SDM as:
>>> - "Normal Cache Mode. Highest performance cache operation."
>>>
>>> (7) The AsmEnableCache() function is invoked by MtrrLib
>>> [UefiCpuPkg/Library/MtrrLib/MtrrLib.c] after any and all MTRR
>>> changes. Consider:
>>>
>>> PostMtrrChange() | MtrrSetAllMtrrs()
>>> PostMtrrChangeEnableCache()
>>> AsmEnableCache()
>>>
>>> Where MtrrSetAllMtrrs() is a public function of the library; plus
>>> PostMtrrChange() is invoked by all of the following public
>>> functions:
>>>
>>> - MtrrSetMemoryAttribute()
>>> - MtrrSetVariableMtrr()
>>> - MtrrSetFixedMtrr()
>>>
>>> (8) Because we call MtrrLib in PlatformPei first, there are two
>>> consequences:
>>>
>>> (a) The boot VCPU has CR0.CD *set* in all parts of OVMF that run
>>> earlier than that.
>>>
>>> This caused a widely reported boot perf regression in SEC (the
>>> LZMA decompression). Ultimately another MTRR change in KVM was
>>> reverted, so (as far as I know) this symptom has not been seen
>>> recently. (In any case, we should probably fix this sometime...)
>>>
>>> (b) The other consequence is that the boot VCPU's CR0.CD is clear in
>>> the rest of OVMF. Which is what makes its speed acceptable, I
>>> guess (as long as no APs are started up).
>>>
>>> (9) Our AP startup code massages CR0, but only for mode switches. CR0.CD
>>> and CR0.NW are never touched.
>>>
>>> Now, I guess this could be easily added to the assembly encoded as a
>>> C array ("mStartupCodeTemplate" in "UefiCpuPkg/CpuDxe/ApStartup.c")
>>> -- when cr0 is massaged anyway, just clear bits 29 and 30 too; same
>>> as in AsmEnableCache().
>>>
>>> However, for testing the idea, perhaps the following one-liner
>>> suffices too -- this is the earliest an AP executes C code:
>>>
>>>> diff --git a/UefiCpuPkg/CpuDxe/CpuMp.c b/UefiCpuPkg/CpuDxe/CpuMp.c
>>>> index 3f56faa..e7f5b41 100644
>>>> --- a/UefiCpuPkg/CpuDxe/CpuMp.c
>>>> +++ b/UefiCpuPkg/CpuDxe/CpuMp.c
>>>> @@ -1451,6 +1451,8 @@ ApEntryPointInC (
>>>> VOID* TopOfApStack;
>>>> UINTN ProcessorNumber;
>>>>
>>>> + AsmEnableCache ();
>>>> +
>>>> if (!mAPsAlreadyInitFinished) {
>>>> FillInProcessorInformation (FALSE, mMpSystemData.NumberOfProcessors);
>>>> TopOfApStack = (UINT8*)mApStackStart + gApStackSize;
>>> This should clear CR0.CD, and "undo" kernel commit b18d5431acc7 for
>>> the AP (by falsifying the second subcondition seen in (1)).
>>>
>>> Janusz, can you please test this one-liner (with no other out-of-tree
>>> patch applied)?
>>>
>> Tested; it didn't solve the problem with the detected CPUs.
>
> Thanks for testing it. I'll try to reproduce the problem on my
> workstation next week.
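To make points (1), (3)-(7) and (9) of the quoted analysis concrete, here is a small stand-alone C sketch of the CR0.CD / CR0.NW handling. It never touches the real register; it only models the bit arithmetic. All names below are made up for illustration (they are neither the kernel's nor edk2's). The bit positions (ET = bit 4, NW = bit 29, CD = bit 30) are the architectural ones, and the mode names paraphrase the SDM quotes above.

```c
#include <stdint.h>

#define CR0_ET (UINT32_C(1) << 4)   /* Extension Type */
#define CR0_NW (UINT32_C(1) << 29)  /* Not Write-through */
#define CR0_CD (UINT32_C(1) << 30)  /* Cache Disable */

/* Illustrative names for the four (CD, NW) combinations. */
typedef enum {
    CACHE_NORMAL,   /* (0, 0): normal cache mode */
    CACHE_INVALID,  /* (0, 1): invalid setting, #GP on real hardware */
    CACHE_NO_FILL,  /* (1, 0): no-fill cache mode */
    CACHE_CD_NW     /* (1, 1): treated as no-fill on Pentium 4 and later */
} cache_mode;

/* Classify a CR0 value by its CD and NW bits. */
cache_mode cr0_cache_mode(uint32_t cr0)
{
    int cd = (cr0 & CR0_CD) != 0;
    int nw = (cr0 & CR0_NW) != 0;

    if (cd) {
        return nw ? CACHE_CD_NW : CACHE_NO_FILL;
    }
    return nw ? CACHE_INVALID : CACHE_NORMAL;
}

/* Model of what AsmEnableCache() does to CR0 according to (5):
 * clear both CD and NW, leave every other bit alone. */
uint32_t cr0_enable_cache(uint32_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}

/* Model of the condition quoted in (1): it holds only if the VM has
 * noncoherent DMA (assumed here to mean device assignment) and the
 * VCPU still has CR0.CD set. */
int mt_mask_condition_holds(int has_noncoherent_dma, uint32_t cr0)
{
    return has_noncoherent_dma && (cr0 & CR0_CD) != 0;
}
```

The reset value from (2), CR0_NW | CR0_CD | CR0_ET, is exactly the 0x60000010 mentioned there. Passing it through cr0_enable_cache() leaves 0x00000010, which classifies as CACHE_NORMAL and makes mt_mask_condition_holds() false even with an assigned device. That is the effect the one-liner was meant to have on each AP.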
* After reviewing the PCI devices and the IOMMU groups on my laptop, I
successfully assigned the following device to a guest:
02:00.0 SD Host controller: O2 Micro, Inc. SD/MMC Card Reader Controller (rev 01) (prog-if 01)
Subsystem: Lenovo Device 2211
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at b3901000 (32-bit, non-prefetchable) [size=4K]
Memory at b3900000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [6c] Power Management version 3
Capabilities: [48] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [200] Advanced Error Reporting
Capabilities: [230] Latency Tolerance Reporting
Kernel driver in use: vfio-pci
(With -n:
02:00.0 0805: 1217:8520 (rev 01) (prog-if 01)
Subsystem: 17aa:2211
)
* It is the sole device in its IOMMU group (13):
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
/sys/kernel/iommu_groups/3/devices/0000:00:03.0
/sys/kernel/iommu_groups/4/devices/0000:00:16.0
/sys/kernel/iommu_groups/5/devices/0000:00:19.0
/sys/kernel/iommu_groups/6/devices/0000:00:1a.0
/sys/kernel/iommu_groups/7/devices/0000:00:1b.0
/sys/kernel/iommu_groups/8/devices/0000:00:1c.0
/sys/kernel/iommu_groups/9/devices/0000:00:1c.1
/sys/kernel/iommu_groups/10/devices/0000:00:1c.4
/sys/kernel/iommu_groups/11/devices/0000:00:1d.0
/sys/kernel/iommu_groups/12/devices/0000:00:1f.0
/sys/kernel/iommu_groups/12/devices/0000:00:1f.2
/sys/kernel/iommu_groups/12/devices/0000:00:1f.3
/sys/kernel/iommu_groups/13/devices/0000:02:00.0
/sys/kernel/iommu_groups/14/devices/0000:03:00.0
* It is also not affected by any RMRR, according to the host dmesg.
* For completeness, the output of "lspci -tv":
-[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller
           +-01.0-[01]----00.0 NVIDIA Corporation GK107GLM [Quadro K1100M]
           +-02.0 Intel Corporation 4th Gen Core Processor Integrated Graphics Controller
           +-03.0 Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller
           +-16.0 Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1
           +-19.0 Intel Corporation Ethernet Connection I217-LM
           +-1a.0 Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2
           +-1b.0 Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller
           +-1c.0-[02]----00.0 O2 Micro, Inc. SD/MMC Card Reader Controller
           +-1c.1-[03]----00.0 Intel Corporation Wireless 7260
           +-1c.4-[06-3f]--
           +-1d.0 Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1
           +-1f.0 Intel Corporation QM87 Express LPC Controller
           +-1f.2 Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode]
           \-1f.3 Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller
* In the UEFI shell, the "PCI" command confirms that the firmware enumerates
the device fine:
Shell> pci
Seg Bus Dev Func
--- --- --- ----
...
00 00 0A 00 ==> Base System Peripherals - SD Host controller
Vendor 1217 Device 8520 Prog Interface 1
(The qemu command line parameter is:
-device vfio-pci,host=02:00.0,id=hostdev0,bus=pci.0,addr=0xa
)
* From the enumeration log itself:
PciBus: Discovered PCI @ [00|0A|00]
BAR[0]: Type = Mem32; Alignment = 0xFFF; Length = 0x1000; Offset = 0x10
BAR[1]: Type = Mem32; Alignment = 0xFFF; Length = 0x800; Offset = 0x14
...
PciBus: Resource Map for Root Bridge PciRoot(0x0)
...
Type = Mem32; Base = 0x80000000; Length = 0x1100000; Alignment = 0xFFFFFF
...
Base = 0x81000000; Length = 0x800; Alignment = 0xFFF; Owner = PCI [00|0A|00:14]
Base = 0x81001000; Length = 0x1000; Alignment = 0xFFF; Owner = PCI [00|0A|00:10]
* The VCPU topology for the guest is sockets=1, cores=4, threads=2 (8 logical
processors in total). All of them are detected:
Detect CPU count: 8
* I'm not seeing any delays or errors. Some details about my config:
QEMU: upstream at bc79082e4cd12c1241fa03b0abceacf45f537740
Kernel: kvm/master at ad355e383d826e3506c3caaa0fe991fd112de47b
(with git-describe: v4.3-rc3-20-gad355e3)
edk2: SVN r18690 / git d26a7a3fa251e1c2e93bdb834207643eabb847de
(none of the recent experimental patches are applied)
Host: Lenovo ThinkPad W541
CPU: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
(family 6, model 60, stepping 3, microcode 0x1c)
topology matches the above VCPU topology: 1*4*2
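As a cross-check of the resource map in the enumeration log above: the log prints "Alignment" as a mask (0xFFF means 4 KiB alignment), so a base is correctly aligned when (base & mask) == 0. The sketch below, in plain C with hypothetical function names (this is not PciBusDxe code), checks that the two BAR assignments are aligned, lie inside the Mem32 window, and do not overlap each other.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'base' honors an alignment given as a mask (e.g. 0xFFF). */
bool range_is_aligned(uint64_t base, uint64_t align_mask)
{
    return (base & align_mask) == 0;
}

/* True if [base, base + len) lies entirely inside
 * [outer_base, outer_base + outer_len). */
bool range_is_inside(uint64_t outer_base, uint64_t outer_len,
                     uint64_t base, uint64_t len)
{
    return base >= outer_base && base + len <= outer_base + outer_len;
}

/* True if the half-open ranges [a_base, a_base + a_len) and
 * [b_base, b_base + b_len) share at least one byte. */
bool ranges_overlap(uint64_t a_base, uint64_t a_len,
                    uint64_t b_base, uint64_t b_len)
{
    return a_base < b_base + b_len && b_base < a_base + a_len;
}
```

Fed with the logged values (window 0x80000000 + 0x1100000; BAR[1] at 0x81000000 + 0x800; BAR[0] at 0x81001000 + 0x1000), all three checks pass, which agrees with the observation that the firmware enumerates the device fine.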
I'm very sorry, but I don't think I can spend time on this, unless someone
gives me ssh and/or console access to a host that readily reproduces the bug,
with the latest kvm/master, qemu, and edk2 builds.
I have hard experimental evidence that direct access is the only way to analyze
such bugs. For example, a few years ago I struggled with a nasty bug related to
ixgbevf passthrough on Xen, for *months*, on and off. Once the reporter gave me
ssh access to the box, the bug went down in *one day*.
https://bugzilla.redhat.com/show_bug.cgi?id=862862#c85
...
https://bugzilla.redhat.com/show_bug.cgi?id=862862#c116
Thanks
Laszlo