On 2015-11-02 17:14:12, Laszlo Ersek wrote: > On 11/02/15 23:49, Laszlo Ersek wrote: > > On 11/02/15 23:29, Jordan Justen wrote: > >> On 2015-11-02 12:16:12, Laszlo Ersek wrote: > >>> On 11/02/15 19:53, Alex Williamson wrote: > >>>> On Mon, 2015-11-02 at 19:31 +0100, Laszlo Ersek wrote: > >>>>> On 10/31/15 18:50, Laszlo Ersek wrote: > >>>>> > >>>>>> I'm very sorry, but I don't think I can spend time on this, > >>>>>> unless someone gives me ssh and/or console access to a host that > >>>>>> readily reproduces the bug, with the latest kvm/master, qemu, and > >>>>>> ekd2 builds. > >>>>> > >>>>> We just got lucky, the problem reproduces on my oldie workstation > >>>>> (HP Z400, (R) Xeon(R) CPU W3550 @ 3.07GHz, > >>>>> family/model/stepping/microcode = 6/26/5/0x19). > >>>>> > >>>>> OVMF works fine with the most recent Fedora 21 host kernel, > >>>>> "kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new > >>>>> upstream Linux 4.3 release. > >>>>> > >>>>> The OVMF log contains a number of failed ASSERT()s, printed by the > >>>>> APs, in a way that is very characteristic of multi-threading bugs: > >>>>> > >>>>> --*-- > >>>>> > >>>>> Detect CPU count: 2 > >>>>> AASSSSEERRTT > >>>>> //hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9 > >>>>> 8L)o:c kLVoaclkuVea l=u=e (=(=U I(N(TUNI)N T2N)) |2|) L|o|c > >>>>> kLVoaclkuVea l=u=e (=(=U I(N(TUNI)N T1N))^M > >>>>> 1) > >>>>> ASSERT > >>>>> /home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198): > >>>>> LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1) > >>>>> > >>>>> --*-- > >>>>> > >>>>> (Note that the above is for a 8 VCPU guest. Cf. "Detect CPU count: > >>>>> 2".) > >>>>> > >>>>> > >>>>> Maybe important, maybe not: my workstation's IOMMU does *not* have > >>>>> snoop control: > >>>>> > >>>>> [ 3.043297] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap > >>>>> c90780106f0462 ecap f02076 > >>>>> > >>>>> Where 0xf02076 & 0x80 == 0. > >>>>> > >>>>> Whereas my laptop *does* have Snoop Control: > >>>>> > >>>>> [ 0.030486] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap > >>>>> c0000020660462 ecap f0101a > >>>>> [ 0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap > >>>>> d2008020660462 ecap f010da > >>>>> > >>>>> The first IOMMU doesn't have Snoop Control, but its scope only > >>>>> extends to the integrated Intel Graphics Device. The second IOMMU, > >>>>> which covers the SD card reader I used for testing (and all other > >>>>> PCI devices), *does* have Snoop Control. 0xf010da & 0x80 = 0x80. > >>>>> > >>>>> However, SC support should not matter, due to kernel commit > >>>>> fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be > >>>>> looking an independent issue. (Possibly in CpuDxe, only exposed by > >>>>> the new host > >>>>> kernel.) > >>>> > >>>> Snoop Control is what I immediately jumped to as well, but I've > >>>> never seen the problem on my desktop, which reports the following > >>>> ecaps: f0101a & f0105a. Thanks, > >>> > >>> This issue is actually very easy to reproduce on my workstation. It > >>> has 1 socket, 4 cores, 2 threads/core. If I start a guest with 32 > >>> VCPUs (i.e., heavily over-subscribing the host), then the issue hits > >>> immediately. Device assignment is not necessary. > >>> > >>> But, it also reproduces fairly regularly with just 8 VCPUs (even > >>> with VCPUs pinned to separate logical processors on the host). > >>> > >>> Judging purely from the KVM trace captured during such a 32-VCPU > >>> startup, I'm suspecting a general thread scheduling (or signaling) > >>> change sometime in the v4.1..v4.2 interval. It takes *extremely* > >>> long after the second SIPI (which should wake the AP threads) until > >>> the first "kvm_entry" message shows up for the last nonzero VCPU. > >>> > >>> Below is an excerpt. The first line shows the 2nd SIPI being > >>> delivered to the last VCPU. The lines after that show the first time > >>> that each nonzero VCPU reports in with "kvm_entry". > >>> > >>> qemu-system-x86-19114 [005] 6668.931183: kvm_apic_accept_irq: apicid > >>> 1f vec 159 (SIPI|edge) > >>> ... > >>> qemu-system-x86-19136 [006] 6668.931953: kvm_entry: vcpu 22 > >>> qemu-system-x86-19133 [003] 6668.931953: kvm_entry: vcpu 19 > >>> qemu-system-x86-19138 [007] 6668.931953: kvm_entry: vcpu 24 > >>> qemu-system-x86-19118 [004] 6668.934967: kvm_entry: vcpu 4 > >>> qemu-system-x86-19119 [006] 6668.941502: kvm_entry: vcpu 5 > >>> qemu-system-x86-19121 [004] 6668.944957: kvm_entry: vcpu 7 > >>> qemu-system-x86-19124 [001] 6668.944957: kvm_entry: vcpu 10 > >>> qemu-system-x86-19137 [000] 6668.948561: kvm_entry: vcpu 23 > >>> qemu-system-x86-19140 [002] 6668.948561: kvm_entry: vcpu 26 > >>> qemu-system-x86-19131 [003] 6668.948561: kvm_entry: vcpu 17 > >>> qemu-system-x86-19122 [007] 6668.948561: kvm_entry: vcpu 8 > >>> qemu-system-x86-19120 [001] 6668.964832: kvm_entry: vcpu 6 > >>> qemu-system-x86-19142 [002] 6668.964962: kvm_entry: vcpu 28 > >>> qemu-system-x86-19125 [000] 6668.964964: kvm_entry: vcpu 11 > >>> qemu-system-x86-19139 [007] 6668.964965: kvm_entry: vcpu 25 > >>> qemu-system-x86-19123 [002] 6668.968954: kvm_entry: vcpu 9 > >>> qemu-system-x86-19130 [000] 6668.968955: kvm_entry: vcpu 16 > >>> qemu-system-x86-19144 [000] 6668.974965: kvm_entry: vcpu 30 > >>> qemu-system-x86-19132 [003] 6668.974967: kvm_entry: vcpu 18 > >>> qemu-system-x86-19126 [002] 6668.981415: kvm_entry: vcpu 12 > >>> qemu-system-x86-19143 [003] 6668.981416: kvm_entry: vcpu 29 > >>> qemu-system-x86-19134 [000] 6668.981418: kvm_entry: vcpu 20 > >>> qemu-system-x86-19127 [003] 6669.002546: kvm_entry: vcpu 13 > >>> qemu-system-x86-19117 [000] 6669.004972: kvm_entry: vcpu 3 > >>> qemu-system-x86-19135 [003] 6669.004973: kvm_entry: vcpu 21 > >>> qemu-system-x86-19145 [002] 6669.004974: kvm_entry: vcpu 31 > >>> qemu-system-x86-19116 [005] 6669.005002: kvm_entry: vcpu 2 > >>> qemu-system-x86-19128 [000] 6669.014976: kvm_entry: vcpu 14 > >>> qemu-system-x86-19129 [002] 6669.015058: kvm_entry: vcpu 15 > >>> qemu-system-x86-19115 [002] 6669.029626: kvm_entry: vcpu 1 > >>> qemu-system-x86-19141 [005] 6669.034981: kvm_entry: vcpu 27 > >>> > >>> The last VCPU (#27) reports in 6669.034981 - 6668.931183 = 103.798 > >>> ms after the second SIPI's delivery is completed. > >>> > >>> That is about twice as long as the virtual BSP is willing to wait > >>> for the virtual AP detection to complete. (The default timeout is 50 > >>> ms.) > >> > >> This doesn't appear to be reproducing the same issue that Janusz is > >> seeing. > > > > You might be correct, but I can only work with what I can reproduce. > > > >> In that case even a *10 second* timeout was not enough for 8 > >> processors to start up. > > > > Note that I didn't try to await until the AP enumeration *completed* > > in the problematic cases. The only thing I "measured" in the above was > > the (nearly) *first* sign of life for each VCPU, after the > > INIT-SIPI-SIPI. Your timeout depends on the *last* relevant act of the > > "slowest" CPU. > > > > In any case, even if this issue is independent, the success vs. > > failure question is pretty clear on my workstation, wherever I am in > > the bisection. Assuming I end up with a relevant-looking culprit > > commit, we could do something about it, and ask Janusz to test it too. > > - I bisected v4.1..v4.3. > - I never assigned any device to the guest. In fact this was my > sophisticated test script: > > > OVMF_BIN=/home/vm-images/firmware/OVMF_CODE.fd > > VARSTORE=/root/vars.fd > > LOG=/root/ovmf.log > > > > /usr/bin/qemu-system-x86_64 \ > > -machine accel=kvm \ > > -smp cpus=10 \ > > -m 4096 \ > > -device qxl-vga \ > > -drive if=pflash,format=raw,unit=0,file="$OVMF_BIN",readonly=on \ > > -drive if=pflash,format=raw,unit=1,file="$VARSTORE" \ > > -debugcon file:"$LOG" \ > > -global isa-debugcon.iobase=0x402 \ > > -net none > > As I said, my workstation has 8 logical PCPUs (1*4*2). Using the above > minimal oversubscription (10 VCPUs), and requiring 10 successful OVMF > boots to qualify a kernel build "good" during bisection, I got *very > clear* decisions for each build, and the result is: > > > commit b18d5431acc7a2fd22767925f3a6f597aa4bd29e > > Author: Xiao Guangrong <guangrong.x...@linux.intel.com> > > Date: Mon Jun 15 16:55:21 2015 +0800 > > > > KVM: x86: fix CR0.CD virtualization > > > > Currently, CR0.CD is not checked when we virtualize memory cache type > > for > > noncoherent_dma guests, this patch fixes it by : > > > > - setting UC for all memory if CR0.CD = 1 > > - zapping all the last sptes in MMU if CR0.CD is changed > > > > Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com> > > Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > > > > :040000 040000 a08dd8837198d53aab75569607c23e8dd05bb257 > > 25c21bcc6ca57f76477e7d0c540213909d9d000b M arch > > Here's the bisection log: > > > git bisect start > > # bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3 > > git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861 > > # good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1 > > git bisect good b953c0d234bc72e8489d3bf51a276c5c4ec85345 > > # bad: [31191a85fb875cf123cea56bbfd34f4b941f3c79] perf report: Add support > > for srcfile sort key > > git bisect bad 31191a85fb875cf123cea56bbfd34f4b941f3c79 > > # bad: [5b4ca4447757019f11a601b0009534ef84bed801] Merge tag 'media/v4.2-1' > > of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media > > git bisect bad 5b4ca4447757019f11a601b0009534ef84bed801 > > # bad: [8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2] mm: > > kmemleak_alloc_percpu() should follow the gfp from per_alloc() > > git bisect bad 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2 > > # good: [10b4b096d0c7e9f1b5f84c2a0658b2963e1e6ed0] Merge tag 'gpio-v4.2-1' > > of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio > > git bisect good 10b4b096d0c7e9f1b5f84c2a0658b2963e1e6ed0 > > # good: [36a1624d8844b6c165daf61649e6b68c02d0835f] Merge tag 'for-4.2' of > > git://git.infradead.org/battery-2.6 > > git bisect good 36a1624d8844b6c165daf61649e6b68c02d0835f > > # good: [4b1f2af6752a4cc9acc1c22ddf3842478965f113] Merge branch 'for-linus' > > of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux > > git bisect good 4b1f2af6752a4cc9acc1c22ddf3842478965f113 > > # good: [08d183e3c1f650b4db1d07d764502116861542fa] Merge tag > > 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux > > git bisect good 08d183e3c1f650b4db1d07d764502116861542fa > > # bad: [4e241557fc1cb560bd9e77ca1b4a9352732a5427] Merge tag 'for-linus' of > > git://git.kernel.org/pub/scm/virt/kvm/kvm > > git bisect bad 4e241557fc1cb560bd9e77ca1b4a9352732a5427 > > # good: [0ad83caa21bff9f383c10e73b452425709573042] x86: kvmclock: set > > scheduler clock stable > > git bisect good 0ad83caa21bff9f383c10e73b452425709573042 > > # good: [05fe125fa3237de2ec5bada80031e694de78909c] Merge tag > > 'kvm-arm-for-4.2' of > > git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD > > git bisect good 05fe125fa3237de2ec5bada80031e694de78909c > > # bad: [19efffa244071ccd0385b240d03adb38feaab04e] KVM: MTRR: sort variable > > MTRRs > > git bisect bad 19efffa244071ccd0385b240d03adb38feaab04e > > # bad: [eb839917a75207b89799e3500211163cb6de0dea] KVM: MTRR: handle > > MSR_MTRRcap in kvm_mtrr_get_msr > > git bisect bad eb839917a75207b89799e3500211163cb6de0dea > > # good: [0b8ba4a2b65860ac1fcd5be6cfca5d3b27a5b170] KVM: fix checkpatch.pl > > errors in kvm/coalesced_mmio.h > > git bisect good 0b8ba4a2b65860ac1fcd5be6cfca5d3b27a5b170 > > # bad: [b18d5431acc7a2fd22767925f3a6f597aa4bd29e] KVM: x86: fix CR0.CD > > virtualization > > git bisect bad b18d5431acc7a2fd22767925f3a6f597aa4bd29e > > # good: [f104765b4f81fd74d69e0eb161e89096deade2db] KVM: nSVM: Check for > > NRIPS support before updating control field > > git bisect good f104765b4f81fd74d69e0eb161e89096deade2db > > # first bad commit: [b18d5431acc7a2fd22767925f3a6f597aa4bd29e] KVM: x86: > > fix CR0.CD virtualization > > (Side note: I had to fix up one of the early builds with "git > cherry-pick --no-commit 82819ffb42fb45197bacf3223191deca31d3eb91", > because otherwise the build failed, but that's not relevant here.) > > So, what does this mean? Because we've *always* known about this commit > (after Xiao told us about it)! Importantly, what does it mean that the > bug reproduces *without* an assigned device, and even in *v4.3*, where > Xiao's fixup > > > commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49 > > Author: Xiao Guangrong <guangrong.x...@intel.com> > > Date: Thu Jul 16 03:25:56 2015 +0800 > > > > KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED > > is in place (starting with v4.2-rc4)? > > It means that the original commit b18d5431acc7 modified *two* functions > (namely vmx_get_mt_mask() and kvm_set_cr0()), and that commit fb279950ba > fixed up *only one* of them (vmx_get_mt_mask()). > > This code remains unchanged in v4.3, from b18d5431acc7: > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 43f0df7..43fdb10 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -621,6 +621,10 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long > > cr0) > > > > if ((cr0 ^ old_cr0) & update_bits) > > kvm_mmu_reset_context(vcpu); > > + > > + if ((cr0 ^ old_cr0) & X86_CR0_CD) > > + kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL); > > + > > return 0; > > } > > EXPORT_SYMBOL_GPL(kvm_set_cr0); > > Xiao actually pointed this out (look at the first paragraph): > > > Kernel commit b18d5431acc causes the vCPUs need more time to startup > > as: > > - it zaps all the mappings for the guest memory in EPT or shadow page > > table, it requires VM-exits to rebuild the mappings for all memory > > access. > > > > - if there is device passthrough-ed in guest and IOMMU lacks snooping > > control feature, the memory will become UC after CR0.CD is set to > > 1. > > We've been focusing on the second paragraph (vmx_get_mt_mask()), but > apparently the kvm_set_cr0() change, which was not quirked, has a large > impact on some hosts. > > To me an interesting aspect of the above hunk is that *any one* VCPU > flipping its CR0.CD will re-set nested paging *globally*, which will > likely impact all other VCPUs as well, I guess. > > Plus, with kvm_zap_gfn_range() taking the guest-global kvm->mmu_lock > spin-lock, perhaps there is a bit of host-side lock contention too, when > the APs start up all at once. > > Anyway, with the following host kernel change, the AP startup problem > goes away (tested on top of v4.3): > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 9a9a198..4f978ad 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -622,7 +622,8 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long > > cr0) > > if ((cr0 ^ old_cr0) & update_bits) > > kvm_mmu_reset_context(vcpu); > > > > - if ((cr0 ^ old_cr0) & X86_CR0_CD) > > + if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED) && > > + (cr0 ^ old_cr0) & X86_CR0_CD) > > kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL); > > > > return 0; > > (Honestly I just imitated fb279950ba here; I'm not making any better > argument for this diff. But, independently, I wonder why this hunk > didn't have the noncoherent DMA check either, originally.) > > Janusz, could you rebuild your host kernel with this patch, and share > the results? (I'm also attaching the same as a formatted patch, so you > can apply it with "git am" easily.)
One note related to decompression caching. I also created this patch: https://github.com/jljusten/edk2/commit/7a1d3799 In my https://github.com/jljusten/edk2/tree/decompress-mtrr branch. Infernix tried it and it didn't seem to totally fix the issue. (See comments on https://github.com/tianocore/edk2/issues/21) I don't know if Janusz got a chance to try it. I didn't clean it up and send it out, since I didn't have any evidence that it helped... -Jordan _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel