Re: cache write back barriers
On Wed, Jun 12, 2013 at 10:03:10AM +0200, folkert wrote:
In virt-manager I saw that there's the option for cache writeback for storage devices. I'm wondering: does this also make KVM ignore write barriers invoked by the virtual machine?

No, that would be unsafe. When the guest issues a flush, QEMU will ensure that the data reaches the disk with -drive cache=writeback.

Stefan

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
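From the guest's side, durability still has to be requested explicitly: with cache=writeback the host honours flushes, but only the flushes the guest actually sends. Below is a minimal sketch of a guest application doing that with POSIX fdatasync(). The function name write_durably() and the file name are illustrative, and whether the guest kernel turns fdatasync() into a cache flush on the virtual disk depends on its filesystem and block driver (assumed here to be configured with barriers/flushes enabled).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a buffer and ask the kernel to make it
 * durable before returning. In a guest with a flush-capable virtual disk,
 * fdatasync() is what ultimately produces the flush request that QEMU
 * forwards to host storage under -drive cache=writeback. */
static int write_durably(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	size_t len = strlen(data);
	ssize_t n = write(fd, data, len);

	/* A short write or a failed flush means the data is not known
	 * to be on stable storage. */
	if (n < 0 || (size_t)n != len || fdatasync(fd) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Without the fdatasync() call the write may sit in the guest page cache or the host cache indefinitely; that is true on bare metal with a volatile disk write cache as well.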
Re: cache write back barriers
Hi,

In virt-manager I saw that there's the option for cache writeback for storage devices. I'm wondering: does this also make KVM ignore write barriers invoked by the virtual machine?

No, that would be unsafe. When the guest issues a flush, QEMU will ensure that the data reaches the disk with -drive cache=writeback.

Aha, so writeback behaves like a consumer hard disk with its write cache enabled. In that case, maybe a note could be added to virt-manager (excellent software, by the way!) saying that write-back is safe if the guest VM supports barriers. Agree?

Folkert van Heusden
-- Ever wonder what is out there? Any alien races? Then please support the seti@home project: setiathome.ssl.berkeley.edu -- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
Re: Commit f9afbd45b0d0 broke mips r4k.
On Wed, Jun 12, 2013 at 09:35:16PM -0500, Rob Landley wrote:
My aboriginal linux project builds tiny linux systems to run under qemu, producing as close to the same system as possible across a bunch of different architectures.

The above change broke the mips r4k build I've been running under qemu. Here's a toolchain and reproduction sequence:

  wget http://landley.net/aboriginal/bin/cross-compiler-mips.tar.bz2
  tar xvjf cross-compiler-mips.tar.bz2
  export PATH=$PWD/cross-compiler-mips/bin:$PATH
  make ARCH=mips allnoconfig KCONFIG_ALLCONFIG=miniconfig.mips
  make CROSS_COMPILE=mips- ARCH=mips

(The file miniconfig.mips is attached.)

It ends:

  CC init/version.o
  LD init/built-in.o
  arch/mips/built-in.o: In function `local_r4k_flush_cache_page':
  c-r4k.c:(.text+0xe278): undefined reference to `kvm_local_flush_tlb_all'
  c-r4k.c:(.text+0xe278): relocation truncated to fit: R_MIPS_26 against `kvm_local_flush_tlb_all'
  arch/mips/built-in.o: In function `local_flush_tlb_range':
  (.text+0xe938): undefined reference to `kvm_local_flush_tlb_all'
  arch/mips/built-in.o: In function `local_flush_tlb_range':
  (.text+0xe938): relocation truncated to fit: R_MIPS_26 against `kvm_local_flush_tlb_all'
  arch/mips/built-in.o: In function `local_flush_tlb_mm':
  (.text+0xed38): undefined reference to `kvm_local_flush_tlb_all'
  arch/mips/built-in.o: In function `local_flush_tlb_mm':
  (.text+0xed38): relocation truncated to fit: R_MIPS_26 against `kvm_local_flush_tlb_all'
  kernel/built-in.o: In function `__schedule':
  core.c:(.sched.text+0x16a0): undefined reference to `kvm_local_flush_tlb_all'
  core.c:(.sched.text+0x16a0): relocation truncated to fit: R_MIPS_26 against `kvm_local_flush_tlb_all'
  mm/built-in.o: In function `use_mm':
  (.text+0x182c8): undefined reference to `kvm_local_flush_tlb_all'
  mm/built-in.o: In function `use_mm':
  (.text+0x182c8): relocation truncated to fit: R_MIPS_26 against `kvm_local_flush_tlb_all'
  fs/built-in.o:(.text+0x7b50): more undefined references to `kvm_local_flush_tlb_all' follow
  fs/built-in.o: In function `flush_old_exec':
  (.text+0x7b50): relocation truncated to fit: R_MIPS_26 against `kvm_local_flush_tlb_all'

Revert the above commit and it builds to the end.

Commit d414976d1ca721456f7b7c603a8699d117c2ec07 [MIPS: include: mmu_context.h: Replace VIRTUALIZATION with KVM] fixes the issue and was pulled by Linus only yesterday. I cannot reproduce the error following your recipe using the latest Linux/MIPS tree.

Ralf
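The link failure above is the usual symptom of a call site guarded by a broader config symbol than the one that controls whether the callee is built; judging from the commit title, the fix swaps a VIRTUALIZATION guard for a KVM one in mmu_context.h. Below is a minimal userspace model of that pattern, assuming nothing about the real header beyond the commit title: both flush functions are stubs here, and flush_for_new_asid_cycle() and start_new_asid_cycle() are hypothetical names.

```c
#include <assert.h>

/* Userspace model, not kernel code. Compile with -DCONFIG_KVM to model a
 * build in which the KVM code, and hence kvm_local_flush_tlb_all(),
 * is compiled in. */

static int tlb_flushes;

/* Stub for the native flush, which always exists. */
static void local_flush_tlb_all(void)
{
	tlb_flushes++;
}

#ifdef CONFIG_KVM
/* Stub for the KVM-specific flush, present only when KVM is built. */
static void kvm_local_flush_tlb_all(void)
{
	tlb_flushes++;
}
/* Guarded by the same symbol that builds the callee, so the reference
 * can never exist without the definition. */
#define flush_for_new_asid_cycle() kvm_local_flush_tlb_all()
#else
#define flush_for_new_asid_cycle() local_flush_tlb_all()
#endif

/* Hypothetical caller standing in for an ASID-rollover path. */
static int start_new_asid_cycle(void)
{
	flush_for_new_asid_cycle();
	return tlb_flushes;
}
```

Had the #ifdef tested a symbol that can be enabled while KVM itself is off, the first branch would be compiled against a function that is never built, which is exactly the "undefined reference" in the log.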
Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator
Hi Gleb,

I have been trying to solve these problems for the past few days and have run into many difficulties.

You want to save all the general registers when calling insn_page, so the registers should be saved to (save) in insn_page. Because all the instructions are generated outside and copied to insn_page, and the generated code is RIP-relative, inside insn_page (save) will be addressed incorrectly by the RIP-relative code.

I have tried moving (save) into insn_page. But when insn_page is called, data in it can only be read, and any instruction like xchg %%rax, 0+%[save] may cause an error, because at this point reads are served from the TLB while writes would be inconsistent with it.

Another way is disabling RIP-relative code, but I failed: when using -mcmodel=large -fno-pic, the binary still uses RIP-relative addressing. Is there any way to completely disable RIP-relative code? Besides, this feature may be specific to newer C compilers, so it may not be a good solution.

If we don't set %rsp and %rbp when executing the emulated code, we can just use push/pop to save the other general registers.

If you have any better solutions, please let me know.

Thanks,
Arthur

On Thu, Jun 13, 2013 at 12:50 PM, 李春奇 Arthur Chunqi Li yzt...@gmail.com wrote:
On Thu, Jun 13, 2013 at 4:50 AM, Paolo Bonzini pbonz...@redhat.com wrote:
On 06/06/2013 11:24, Arthur Chunqi Li wrote:
Add a function trap_emulator to run an instruction in the emulator. Set inregs first (%rax is invalid because it is used as the return address), put the instruction encoding in alt_insn and call the function with alt_insn_length. Get the results in outregs.
Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 x86/emulator.c | 81
 1 file changed, 81 insertions(+)

diff --git a/x86/emulator.c b/x86/emulator.c
index 96576e5..8ab9904 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -11,6 +11,14 @@ int fails, tests;
 static int exceptions;
+struct regs {
+	u64 rax, rbx, rcx, rdx;
+	u64 rsi, rdi, rsp, rbp;
+	u64 rip, rflags;
+};
+
+static struct regs inregs, outregs;
+
 void report(const char *name, int result)
 {
 	++tests;
@@ -685,6 +693,79 @@ static void test_shld_shrd(u32 *mem)
 	report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
 }
+static void trap_emulator(uint64_t *mem, uint8_t *insn_page,
+			  uint8_t *alt_insn_page, void *insn_ram,
+			  uint8_t *alt_insn, int alt_insn_length)
+{
+	ulong *cr3 = (ulong *)read_cr3();
+	int i;
+
+	// Pad with RET instructions
+	memset(insn_page, 0xc3, 4096);
+	memset(alt_insn_page, 0xc3, 4096);
+
+	// Place a trapping instruction in the page to trigger a VMEXIT
+	insn_page[0] = 0x89; // mov %eax, (%rax)
+	insn_page[1] = 0x00;
+	insn_page[2] = 0x90; // nop
+	insn_page[3] = 0xc3; // ret
+
+	// Place the instruction we want the hypervisor to see in the alternate page
+	for (i=0; i<alt_insn_length; i++)
+		alt_insn_page[i] = alt_insn[i];
+
+	// Save general registers
+	asm volatile(
+		"push %rax\n\r"
+		"push %rbx\n\r"
+		"push %rcx\n\r"
+		"push %rdx\n\r"
+		"push %rsi\n\r"
+		"push %rdi\n\r"
+		);

This will not work if GCC is using rsp-relative addresses to access local variables. You need to use mov instructions to load from inregs, and put the push/pop sequences inside the main asm that does the call *%1.

Is there any way to let gcc use absolute addresses to access variables? I moved the variable save to global scope and used xchg %%rax, 0+%[save], and it seems that the addressing for save is wrong.

Arthur

Paolo

+	// Load the code TLB with insn_page, but point the page tables at
+	// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
+	// This will make the CPU trap on the insn_page instruction but the
+	// hypervisor will see alt_insn_page.
+	install_page(cr3, virt_to_phys(insn_page), insn_ram);
+	invlpg(insn_ram);
+	// Load code TLB
+	asm volatile("call *%0" : : "r"(insn_ram + 3));
+	install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
+	// Trap, let hypervisor emulate at alt_insn_page
+	asm volatile(
+		"call *%1\n\r"
+
+		"mov %%rax, 0+%[outregs] \n\t"
+		"mov %%rbx, 8+%[outregs] \n\t"
+		"mov %%rcx, 16+%[outregs] \n\t"
+		"mov %%rdx, 24+%[outregs] \n\t"
+		"mov %%rsi, 32+%[outregs] \n\t"
+		"mov %%rdi, 40+%[outregs] \n\t"
+		"mov %%rsp, 48+%[outregs] \n\t"
+		"mov %%rbp, 56+%[outregs] \n\t"
+
+		/* Save RFLAGS in outregs*/
+		"pushf \n\t"
+		"popq 72+%[outregs] \n\t"
+		: [outregs]"+m"(outregs)
+		: "r"(insn_ram),
+		  "a"(mem), "b"(inregs.rbx),
+		  "c"(inregs.rcx), "d"(inregs.rdx),
+
Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with entry failed, hardware error 0x80000021
On 09.06.2013 11:43, Gleb Natapov wrote:
On Thu, Jun 06, 2013 at 02:10:39PM +0200, Stefan Pietsch wrote:
On 06.06.2013 13:40, Gleb Natapov wrote:
On Thu, Jun 06, 2013 at 01:35:13PM +0200, Stefan Pietsch wrote:
I had no success with the Debian kernel 3.10~rc4-1~exp1 (3.10-rc4-686-pae). The machine hangs after "Enabling APIC mode: Flat. Using 1 I/O APICs".

OK, since it looks like it hangs during timer initialization, can you try to disable kvmclock? Add -cpu qemu64,-kvmclock to your command line. Also, can you provide the output of cat /proc/cpuinfo on your host, and the complete serial output before the hang?

Command line:
  qemu-system-i386 -machine accel=kvm -m 512 -cpu qemu64,-kvmclock -cdrom grml32-full_2013.02.iso -serial file:ttyS0.log

ttyS0.log:
##

Nothing out of the ordinary here. Since you can reproduce the hang and I cannot, can you try to bisect it? Also, can you trace KVM during the hang (http://www.linux-kvm.org/page/Tracing)? Start the trace as close to the hang as possible and stop it as soon after it as possible, to keep the trace file small.

git bisect tells me:
79fd50c67f91136add9726fb7719b57a66c6f763 is the first bad commit

This is my bisect log:
  git bisect start
  git bisect bad 9626357371b519f2b955fef399647181034a77fe
  git bisect good ef4e359d9b9e2dc022f79840fd207796b524a893
  git bisect good b5c78e04dd061b776978dad61dd85357081147b0
  git bisect good 9e2d59ad580d590134285f361a0e80f0e98c0207
  git bisect bad 69086a78bdc973ec0b722be790b146e84ba8a8c4
  git bisect good 9ecf9b085a0926e07c78c08a07296bbfd1c37d07
  git bisect bad 21fbd5809ad126b949206d78e0a0e07ec872ea11
  git bisect bad 79fd50c67f91136add9726fb7719b57a66c6f763
  git bisect good 66cdd0ceaf65a18996f561b770eedde1d123b019
Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator
On 13/06/2013 05:30, 李春奇 Arthur Chunqi Li wrote:
If we don't set %rsp and %rbp when executing the emulated code, we can just use push/pop to save the other general registers.

%rbp should not be a problem; on the other hand, it's okay not to include %rsp in the registers struct (and to assume insn_page/alt_insn_page do not touch it).

Interestingly, both VMX and SVM put the guest RSP in the VM control information so that the switch occurs atomically with the start of the guest.

Paolo
Re: KVM MMU: why write-protect for the pages containing PML4/PDPT/PDT (page directory) of the guest?
On 12/06/2013 23:28, yongcheng...@i-soft.com.cn wrote:
I have a question about shadow page tables: why are the pages containing the guest's PML4/PDPT/PDT (the page directories) write-protected? In other words, is it necessary to synchronize changes to the guest's page directories?

Shadow page tables combine the host and guest page tables into a single translation, so they need to be updated every time either the host or the guest changes its page tables. Updates to the host page tables are tracked with MMU notifiers; updates to the guest page tables are tracked with write protection.

Paolo
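Paolo's explanation can be made concrete with a toy model: the hardware walks only the combined (shadow) table, so any change to either input table must invalidate the affected combined entries. The sketch is illustrative only, not KVM's implementation: real KVM shadows whole page-table pages, and the arrays and function names here are hypothetical. Guest writes are modelled as trapping because the page-table page is write-protected; host changes arrive through an MMU-notifier-style callback.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

static int guest_pt[NPAGES];     /* guest virtual page -> guest frame  */
static int host_map[NPAGES];     /* guest frame -> host physical frame */
static int shadow_pt[NPAGES];    /* guest virtual page -> host frame   */
static bool shadow_valid[NPAGES];

/* Hardware walks only the shadow table; a miss is filled by composing
 * the guest and host translations (the shadow page-fault path). */
static int translate(int vpn)
{
	if (!shadow_valid[vpn]) {
		shadow_pt[vpn] = host_map[guest_pt[vpn]];
		shadow_valid[vpn] = true;
	}
	return shadow_pt[vpn];
}

/* The guest's page-table page is write-protected, so a guest store to a
 * PTE traps here and the stale shadow entry is dropped. */
static void guest_writes_pte(int vpn, int gfn)
{
	guest_pt[vpn] = gfn;
	shadow_valid[vpn] = false;
}

/* Host-side change, as an MMU notifier would report it: drop every
 * shadow entry that was derived from the changed guest frame. */
static void host_remaps_gfn(int gfn, int pfn)
{
	host_map[gfn] = pfn;
	for (int vpn = 0; vpn < NPAGES; vpn++)
		if (guest_pt[vpn] == gfn)
			shadow_valid[vpn] = false;
}
```

Without the write protection, guest_writes_pte() would update guest_pt silently and translate() would keep returning the old host frame.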
Bottleneck in KVM
Hello all,

I am relatively new to KVM. I have installed a web server in a KVM virtual machine and am driving it at different request rates with httperf. On the bare host I can reach 6000 requests per second, but in the KVM guest the throughput does not increase beyond 3500 requests per second. I have checked the CPU usage of the different modules and none of them is CPU-bound; plenty of CPU in the VM remains idle. I suspect I am exhausting some buffer. Could you tell me what the problem and the bottleneck could be in this case?

Thanks and regards,
Ankit Anand
Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with entry failed, hardware error 0x80000021
On 13/06/2013 07:57, Stefan Pietsch wrote:
git bisect tells me:
79fd50c67f91136add9726fb7719b57a66c6f763 is the first bad commit

This is an s390 commit, so the bisect somehow went wrong. Can you confirm that 3.7 works and 3.8 doesn't?

Please check these pairs:
  9e2d59a and 89f883372fa60f604d136924baf3e89ff1870e9e
  39ab967 and 875b7679abbb232b584f2eec59fa6e45690dd6c4
  10b3866 and ea4a0ce11160200410abbabd44ec9e75e93a95be
  4ffd4eb and ccae663cd4f62890d862c660e5ed762eb9821c14
  896ea17 and 66cdd0ceaf65a18996f561b770eedde1d123b019

Please tell us which pair introduced the failure. Then:
- if you get a bad and bad pair, tell us and we'll figure out what's next :)
- if you get a good and bad pair, do a git bisect between the two commits in that pair.

Thanks!

Paolo
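Why one wrong answer derails a bisect: git bisect is a binary search that trusts every good/bad verdict, so a single mislabelled commit moves the search permanently to the wrong side. Below is a toy model over a linear history (real kernel history has merges, which git handles differently); bisect() and the two test oracles are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*is_bad_fn)(size_t idx);

/* Commit 0 is known good, commit n-1 is known bad, and is_bad(i) reports
 * the test verdict for commit i. Returns the first bad commit found. */
static size_t bisect(size_t n, is_bad_fn is_bad)
{
	size_t good = 0, bad = n - 1;   /* invariant: good is good, bad is bad */

	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;
		if (is_bad(mid))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}

/* The regression really enters at commit 37. */
static int honest(size_t i)
{
	return i >= 37;
}

/* Same history, but the very first test (commit 49) is wrongly reported
 * as good, e.g. because the wrong kernel was booted. */
static int one_mistake(size_t i)
{
	return i >= 37 && i != 49;
}
```

With 100 commits the honest run converges on 37, while the run with one mistake converges on 50, a commit that merely sits after the bad verdict and may be entirely unrelated, just as the s390 commit was here.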
Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with entry failed, hardware error 0x80000021
On 13/06/2013 09:42, Paolo Bonzini wrote:
On 13/06/2013 07:57, Stefan Pietsch wrote:
git bisect tells me: 79fd50c67f91136add9726fb7719b57a66c6f763 is the first bad commit

This is an s390 commit, so the bisect somehow went wrong. Can you confirm that 3.7 works and 3.8 doesn't?

Sorry, I meant that 3.8 works and 3.9 doesn't (66cdd0ceaf65a18996f561b770eedde1d123b019 was the 3.8 merge window update, and your bisect shows it as good).

Can you double-check this both with a plain modprobe kvm_intel and with modprobe kvm_intel emulate_invalid_guest_state=0?

Paolo
Re: [PATCH v2 0/7] target-arm: cpregs list for migration, kvm reset
On 3 June 2013 14:47, Peter Maydell peter.mayd...@linaro.org wrote:
This patch series overhauls how we handle ARM coprocessor registers, so that we use a consistent approach for migration, reset and QEMU-KVM synchronisation, driven by the kernel's list of supported registers.

Applied to target-arm.next. (If these were on somebody's to-review list, yell and I'll unapply them; but these plus v1 have been on the list for a fair while without attracting any particular interest.)

thanks
-- PMM
Re: cache write back barriers
I'm wondering: does this also make KVM ignore write barriers invoked by the virtual machine?

No, cache=writeback is OK: write barriers work correctly. Only with cache=unsafe does QEMU ignore write flushes.

----- Original Message -----
From: folkert folk...@vanheusden.com
To: kvm@vger.kernel.org
Sent: Wednesday, 12 June 2013 10:03:10
Subject: cache write back barriers

Hi,

In virt-manager I saw that there's the option for cache writeback for storage devices. I'm wondering: does this also make KVM ignore write barriers invoked by the virtual machine?

regards,
Folkert van Heusden
-- Always wondered what the latency of your webserver is? Or how much more latency you get when you go through a proxy server/tor? The numbers tell the tale and with HTTPing you know them! http://www.vanheusden.com/httping/ --- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
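The difference between cache=writeback and cache=unsafe comes down to whether guest flush requests are forwarded. Below is a small lookup table encoding how QEMU's -drive cache= shorthands expand into the cache.writeback, cache.direct and cache.no-flush options, as listed in the QEMU -drive documentation; the struct and the helper guest_flush_is_honoured() are illustrative, not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct cache_mode {
	const char *name;
	bool writeback;   /* disk is reported to the guest as having a write cache */
	bool direct;      /* host page cache is bypassed (O_DIRECT)               */
	bool no_flush;    /* guest flush requests are ignored                     */
};

static const struct cache_mode modes[] = {
	{ "writeback",    true,  false, false },
	{ "none",         true,  true,  false },
	{ "writethrough", false, false, false },
	{ "directsync",   false, true,  false },
	{ "unsafe",       true,  false, true  },
};

/* Returns true if a flush issued by the guest is passed on to the host's
 * storage; returns false for cache=unsafe and for unknown mode names. */
static bool guest_flush_is_honoured(const char *name)
{
	for (size_t i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
		if (strcmp(modes[i].name, name) == 0)
			return !modes[i].no_flush;
	return false;
}
```

Only the unsafe row sets no_flush, which matches the answer above: writeback, like a disk with a volatile write cache, is safe as long as the guest flushes.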
Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with entry failed, hardware error 0x80000021
On 13.06.2013 15:42, Paolo Bonzini wrote:
On 13/06/2013 07:57, Stefan Pietsch wrote:
git bisect tells me: 79fd50c67f91136add9726fb7719b57a66c6f763 is the first bad commit

This is an s390 commit, so the bisect somehow went wrong. Can you confirm that 3.7 works and 3.8 doesn't?

Confirmed. Something went wrong. I replayed the bisect log and now I have:
  git bisect bad 9626357371b519f2b955fef399647181034a77fe
  git bisect good ef4e359d9b9e2dc022f79840fd207796b524a893
  git bisect good b5c78e04dd061b776978dad61dd85357081147b0
  git bisect good 9e2d59ad580d590134285f361a0e80f0e98c0207
  git bisect bad 69086a78bdc973ec0b722be790b146e84ba8a8c4
  git bisect good 9ecf9b085a0926e07c78c08a07296bbfd1c37d07
  git bisect bad 21fbd5809ad126b949206d78e0a0e07ec872ea11
  git bisect bad 79fd50c67f91136add9726fb7719b57a66c6f763
  git bisect bad aa11e3a8a6d9f92c3fe4b91a9aca5d8c23d55d4d
  git bisect good 66cdd0ceaf65a18996f561b770eedde1d123b019
  git bisect bad d99e415275dd3f757b75981adad8645cdc26da45

So please wait for my results.
[PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator
Add a function trap_emulator to run an instruction in the emulator. Set inregs first (%rax is invalid because it is used as the return address), put the instruction encoding in alt_insn and call the function with alt_insn_length. Get the results in outregs.

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 x86/emulator.c | 132
 1 file changed, 132 insertions(+)

diff --git a/x86/emulator.c b/x86/emulator.c
index 96576e5..4981bfb 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -11,6 +11,16 @@ int fails, tests;
 static int exceptions;
+struct regs {
+	u64 rax, rbx, rcx, rdx;
+	u64 rsi, rdi, rsp, rbp;
+	u64 r8, r9, r10, r11;
+	u64 r12, r13, r14, r15;
+	u64 rip, rflags;
+};
+static struct regs inregs, outregs;
+extern struct regs save;
+
 void report(const char *name, int result)
 {
 	++tests;
@@ -685,6 +695,128 @@ static void test_shld_shrd(u32 *mem)
 	report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
 }
+extern u8 insn_start[], insn_end[];
+extern u8 insn_emulate_start[], insn_emulate_end[];
+
+static void mk_insn_page(uint8_t *insn_page, uint8_t *alt_insn_page,
+			 uint8_t *alt_insn, int alt_insn_length)
+{
+	int i, emul_offset;
+	for (i=1; i<insn_emulate_end - insn_emulate_start; i++)
+		insn_emulate_start[i] = 0x90; // nop
+	for (i=0; i<insn_end - insn_start; i++)
+		insn_page[i] = insn_start[i];
+	emul_offset = insn_emulate_start - insn_start;
+	for (i=0; i<alt_insn_length; i++)
+		alt_insn_page[i+emul_offset] = alt_insn[i];
+
+	asm volatile(
+		".pushsection .text.insn, \"ax\" \n\t"
+		"insn_start:\n\t"
+		"ret\n\t"
+
+		"push %%rax; push %%rbx\n\t"
+		"push %%rcx; push %%rdx\n\t"
+		"push %%rsi; push %%rdi\n\t"
+		"push %%rbp\n\t"
+		"push %%r8; push %%r9\n\t"
+		"push %%r10; push %%r11\n\t"
+		"push %%r12; push %%r13\n\t"
+		"push %%r14; push %%r15\n\t"
+		"pushf\n\t"
+
+		"push 136+%[save] \n\t"
+		"popf \n\t"
+		"mov 0+%[save], %%rax \n\t"
+		"mov 8+%[save], %%rbx \n\t"
+		"mov 16+%[save], %%rcx \n\t"
+		"mov 24+%[save], %%rdx \n\t"
+		"mov 32+%[save], %%rsi \n\t"
+		"mov 40+%[save], %%rdi \n\t"
+		"mov 56+%[save], %%rbp \n\t"
+		"mov 64+%[save], %%r8 \n\t"
+		"mov 72+%[save], %%r9 \n\t"
+		"mov 80+%[save], %%r10 \n\t"
+		"mov 88+%[save], %%r11 \n\t"
+		"mov 96+%[save], %%r12 \n\t"
+		"mov 104+%[save], %%r13 \n\t"
+		"mov 112+%[save], %%r14 \n\t"
+		"mov 120+%[save], %%r15 \n\t"
+
+		"insn_emulate_start:\n\t"
+		"in (%%dx),%%al\n\t"
+		". = . + 31\n\t"
+		"insn_emulate_end:\n\t"
+
+		"pushf \n\t"
+		"pop 136+%[save] \n\t"
+		"mov %%rax, 0+%[save] \n\t"
+		"mov %%rbx, 8+%[save] \n\t"
+		"mov %%rcx, 16+%[save] \n\t"
+		"mov %%rdx, 24+%[save] \n\t"
+		"mov %%rsi, 32+%[save] \n\t"
+		"mov %%rdi, 40+%[save] \n\t"
+		"mov %%rbp, 56+%[save] \n\t"
+		"mov %%r8, 64+%[save]\n\t"
+		"mov %%r9, 72+%[save]\n\t"
+		"mov %%r10, 80+%[save]\n\t"
+		"mov %%r11, 88+%[save]\n\t"
+		"mov %%r12, 96+%[save]\n\t"
+		"mov %%r13, 104+%[save]\n\t"
+		"mov %%r14, 112+%[save]\n\t"
+		"mov %%r15, 120+%[save]\n\t"
+
+		"popf\n\t"
+		"pop %%r15; pop %%r14 \n\t"
+		"pop %%r13; pop %%r12 \n\t"
+		"pop %%r11; pop %%r10 \n\t"
+		"pop %%r9; pop %%r8 \n\t"
+		"pop %%rbp \n\t"
+		"pop %%rdi; pop %%rsi \n\t"
+		"pop %%rdx; pop %%rcx \n\t"
+		"pop %%rbx; pop %%rax \n\t"
+
+		"ret\n\t"
+
+		"save:\n\t"
+		". = . + 256\n\t"
+		"insn_end:\n\t"
+		".popsection\n\t"
+		: [save]"=m"(save)
+		: : "memory", "cc"
+		);
+}
+
+static void trap_emulator(uint64_t *mem, uint8_t *insn_page,
+			  uint8_t *alt_insn_page, void *insn_ram,
+			  uint8_t* alt_insn, int alt_insn_length, int reserve_stack)
+{
+	ulong *cr3 = (ulong *)read_cr3();
+	extern u8 insn_start[];
+	int save_offset = (u8 *)(&save) - insn_start;
+
+	memset(insn_page, 0x90, 4096);
+	memset(alt_insn_page, 0x90, 4096);
+
+	save = inregs;
+	mk_insn_page(insn_page, alt_insn_page,
+		     alt_insn, alt_insn_length);
+
+	// Load the code TLB with insn_page, but point the page tables at
+	// alt_insn_page (and keep the data TLB clear, for AMD
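The inline asm addresses the save area with hard-coded offsets (0+%[save] through 136+%[save]), so those constants silently go stale if struct regs is reordered or grows. One way to pin them, offered as a suggestion rather than as part of the patch, is a set of C11 compile-time checks against the same struct layout; the u64 typedef below stands in for the test harness's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Same field order as the struct regs added by the patch. */
struct regs {
	u64 rax, rbx, rcx, rdx;
	u64 rsi, rdi, rsp, rbp;
	u64 r8, r9, r10, r11;
	u64 r12, r13, r14, r15;
	u64 rip, rflags;
};

/* Each check mirrors an offset used as "N+%[save]" in the asm, so a
 * layout change fails the build instead of corrupting registers. */
_Static_assert(offsetof(struct regs, rax) == 0, "rax offset");
_Static_assert(offsetof(struct regs, rdi) == 40, "rdi offset");
_Static_assert(offsetof(struct regs, rbp) == 56, "rbp offset");
_Static_assert(offsetof(struct regs, r8) == 64, "r8 offset");
_Static_assert(offsetof(struct regs, r15) == 120, "r15 offset");
_Static_assert(offsetof(struct regs, rflags) == 136, "rflags offset");
/* The asm reserves ". = . + 256" bytes for the save area. */
_Static_assert(sizeof(struct regs) <= 256, "fits the save area");
```

rsp sits at offset 48 but is never loaded or stored by the asm, consistent with the thread's decision to leave %rsp alone.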
[PATCH 2/2] kvm-unit-tests: Change two cases to use trap_emulator
Change two functions (test_mmx_movq_mf and test_movabs) to use the unified trap_emulator.

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 x86/emulator.c | 85 +++-
 1 file changed, 23 insertions(+), 62 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index 4981bfb..7698f56 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -826,73 +826,34 @@ static void advance_rip_by_3_and_note_exception(struct ex_regs *regs)
 static void test_mmx_movq_mf(uint64_t *mem, uint8_t *insn_page,
 			     uint8_t *alt_insn_page, void *insn_ram)
 {
-    uint16_t fcw = 0; // all exceptions unmasked
-    ulong *cr3 = (ulong *)read_cr3();
-
-    write_cr0(read_cr0() & ~6); // TS, EM
-    // Place a trapping instruction in the page to trigger a VMEXIT
-    insn_page[0] = 0x89; // mov %eax, (%rax)
-    insn_page[1] = 0x00;
-    insn_page[2] = 0x90; // nop
-    insn_page[3] = 0xc3; // ret
-    // Place the instruction we want the hypervisor to see in the alternate page
-    alt_insn_page[0] = 0x0f; // movq %mm0, (%rax)
-    alt_insn_page[1] = 0x7f;
-    alt_insn_page[2] = 0x00;
-    alt_insn_page[3] = 0xc3; // ret
-
-    exceptions = 0;
-    handle_exception(MF_VECTOR, advance_rip_by_3_and_note_exception);
-
-    // Load the code TLB with insn_page, but point the page tables at
-    // alt_insn_page (and keep the data TLB clear, for AMD decode assist).
-    // This will make the CPU trap on the insn_page instruction but the
-    // hypervisor will see alt_insn_page.
-    install_page(cr3, virt_to_phys(insn_page), insn_ram);
-    asm volatile("fninit; fldcw %0" : : "m"(fcw));
-    asm volatile("fldz; fldz; fdivp"); // generate exception
-    invlpg(insn_ram);
-    // Load code TLB
-    asm volatile("call *%0" : : "r"(insn_ram + 3));
-    install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
-    // Trap, let hypervisor emulate at alt_insn_page
-    asm volatile("call *%0" : : "r"(insn_ram), "a"(mem));
-    // exit MMX mode
-    asm volatile("fnclex; emms");
-    report("movq mmx generates #MF", exceptions == 1);
-    handle_exception(MF_VECTOR, 0);
+	uint16_t fcw = 0; // all exceptions unmasked
+	uint8_t alt_insn[] = {0x0f, 0x7f, 0x00}; // movq %mm0, (%rax)
+
+	write_cr0(read_cr0() & ~6); // TS, EM
+	exceptions = 0;
+	handle_exception(MF_VECTOR, advance_rip_by_3_and_note_exception);
+	asm volatile("fninit; fldcw %0" : : "m"(fcw));
+	asm volatile("fldz; fldz; fdivp"); // generate exception
+
+	inregs = (struct regs){ 0 };
+	trap_emulator(mem, insn_page, alt_insn_page, insn_ram,
+		      alt_insn, 3, 1);
+	// exit MMX mode
+	asm volatile("fnclex; emms");
+	report("movq mmx generates #MF", exceptions == 1);
+	handle_exception(MF_VECTOR, 0);
 }
 static void test_movabs(uint64_t *mem, uint8_t *insn_page,
 			uint8_t *alt_insn_page, void *insn_ram)
 {
-    uint64_t val = 0;
-    ulong *cr3 = (ulong *)read_cr3();
-
-    // Pad with RET instructions
-    memset(insn_page, 0xc3, 4096);
-    memset(alt_insn_page, 0xc3, 4096);
-    // Place a trapping instruction in the page to trigger a VMEXIT
-    insn_page[0] = 0x89; // mov %eax, (%rax)
-    insn_page[1] = 0x00;
-    // Place the instruction we want the hypervisor to see in the alternate
-    // page. A buggy hypervisor will fetch a 32-bit immediate and return
-    // 0xc3c3c3c3.
-    alt_insn_page[0] = 0x48; // mov $0xc3c3c3c3c3c3c3c3, %rcx
-    alt_insn_page[1] = 0xb9;
-
-    // Load the code TLB with insn_page, but point the page tables at
-    // alt_insn_page (and keep the data TLB clear, for AMD decode assist).
-    // This will make the CPU trap on the insn_page instruction but the
-    // hypervisor will see alt_insn_page.
-    install_page(cr3, virt_to_phys(insn_page), insn_ram);
-    // Load code TLB
-    invlpg(insn_ram);
-    asm volatile("call *%0" : : "r"(insn_ram + 3));
-    // Trap, let hypervisor emulate at alt_insn_page
-    install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
-    asm volatile("call *%1" : "=c"(val) : "r"(insn_ram), "a"(mem), "c"(0));
-    report("64-bit mov imm", val == 0xc3c3c3c3c3c3c3c3);
+	// mov $0xc3c3c3c3c3c3c3c3, %rcx
+	uint8_t alt_insn[] = {0x48, 0xb9, 0xc3, 0xc3, 0xc3,
+			      0xc3, 0xc3, 0xc3, 0xc3, 0xc3};
+	inregs = (struct regs){ .rbx = 0x5678, .rcx = 0x1234 };
+	trap_emulator(mem, insn_page, alt_insn_page, insn_ram,
+		      alt_insn, 10, 1);
+	report("64-bit mov imm2", outregs.rcx == 0xc3c3c3c3c3c3c3c3);
 }
 static void test_crosspage_mmio(volatile uint8_t *mem)
-- 
1.7.9.5
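The test_movabs case exercises REX.W + B8+r, the only x86 mov form that carries a full 64-bit immediate, so an emulator that fetches a 32-bit immediate here gets both the length and the value wrong. Below is a minimal decoder for just that encoding, as an illustration of what the emulator has to do; decode_movabs() is hypothetical and accepts only the plain 0x48 prefix (no REX.B, so %r8 to %r15 are out of scope).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode "movabs $imm64, %reg" (48 B8+r followed by 8 immediate bytes).
 * Returns the instruction length (10) and fills *reg and *imm, or returns
 * 0 if the bytes are not this instruction or are too short. An emulator
 * that treated the immediate as 32 bits would consume only 6 bytes and
 * lose the upper half of the value. */
static size_t decode_movabs(const uint8_t *insn, size_t avail,
			    int *reg, uint64_t *imm)
{
	if (avail < 10 || insn[0] != 0x48 || (insn[1] & 0xf8) != 0xb8)
		return 0;

	*reg = insn[1] & 7;               /* 0 = %rax, 1 = %rcx, ... */

	uint64_t v = 0;
	for (int i = 7; i >= 0; i--)      /* immediate is little-endian */
		v = (v << 8) | insn[2 + i];
	*imm = v;
	return 10;
}
```

For the patch's alt_insn bytes this yields register 1 (%rcx) and 0xc3c3c3c3c3c3c3c3, the value report() checks in outregs.rcx.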
[PATCH 2/2] armv7 initial device passthrough support
Updated device passthrough patch:
- optimized IRQ-CPU-vCPU binding; the IRQ is installed once
- added dynamic IRQ affinity on schedule-in
- added documentation and a few other coding recommendations

Per the earlier discussion, VFIO is our target, but we would like something to work with sooner to tackle performance/latency issues (some ARM-related) for device passthrough while we migrate towards VFIO.

- Mario

Signed-off-by: Mario Smarduch <mario.smard...@huawei.com>
---
 arch/arm/include/asm/kvm_host.h |  31 +
 arch/arm/include/asm/kvm_vgic.h |  10 ++
 arch/arm/kvm/Makefile           |   1 +
 arch/arm/kvm/arm.c              |  80 +
 arch/arm/kvm/assign-dev.c       | 248 +++
 arch/arm/kvm/vgic.c             | 134 +
 include/linux/irqchip/arm-gic.h |   1 +
 include/uapi/linux/kvm.h        |  33 ++
 8 files changed, 538 insertions(+)
 create mode 100644 arch/arm/kvm/assign-dev.c

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 57cb786..c85c3a0 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -67,6 +67,10 @@ struct kvm_arch {
 
 	/* Interrupt controller */
 	struct vgic_dist	vgic;
+
+	/* Device Passthrough Fields */
+	struct list_head	assigned_dev_head;
+	struct mutex		dev_passthrough_lock;
 };
 
 #define KVM_NR_MEM_OBJS     40
@@ -146,6 +150,13 @@ struct kvm_vcpu_stat {
 	u32 halt_wakeup;
 };
 
+struct kvm_arm_assigned_dev_kernel {
+	struct list_head list;
+	struct kvm_arm_assigned_device dev;
+	irqreturn_t (*irq_handler)(int, void *);
+	unsigned long vcpuid_irq_arg;
+};
+
 struct kvm_vcpu_init;
 int kvm_vcpu_set_target(struct kvm_vcpu *vcpu,
 			const struct kvm_vcpu_init *init);
@@ -157,6 +168,26 @@ int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 u64 kvm_call_hyp(void *hypfn, ...);
 void force_vm_exit(const cpumask_t *mask);
 
+#ifdef CONFIG_KVM_ARM_INT_PRIO_DROP
+int kvm_arm_get_device_resources(struct kvm *,
+				struct kvm_arm_get_device_resources *);
+int kvm_arm_assign_device(struct kvm *, struct kvm_arm_assigned_device *);
+void kvm_arm_setdev_irq_affinity(struct kvm_vcpu *vcpu, int cpu);
+#else
+static inline int kvm_arm_get_device_resources(struct kvm *k, struct kvm_arm_get_device_resources *r)
+{
+	return -1;
+}
+static inline int kvm_arm_assign_device(struct kvm *k, struct kvm_arm_assigned_device *d)
+{
+	return -1;
+}
+
+static inline void kvm_arm_setdev_irq_affinity(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+#endif
+
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm;
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
index 343744e..fb6afd2 100644
--- a/arch/arm/include/asm/kvm_vgic.h
+++ b/arch/arm/include/asm/kvm_vgic.h
@@ -107,6 +107,16 @@ struct vgic_dist {
 
 	/* Bitmap indicating which CPU has something pending */
 	unsigned long		irq_pending_on_cpu;
+
+	/* Device passthrough fields */
+	/* Host irq to guest irq mapping */
+	u8			guest_irq[VGIC_NR_SHARED_IRQS];
+
+	/* Pending passthruogh irq */
+	struct vgic_bitmap	passthrough_spi_pending;
+
+	/* At least one passthrough IRQ pending for some vCPU */
+	u32			passthrough_pending;
 #endif
 };
 
diff --git a/arch/arm/kvm/Makefile b/arch/arm/kvm/Makefile
index 53c5ed8..823fc38 100644
--- a/arch/arm/kvm/Makefile
+++ b/arch/arm/kvm/Makefile
@@ -21,3 +21,4 @@ obj-y += arm.o handle_exit.o guest.o mmu.o emulate.o reset.o
 obj-y += coproc.o coproc_a15.o mmio.o psci.o perf.o
 obj-$(CONFIG_KVM_ARM_VGIC) += vgic.o
 obj-$(CONFIG_KVM_ARM_TIMER) += arch_timer.o
+obj-$(CONFIG_KVM_ARM_INT_PRIO_DROP) += assign-dev.o
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 37d216d..ba54c64 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -26,6 +26,8 @@
 #include <linux/mman.h>
 #include <linux/sched.h>
 #include <linux/kvm.h>
+#include <linux/interrupt.h>
+#include <linux/ioport.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -43,6 +45,7 @@
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_coproc.h>
 #include <asm/kvm_psci.h>
+#include <asm/kvm_host.h>
 
 #ifdef REQUIRES_VIRT
 __asm__(".arch_extension virt");
@@ -139,6 +142,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	/* Mark the initial VMID generation invalid */
 	kvm->arch.vmid_gen = 0;
 
+	/*
+	 * Initialize Dev Passthrough Fields
+	 */
+	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
+	mutex_init(&kvm->arch.dev_passthrough_lock);
 	return ret;
 out_free_stage2_pgd:
@@ -169,6 +177,40 @@ int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
 	int i;
+	struct list_head
RE: Bottleneck in KVM
-----Original Message----- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of an...@ssl.serc.iisc.in Sent: Thursday, June 13, 2013 9:18 PM To: kvm@vger.kernel.org Subject: Bottleneck in KVM Hello all, I am relatively new to KVM. I have installed a web server in a KVM guest and am driving it at different request rates with httperf. On the bare host I can reach 6000 requests per second, but in the KVM guest throughput does not increase beyond 3500 requests per second. I have checked the CPU usage of the different modules and none of them is saturating the CPU; plenty of CPU in the VM remains idle. I suspect I may be exhausting some buffer. Please advise what the problem or bottleneck could be in this case. If your application is CPU-intensive, I think it is easy to achieve 90% of bare-metal performance in a KVM guest. 1. What is your QEMU command line for starting the guest? 2. What about your I/O? Is your service I/O-intensive? 3. What is your hardware? Thanks and Regards Ankit Anand
Re: [PATCH v3 0/6] KVM: MMU: fast invalidate all mmio sptes
On Fri, Jun 07, 2013 at 04:51:22PM +0800, Xiao Guangrong wrote: Changelog: V3: All of these changes are from Gleb's review: 1) rename RET_MMIO_PF_EMU to RET_MMIO_PF_EMULATE. 2) smartly adjust kvm generation number in kvm_current_mmio_generation() to avoid kvm_memslots->generation overflow. V2: - rename kvm_mmu_invalid_mmio_spte to kvm_mmu_invalid_mmio_sptes - use kvm->memslots->generation as kvm global generation-number - fix comment and codestyle - init kvm generation close to mmio wrap-around value - keep kvm_mmu_zap_mmio_sptes The current approach holds the hot mmu-lock and walks all shadow pages, which does not scale. This patchset introduces a very simple and scalable way to quickly invalidate all mmio sptes: it does not need to walk any shadow pages or hold any locks. Hi Xiao, - Where is the generation number increased? - Should use spinlock-breakable code in kvm_mmu_zap_mmio_sptes() (picture a guest with 512GB of RAM; even walking all those pages is expensive) (ah, the patch to remove kvm_mmu_zap_mmio_sptes does that). - Is -13 enough to test wraparound? It's highly likely the guest has not begun executing by the time 13 kvm_set_memory_calls are made (so no sptes around). Perhaps -2000 is more sensible (should confirm though). - Why remove if (change == KVM_MR_CREATE) || (change == KVM_MR_MOVE) from kvm_arch_commit_memory_region? It's instructive. Otherwise looks good.
Re: Re: KVM MMU: why write-protect for the pages containing PML4/PDPT/PDT (page directory) of the guest?
I got it, thank you for your help. 2013/6/13 Paolo Bonzini pbonz...@redhat.com On 12/06/2013 23:28, yongcheng...@i-soft.com.cn wrote: I have a question about shadow page tables: why are the pages containing the guest's PML4/PDPT/PDT (page directory) write-protected? In other words, do changes to the guest's page directory need to be synchronized? Shadow page tables are the combination of both the host and guest page tables into a single translation. So they need to be updated every time the host or the guest changes the page tables. Updates to the host page tables are tracked with MMU notifiers; updates to the guest page tables are tracked with write protection. Paolo
Book3s_hv KVM HTAB bug
Hi Paul, We've just seen another KVM bug with 3.8 on p7. It looks as if for some reason a bolted HTAB entry for the kernel got evicted. [ 16s] booting kvm ... [ 16s] /usr/bin/qemu-system-ppc64 -no-reboot -nographic -vga none -net none -enable-kvm -M pseries -cpu host -kernel /boot/vmlinux -initrd /boot/initrd -append root=/dev/vda panic=1 quiet no-kvmclock nmi_watchdog=0 rw elevator=noop console=hvc0 init=/.build/build -m 3072 -drive file=/obs/worker/root_1/root,if=virtio,cache=unsafe -drive file=/obs/worker/root_1/root,if=ide,index=0,cache=unsafe -drive file=/obs/worker/root_1.swap,if=virtio,cache=unsafe -smp 1 [ 16s] [ 16s] [ 16s] SLOF ** [ 16s] QEMU Starting [ 16s] Build Date = Jun 10 2013 17:00:23 [ 16s] FW Version = git-f564e52f4418d308 [ 16s] Press s to enter Open Firmware. [ 16s] [ 16s] C [ 16s] C0100 [ 17s] C0120 [ 17s] C0140 [ 17s] C0200 [ 17s] C0201 [ 17s] C0220 [ 17s] C0240 [ 17s] C0260 [ 17s] C0270 [ 17s] C02E0 [ 17s] C0300 [ 17s] C0320 [ 17s] C0360 [ 17s] C0370 [ 17s] C0371 [ 17s] C0372 [ 17s] C0373 [ 17s] C0374 [ 17s] C0390 [ 17s] C03F0 [ 17s] C0400 [ 17s] C0480 [ 17s] C04C0 [ 17s] C04D0 [ 17s] C0500 [ 17s] Populating /vdevice methods [ 17s] Populating /vdevice/vty@7100 [ 18s] Populating /vdevice/nvram@7101 [ 18s] [ 18s] NVRAM: size=65536, fetch=200E, store=200F [ 18s] Populating /vdevice/v-scsi@7102 [ 18s] VSCSI: Initializing [ 18s] VSCSI: Looking for devices [ 18s] 8200 CD-ROM : QEMU QEMU CD-ROM 1.5. [ 18s] C0580 [ 18s] C05A0 [ 18s] Populating /pci@8002000 [ 18s] Adapters on 08002000 [ 18s] 00 (D) : 1af4 1001virtio [ block ] [ 18s] 00 0800 (D) : 1af4 1001virtio [ block ] [ 18s] C0600 [ 18s] C0640 [ 18s] C0690 [ 18s] C06A0 [ 18s] C06A8 [ 18s] C06B0 [ 18s] C06B8 [ 18s] C06C0 [ 18s] C06E0 [ 18s] C0700 [ 18s] C0800 [ 18s] C0880 [ 18s] No NVRAM common partition, re-initializing... 
[ 18s] C0890 [ 18s] C08A0 [ 19s] C08A8 [ 19s] C08B0 [ 19s] C08C0 [ 19s] C08D0 [ 19s] Using default console: /vdevice/vty@7100 [ 19s] C08E0 [ 19s] C08E8 [ 19s] Detected RAM kernel at 40 (16185f0 bytes) C08FF [ 19s] [ 19s] Welcome to Open Firmware [ 19s] [ 19s] Copyright (c) 2004, 2011 IBM Corporation All rights reserved. [ 19s] This program and the accompanying materials are made available [ 19s] under the terms of the BSD License available at [ 19s] http://www.opensource.org/licenses/bsd-license.php [ 19s] [ 19s] Booting from memory... [ 19s] OF stdout device is: /vdevice/vty@7100 [ 19s] Preparing to boot Linux version 3.8.0-2-default (geeko@buildhost) (gcc version 4.5.0 20100414 [gcc-4_5-branch revision 158342] (SUSE Linux) ) #1 SMP Wed Feb 20 02:54:06 UTC 2013 (e252f7f) [ 19s] Detected machine type: 0101 [ 19s] Max number of cores passed to firmware: 1024 (NR_CPUS = 1024) [ 19s] Calling ibm,client-architecture-support... not implemented [ 19s] couldn't open /packages/elf-loader [ 19s] command line: root=/dev/vda panic=1 quiet no-kvmclock nmi_watchdog=0 rw elevator=noop console=hvc0 init=/.build/build [ 19s] memory layout at init: [ 19s] memory_limit : (16 MB aligned) [ 19s] alloc_bottom : 01a3 [ 19s] alloc_top: 1000 [ 19s] alloc_top_hi : c000 [ 19s] rmo_top : 1000 [ 19s] ram_top : c000 [ 19s] instantiating rtas at 0x0dbf... done [ 19s] Querying for OPAL presence... not there. [ 19s] boot cpu hw idx 0 [ 19s] copying OF device tree... [ 19s] Building dt strings... [ 19s] Building dt structure... [ 19s] Device tree strings 0x01d4 - 0x01d40774 [ 19s] Device tree struct 0x01d5 - 0x01d6 [ 19s] Calling quiesce... 
[ 19s] returning from prom_init [20500s] QEMU 1.5.0 monitor - type 'help' for more information [20500s] (qemu) [20505s] (qemu) info registers [20505s] NIP 0410 LR 00b31240 CTR XER [20505s] MSR 80001000 HID0 HF idx 1 [20505s] TB DECR [20505s] GPR00 00b31240 c128bde0 01288c40 [20505s] GPR04 c128bcc0 4001438795007015 70001194 [20505s] GPR08 2288 b0001032 c0005d00 [20505s] GPR12 800040001032 cff2 [20505s] GPR16 [20505s] GPR20 [20505s] GPR24 4000 c000
Re: Book3s_hv KVM HTAB bug
On Thu, Jun 13, 2013 at 02:34:56PM +0200, Alexander Graf wrote: Hi Paul, We've just seen another KVM bug with 3.8 on p7. It looks as if for some reason a bolted HTAB entry for the kernel got evicted. ... (gdb) x /i 0xc0005d00 0xc0005d00 instruction_access_common:andi. r10,r12,16384 (qemu) xp /i 0x5d00 0x5d00: andi. r10,r12,16384 (qemu) info tlb SLB ESID VSID 3 0xc800 0xc00838795000 So for some reason QEMU can still resolve the virtual address using the guest HTAB, but the CPU cannot. Otherwise the guest wouldn't get a 0x400 when accessing that page. When I've seen this sort of thing it has usually been that we failed to insert a HPTE in htab_bolt_mapping(), called from htab_initialize(). When that happens we BUG_ON(), which is stupid because it causes a program interrupt, and the first thing we do is turn the MMU on, but we don't have a linear mapping set up, so we start taking continual instruction storage interrupts (because the ISI handler also wants to turn on interrupts). Ben has an idea to fix that, which is to have IR and DR off in paca->kernel_msr until we're ready to turn the MMU on. That might help debuggability in the case you're hitting, whether or not it's htab_bolt_mapping failing. Are you *absolutely* sure that QEMU is using the guest HTAB to translate the 0xc... addresses? If it is actually doing so it would need to be using the relatively new KVM_PPC_GET_HTAB_FD ioctl, and I thought the only place that was used was in the migration code. To debug this sort of thing, what I usually do is patch the guest kernel to put a branch to self at 0x400. Then when it hangs you have some chance of sorting out what happened using info registers etc. I would be very interested to know how big a HPT the host kernel allocated for the guest and what was in it. The host kernel prints a message telling you the size and location of the HPT, and in this sort of situation I find it helpful to take a copy of it with dd and dump it with hexdump.
Also, what page size are you using in the host kernel? If it's 4k, then the guest kernel is limited to using 4k pages for the linear mapping, which can mean it runs out of space in the HPT for the linear mapping more easily. Since you don't have my patch to add a flexible allocator for the HPT and RMA areas (you rejected it, if you recall), you'll be limited to what you can allocate from the page allocator, which is usually 16MB, but may be less if free memory is low and/or fragmented. 16MB should be enough for a 3GB guest, particularly if you're using 64k pages in the host, but if the host was only able to allocate a much smaller HPT, that might explain the problem. Let me know if you discover anything further... Paul.
Re: Book3s_hv KVM HTAB bug
On 14.06.2013, at 01:20, Paul Mackerras wrote: On Thu, Jun 13, 2013 at 02:34:56PM +0200, Alexander Graf wrote: Hi Paul, We've just seen another KVM bug with 3.8 on p7. It looks as if for some reason a bolted HTAB entry for the kernel got evicted. ... (gdb) x /i 0xc0005d00 0xc0005d00 instruction_access_common:andi. r10,r12,16384 (qemu) xp /i 0x5d00 0x5d00: andi. r10,r12,16384 (qemu) info tlb SLB ESID VSID 3 0xc800 0xc00838795000 So for some reason QEMU can still resolve the virtual address using the guest HTAB, but the CPU cannot. Otherwise the guest wouldn't get a 0x400 when accessing that page. When I've seen this sort of thing it has usually been that we failed to insert a HPTE in htab_bolt_mapping(), called from htab_initialize(). When that happens we BUG_ON(), which is stupid because it causes a program interrupt, and the first thing we do is turn the MMU on, but we don't have a linear mapping set up, so we start taking continual instruction storage interrupts (because the ISI handler also wants to turn on interrupts). Ben has an idea to fix Ok, that makes sense and sounds like a reasonable possible failure scenario. Unfortunately the guest already got killed and right now everything's running again without any guest hanging. However, I did forget to also paste the dump of log_buf in my last email. Does that log coincide with what you would expect at this point?
00 00 00 00 00 00 00 00 00 4c 00 39 00 00 00 37 |.L.9...7| 0010 41 6c 6c 6f 63 61 74 65 64 20 39 31 37 35 30 34 |Allocated 917504| 0020 20 62 79 74 65 73 20 66 6f 72 20 31 30 32 34 20 | bytes for 1024 | 0030 70 61 63 61 73 20 61 74 20 63 30 30 30 30 30 30 |pacas at c00| 0040 30 30 66 66 32 30 30 30 30 00 00 00 00 00 00 00 |00ff2...| 0050 00 00 00 00 00 34 00 21 00 00 00 36 55 73 69 6e |.4.!...6Usin| 0060 67 20 70 53 65 72 69 65 73 20 6d 61 63 68 69 6e |g pSeries machin| 0070 65 20 64 65 73 63 72 69 70 74 69 6f 6e 00 00 00 |e description...| 0080 00 00 00 00 00 00 00 00 00 48 00 37 00 00 00 37 |.H.7...7| 0090 50 61 67 65 20 6f 72 64 65 72 73 3a 20 6c 69 6e |Page orders: lin| 00a0 65 61 72 20 6d 61 70 70 69 6e 67 20 3d 20 31 36 |ear mapping = 16| 00b0 2c 20 76 69 72 74 75 61 6c 20 3d 20 31 36 2c 20 |, virtual = 16, | 00c0 69 6f 20 3d 20 31 32 00 00 00 00 00 00 00 00 00 |io = 12.| 00d0 00 24 00 12 00 00 00 36 55 73 69 6e 67 20 31 54 |.$.6Using 1T| 00e0 42 20 73 65 67 6d 65 6e 74 73 00 00 00 00 00 00 |B segments..| 00f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 || * 000186a0 that, which is to have IR and DR off in paca->kernel_msr until we're ready to turn the MMU on. That might help debuggability in the case you're hitting, whether or not it's htab_bolt_mapping failing. Are you *absolutely* sure that QEMU is using the guest HTAB to translate the 0xc... addresses? If it is actually doing so it would No, you're right. I got confused with PR KVM. I'm surprised QEMU is able to resolve anything at all really, without access to the HTAB. But it probably just saw that MSR.DR=0, so it simply used the real mode algorithm to read the data which happened to work correctly, as the virtual address is a valid real mode address as well. Sorry for the incorrect assumption. need to be using the relatively new KVM_PPC_GET_HTAB_FD ioctl, and I thought the only place that was used was in the migration code.
To debug this sort of thing, what I usually do is patch the guest kernel to put a branch to self at 0x400. Then when it hangs you have some chance of sorting out what happened using info registers etc. Now if only it would happen a bit more often ;). I would be very interested to know how big a HPT the host kernel allocated for the guest and what was in it. The host kernel prints a message telling you the size and location of the HPT, and in this sort Yes. Unfortunately it doesn't tell me the PID though, so I have a hard time correlating the dmesg output with the VM. However, I'm pretty sure it's this one: Jun 13 06:31:16 build65 kernel: KVM guest htab at c0012ae0 (order 19), LPID 4 That's a 512KB map, right? Sounds too small to me :). of situation I find it helpful to take a copy of it with dd and dump it with hexdump. Too late this time around. I'll try to do it next time I see this happening :). Also, what page size are you using in the host kernel? If it's 4k, then the guest kernel is limited to using 4k pages for the linear mapping, which can mean it runs out of space in the HPT for the linear mapping more easily. In this case the host is running on 64k pages. Since you don't have my patch to add a flexible allocator for the HPT and RMA areas (you rejected it, if you recall),
Re: Book3s_hv KVM HTAB bug
On Fri, 2013-06-14 at 01:58 +0200, Alexander Graf wrote: I don't think that preallocating yet another potentially fragmented pool of bigger memory chunks - which your patch did - is the answer to this problem. We just need to defragment normal system memory and delay HPT creation until it's ready. It can't be that hard. ^ Right. Ben.