Re: Shouldn't cache=none be the default for drives?
08.04.2010 09:07, Thomas Mueller wrote:
> [...] This helped a lot: I enabled the deadline block scheduler instead
> of the default cfq on the host system. Tested with: host Debian with
> scheduler deadline, guest Win2008 with virtio and cache=none (26 MB/s
> to 50 MB/s boost measured). Maybe this is also true for Linux/Linux. I
> expect that scheduler noop for Linux guests would be good.

Hmm. I wonder why it helped. In theory, the host scheduler should not
change anything in the cache=none case, at least for raw partitions or
LVM volumes. This is because with cache=none the virtual disk image is
opened with the O_DIRECT flag, which means all I/O bypasses the host
scheduler and buffer cache. I tried a few quick tests here -- with LVM
volumes it makes no measurable difference. But if the guest disk images
are on plain files (also raw), the scheduler makes some difference, and
indeed deadline works better. Maybe you were testing with plain files
instead of block devices?

Thanks!

/mjt

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Shouldn't cache=none be the default for drives?
Michael Tokarev wrote on Thu, 08 Apr 2010 10:05:09 +0400:
> Hmm. I wonder why it helped. In theory, the host scheduler should not
> change anything in the cache=none case, at least for raw partitions or
> LVM volumes. [...] But if the guest disk images are on plain files
> (also raw), the scheduler makes some difference, and indeed deadline
> works better. Maybe you were testing with plain files instead of block
> devices?

ah yes, qcow2 images.

- Thomas
Re: Question on skip_emulated_instructions()
Gleb Natapov wrote:
> On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
>> Avi Kivity wrote:
>>> On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
>>>> The problem here is that I needed to transfer the VM state which is
>>>> just *before* the output to the devices. Otherwise, the VM state
>>>> has already advanced, and after failover some I/O didn't work as I
>>>> expected. I tracked down this issue and figured out that rip had
>>>> already advanced in KVM, and transferring this VCPU state was
>>>> meaningless. I'm planning to post the patch set of Kemari soon, but
>>>> I would like to solve this rip issue before that. If there is no
>>>> drawback, I'm happy to work on and post a patch.
>>> vcpu state is undefined when an mmio operation is pending.
>>> Documentation/kvm/api.txt says the following:
>>>
>>>   NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
>>>   corresponding operations are complete (and guest state is
>>>   consistent) only after userspace has re-entered the kernel with
>>>   KVM_RUN. The kernel side will first finish incomplete operations
>>>   and then check for pending signals. Userspace can re-enter the
>>>   guest with an unmasked signal pending to complete pending
>>>   operations.
>> Thanks for the information. So the point is that the vcpu state that
>> can be observed from qemu upon KVM_EXIT_IO, KVM_EXIT_MMIO and
>> KVM_EXIT_OSI should not be used because it's not complete/consistent?
> Definitely. The VCPU is in the middle of an instruction execution, so
> the state is undefined. One instruction may generate more than one IO
> exit during its execution, BTW.

Regarding the multiple IO exits, we're paying attention too. Although it
depends on the guest behavior, if we limit the device model, one IO exit
per instruction may be practical at the beginning. But thanks for
pointing it out.

To solve the undefined VCPU state, how about keeping a copy of the
initial state upon VMEXIT? I guess there already is a similar shadow
state in KVM. If possible, we can allocate another one for this purpose.
Re: Shouldn't cache=none be the default for drives?
Thomas Mueller wrote on Thu, 08 Apr 2010 06:09:05 +0000:
> Michael Tokarev wrote on Thu, 08 Apr 2010 10:05:09 +0400:
>> Hmm. I wonder why it helped. In theory, the host scheduler should not
>> change anything in the cache=none case [...] This is because with
>> cache=none the virtual disk image is opened with the O_DIRECT flag,
>> which means all I/O bypasses the host scheduler and buffer cache.
>> [...] Maybe you were testing with plain files instead of block
>> devices?
> ah yes, qcow2 images.

... but does the scheduler really know about O_DIRECT? Isn't O_DIRECT
meant to bypass only buffers (i.e., a write does not return before it
has really hit the disk)? My understanding is that the scheduler is a
layer further down the stack. But I'm only guessing - I'm not a kernel
hacker. :)

- Thomas
Re: Question on skip_emulated_instructions()
On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
>> Currently we complete instructions for output operations and leave
>> them incomplete for input operations. Deferring completion for output
>> operations should work, except it may break the vmware backdoor port
>> (see hw/vmport.c), which changes register state following an output
>> instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
>> following a write instruction.
>>
>> Do you really need to transfer the vcpu state before the instruction,
>> or do you just need a consistent state? If the latter, then you can
>> get away by posting a signal and re-entering the guest. kvm will
>> complete the instruction and exit immediately, and you will have
>> fully consistent state.
> The requirement is that the guest must always be able to replay at
> least the instruction which triggered the synchronization on the
> primary. From that point of view, I think I need to transfer the vcpu
> state before the instruction. If I post a signal and let the guest or
> emulator proceed, I'm not sure whether the guest on the secondary can
> replay as expected. Please point out if I am misunderstanding.

All you need is some consistent state to restart the VM from, no? So if
you transfer the VM state after the instruction that caused the IO has
completed, you can restart the VM on the secondary from that state in
case the primary fails. I guess my question is: can you make the
synchronization point be immediately after the IO instruction instead of
before?

--
	Gleb.
Re: Question on skip_emulated_instructions()
On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:
> The requirement is that the guest must always be able to replay at
> least the instruction which triggered the synchronization on the
> primary.

You have two choices:

- complete execution of the instruction in both the kernel and the
  device model

  This is what live migration does currently. Any mmio and pio requests
  are completed, the last instruction is finalized, and state is saved.

- complete execution of the instruction in the kernel, but queue
  execution of mmio/pio requests

  This is more in line with what you describe. vcpu state will be after
  the instruction, device model state will be before instruction
  completion; when you replay the queue, the device model state will be
  consistent with the vcpu state.

> From that point of view, I think I need to transfer the vcpu state
> before the instruction. If I post a signal and let the guest or
> emulator proceed, I'm not sure whether the guest on the secondary can
> replay as expected. Please point out if I am misunderstanding.

If the responses to the mmio or pio request are exactly the same, then
the replay will happen exactly the same.

--
error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote:
> You have two choices:
>
> - complete execution of the instruction in both the kernel and the
>   device model [...]
>
> - complete execution of the instruction in the kernel, but queue
>   execution of mmio/pio requests
>
>   This is more in line with what you describe. vcpu state will be
>   after the instruction, device model state will be before instruction
>   completion; when you replay the queue, the device model state will
>   be consistent with the vcpu state.

For an in or mmio read you can't complete the instruction without doing
the actual IO.

--
	Gleb.
Re: VMX and save/restore guest in virtual-8086 mode
Avi Kivity wrote:
> On 04/07/2010 11:24 PM, Marcelo Tosatti wrote:
>> During initialization, WinXP.32 switches to virtual-8086 mode, with
>> paging enabled, to use VGABIOS functions. Since enter_pmode
>> unconditionally clears IOPL and VM bits in RFLAGS
>>
>>     flags = vmcs_readl(GUEST_RFLAGS);
>>     flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
>>     flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
>>     vmcs_writel(GUEST_RFLAGS, flags);
>>
>> Looks like KVM_SET_REGS should write rmode.save_iopl (and a new
>> save_vm)?

Just like we manipulate the flags for guest debugging in the
set/get_rflags vendor handlers, the same should happen for IOPL and VM.
This is no business of enter_pmode/rmode.

> I think we have a small related bug in realmode emulation - we run the
> guest with iopl=3. This means the guest can use pushfl and see the
> host iopl instead of the guest iopl. We should run with iopl=0, which
> causes pushfl/popfl to #GP, where we can emulate the flags correctly
> (by updating rmode.save_iopl and rmode.save_vm). That has lots of
> implications, however...
>
>> And the order of loading state is set_regs (rflags) followed by
>> set_sregs (cr0), so these bits are lost across save/restore:
>>
>>     savevm 1
>>     kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=33286
>>     system_reset
>>     loadvm 1
>>     kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=10286
>>     cont
>>     kvm: unhandled exit 8021
>>     kvm_run returned -22
>>
>> The following patch fixes it, but it has some drawbacks:
>>
>> - cpu_synchronize_state+writeback is noticeably slow with tpr
>>   patching, this makes it slower.

Isn't it a very rare event? It has to be - otherwise the decision to go
for a full sync and individual get/set ioctls would have been wrong.
What happens during tpr patching?

>> - Should be conditional on VMX && !unrestricted guest.

Userspace should know nothing of this mess.

>> - It's a fugly workaround.

True. Still likely the way to go for old kernels.
Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Setting nx bit in virtual CPU
On 04/08/2010 02:13 AM, Richard Simpson wrote:
>>> gordon Code # ./check-nx
>>> nx: enabled
>>> gordon Code #
>>>
>>> OK, seems to be enabled just fine. Any other ideas? I am beginning
>>> to get that horrible feeling that there isn't a real problem and it
>>> is just me being dumb! I really hope so, because I am out of
>>> ideas... :)
>> Can you verify check-nx returns disabled on the guest? Does
>> /proc/cpuinfo show nx in the guest?
> OK, time for a summary:
>
> Host:  /proc/cpuinfo shows 'nx' and check-nx shows 'enabled'
> Guest: /proc/cpuinfo doesn't show nx and check-nx shows 'disabled'

Strange. Can you hack qemu-kvm's cpuid code where it issues the ioctl
KVM_SET_CPUID2 to show what the data is? I'm not sure where that code is
in your version of qemu-kvm.

--
error compiling committee.c: too many arguments to function
Re: VMX and save/restore guest in virtual-8086 mode
On 04/08/2010 10:22 AM, Jan Kiszka wrote:
> Avi Kivity wrote:
>> On 04/07/2010 11:24 PM, Marcelo Tosatti wrote:
>>> During initialization, WinXP.32 switches to virtual-8086 mode, with
>>> paging enabled, to use VGABIOS functions. [...] Looks like
>>> KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?
> Just like we manipulate the flags for guest debugging in the
> set/get_rflags vendor handlers, the same should happen for IOPL and
> VM. This is no business of enter_pmode/rmode.

This is vendor-specific code, and it isn't manipulating guest values,
only host values (->set_rflags() is called when the guest value changes,
which isn't happening here). Of course, some refactoring will be helpful
here.

>>> The following patch fixes it, but it has some drawbacks:
>>>
>>> - cpu_synchronize_state+writeback is noticeably slow with tpr
>>>   patching, this makes it slower.
> Isn't it a very rare event? It has to be - otherwise the decision to
> go for a full sync and individual get/set ioctls would have been
> wrong. What happens during tpr patching?

tpr patching listens for instructions which access the tpr and patches
them to a call instruction (targeting some hacky code in the BIOS).
Since there are a limited number of such instructions (20-30 IIRC) you
expect tpr patching to happen very rarely.

>>> - It's a fugly workaround.
> True. Still likely the way to go for old kernels.

It's a bugfix that can go into -stable and supported distribution
kernels.

--
error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
Gleb Natapov wrote:
> On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
>> The requirement is that the guest must always be able to replay at
>> least the instruction which triggered the synchronization on the
>> primary. From that point of view, I think I need to transfer the vcpu
>> state before the instruction. [...]
> All you need is some consistent state to restart the VM from, no? So
> if you transfer the VM state after the instruction that caused the IO
> has completed, you can restart the VM on the secondary from that state
> in case the primary fails. I guess my question is: can you make the
> synchronization point be immediately after the IO instruction instead
> of before?

To answer your question, it should be possible to implement. The down
side is that after going into KVM to make the guest state consistent, we
need to go back to qemu to actually transfer the guest, and this bounce
would introduce another overhead, if I'm understanding correctly.

And yes, all I need is some consistent state to resume the VM from,
which must be able to continue I/O operations, like writing to disks and
sending acks over the network. If I can guarantee this, sending the VM
state after completing output is acceptable.
Re: Question on skip_emulated_instructions()
On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote:
> To answer your question, it should be possible to implement. The down
> side is that after going into KVM to make the guest state consistent,
> we need to go back to qemu to actually transfer the guest, and this
> bounce would introduce another overhead, if I'm understanding
> correctly.

Yes. It should be around a microsecond or so; given you will issue I/O
after this, I don't think it will affect performance.

> And yes, all I need is some consistent state to resume the VM from,
> which must be able to continue I/O operations, like writing to disks
> and sending acks over the network. If I can guarantee this, sending
> the VM state after completing output is acceptable.

I suggest you start with this. If it turns out performance is severely
impacted, we can revisit instruction completion. If performance is
satisfactory, then we'll be able to run Kemari with older kernels.

--
error compiling committee.c: too many arguments to function
Re: VMX and save/restore guest in virtual-8086 mode
Avi Kivity wrote:
> On 04/08/2010 10:22 AM, Jan Kiszka wrote:
>> Just like we manipulate the flags for guest debugging in the
>> set/get_rflags vendor handlers, the same should happen for IOPL and
>> VM. This is no business of enter_pmode/rmode.
> This is vendor-specific code, and it isn't manipulating guest values,
> only host values (->set_rflags() is called when the guest value
> changes, which isn't happening here). Of course, some refactoring will
> be helpful here.

Actually, the bug is that enter_pmode/rmode update save_iopl (and that
no one saves the VM bit). That should happen in vmx_set_rflags to also
keep track of changes _while_ we are in rmode. enter_rmode/pmode should
just trigger a set_rflags to update things. And vmx_get_rflags must
properly inject the saved flags instead of masking them out.

>> Isn't it a very rare event? It has to be - otherwise the decision to
>> go for a full sync and individual get/set ioctls would have been
>> wrong. What happens during tpr patching?
> tpr patching listens for instructions which access the tpr and patches
> them to a call instruction (targeting some hacky code in the BIOS).
> Since there are a limited number of such instructions (20-30 IIRC) you
> expect tpr patching to happen very rarely.

Then I wonder why it is noticeable.

>>> - It's a fugly workaround.
>> True. Still likely the way to go for old kernels.
> It's a bugfix that can go into -stable and supported distribution
> kernels.

Well, I would be happy to throw out tons of workarounds based on this
approach. :)

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: VMX and save/restore guest in virtual-8086 mode
On 04/08/2010 10:54 AM, Jan Kiszka wrote:
>>> Just like we manipulate the flags for guest debugging in the
>>> set/get_rflags vendor handlers, the same should happen for IOPL and
>>> VM. This is no business of enter_pmode/rmode.
>> This is vendor-specific code, and it isn't manipulating guest values,
>> only host values (->set_rflags() is called when the guest value
>> changes, which isn't happening here). Of course, some refactoring
>> will be helpful here.
> Actually, the bug is that enter_pmode/rmode update save_iopl (and that
> no one saves the VM bit). That should happen in vmx_set_rflags to also
> keep track of changes _while_ we are in rmode.

Exactly - that's what I suggested above.

> enter_rmode/pmode should just trigger a set_rflags to update things.

Not what I had in mind, but a valid implementation.

> And vmx_get_rflags must properly inject the saved flags instead of
> masking them out.

Yes. No one ever bothers to play with iopl in real mode, so we never
noticed this. We do this for cr0, for example.

>> It's a bugfix that can go into -stable and supported distribution
>> kernels.
> Well, I would be happy to throw out tons of workarounds based on this
> approach. :)

And I'll be happy to apply such patches. Just ensure that 2.6.32.y and
above have the fixes so we don't introduce regressions (I think most
workarounds are a lot older).

--
error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
Gleb Natapov wrote:
> On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote:
>> You have two choices:
>>
>> - complete execution of the instruction in both the kernel and the
>>   device model [...]
>>
>> - complete execution of the instruction in the kernel, but queue
>>   execution of mmio/pio requests [...]
> For an in or mmio read you can't complete the instruction without
> doing the actual IO.

So, if the mmio/pio requests in the queue are only out or mmio write,
Avi's suggestion no. 2 would work. But if in or mmio read are mixed in
with these, the story gets complicated. (We don't have to think about
the queue being filled with only in or mmio read, because we're
currently transferring only in the case of out or mmio write.)

>> If the responses to the mmio or pio request are exactly the same,
>> then the replay will happen exactly the same.

I agree. What I'm wondering is how we can guarantee that the responses
are the same...
Re: Question on skip_emulated_instructions()
Avi Kivity wrote:
> On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote:
>> To answer your question, it should be possible to implement. [...]
> Yes. It should be around a microsecond or so; given you will issue I/O
> after this, I don't think it will affect performance.

That is good news.

>> And yes, all I need is some consistent state to resume the VM from,
>> which must be able to continue I/O operations, like writing to disks
>> and sending acks over the network. If I can guarantee this, sending
>> the VM state after completing output is acceptable.
> I suggest you start with this. If it turns out performance is severely
> impacted, we can revisit instruction completion. If performance is
> satisfactory, then we'll be able to run Kemari with older kernels.

I was almost ready to say yes here, but let me ask one more question.
BTW, thank you two for taking the time for this discussion, which isn't
a topic on KVM itself.

If I transferred a VM after I/O operations - let's say the VM sent a TCP
ACK to the client - and if a hardware failure occurred on the primary
during the VM transfer *but the client received the TCP ACK*, the
secondary will resume from the previous state, and it may need to
receive some data from the client. However, because the client has
already received the TCP ACK, it won't resend the data to the secondary.
It looks like this data is going to be dropped. Am I missing some point
here?
Re: Question on skip_emulated_instructions()
On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote:
> If I transferred a VM after I/O operations - let's say the VM sent a
> TCP ACK to the client - and if a hardware failure occurred on the
> primary during the VM transfer *but the client received the TCP ACK*,
> the secondary will resume from the previous state, and it may need to
> receive some data from the client. However, because the client has
> already received the TCP ACK, it won't resend the data to the
> secondary. It looks like this data is going to be dropped. Am I
> missing some point here?

I think you should block I/O not at the cpu/device boundary (that's
inefficient, as many cpu I/O instructions don't necessarily cause
externally visible I/O) but at the device level. Whenever the network
device wants to send out a packet, halt the guest (letting any I/O
instructions complete), synchronize the secondary, and then release the
pending I/O. This ensures that the secondary has all of the data prior
to the ack being sent out.

--
error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
On 04/08/2010 11:10 AM, Yoshiaki Tamura wrote:
>> If the responses to the mmio or pio request are exactly the same,
>> then the replay will happen exactly the same.
> I agree. What I'm wondering is how we can guarantee that the responses
> are the same...

I don't think you can in the general case. But if you gate output at the
device level, instead of the instruction level, the problem goes away,
no?

--
error compiling committee.c: too many arguments to function
Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
On Wed, Apr 07, 2010 at 02:07:18PM -0700, David Stevens wrote:
>> Thanks! There's some whitespace damage, are you sending with your new
>> sendmail setup? It seems to have worked for qemu patches ...
> Yes, I saw some line wraps in what I received, but I checked the
> original draft to be sure and they weren't there. Possibly from the
> relay;

Sigh.

> @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>  		if (unlikely(err < 0)) {
> -			vhost_discard_vq_desc(vq);
> -			tx_poll_start(net, sock);
> +			if (err == -EAGAIN) {
> +				vhost_discard_desc(vq, 1);
> +				tx_poll_start(net, sock);
> +			} else {
> +				vq_err(vq, "sendmsg: errno %d\n", -err);
> +				/* drop packet; do not discard/resend */
> +				vhost_add_used_and_signal(&net->dev, vq,
> +							  head, 0);

>> vhost does not currently have a consistent error handling strategy:
>> if we drop packets, we need to think about which other errors should
>> cause packet drops. I prefer to just call vq_err for now, and have us
>> look at handling segfaults etc. in a consistent way separately.
> I had to add this to avoid an infinite loop when I wrote a bad packet
> on the socket. I agree error handling needs a better look, but
> retrying a bad packet continuously while dumping in the log is what it
> was doing when I hit an error before this code. Isn't this better, at
> least until a second look?

>> Hmm, what do you mean 'continuously'? Don't we only try again on the
>> next kick?
> If the packet is corrupt (in my case, a missing vnet header during
> testing), every send will fail and we never make progress. I had
> thousands of error messages in the log (for the same packet) before I
> added this code.

Hmm, we do not want a buggy guest to be able to fill host logs. This is
only if debugging is enabled though, right? We can also rate-limit the
errors.

> If the problem is with the packet, retrying the same one as the
> original code does will never recover. This isn't required for
> mergeable rx buffer support, so I can certainly remove it from this
> patch, but I think the original error handling doesn't handle a single
> corrupted packet very gracefully.

An approach I considered was to have qemu poll the vq_err fd and stop
the device when an error is seen. My concern with dropping a tx packet
is that it would make debugging very difficult.

> @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
>  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>  		vq->log : NULL;
>
> -	for (;;) {
> -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> -					 ARRAY_SIZE(vq->iov),
> -					 &out, &in,
> -					 vq_log, &log);
> +	while ((datalen = vhost_head_len(sock->sk))) {
> +		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
> +					     vq_log, &log);

>> This looks like a bug, I think we need to pass datalen + header size
>> to vhost_get_desc_n. Not sure how we know the header size that the
>> backend will use though. Maybe just look at our features.
> Yes; we have hdr_size, so I can add it here. It'll be 0 for the cases
> where the backend and guest both have a vnet header (either the
> regular or the larger mergeable-buffers one), but should be added in
> for future raw socket support.

>> So hdr_size is the wrong thing to add then. We need to add a non-zero
>> value for tap now.
> datalen includes the vnet_hdr in the tap case, so we don't need a
> non-zero hdr_size. The socket data has the entire packet and vnet_hdr,
> and that length is what we're getting from vhost_head_len().

I only see vhost_head_len returning skb->len. Are you sure skb->len
includes the vnet_hdr for tap rx?

> 		/* OK, now we need to know about added descriptors. */
> -		if (head == vq->num) {
> -			if (unlikely(vhost_enable_notify(vq))) {
> +		if (!headcount) {
> +			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
>  				/* They have slipped one in as we were
>  				 * doing that: check again. */
>  				vhost_disable_notify(vq);
> +				retries++;
>  				continue;
>  			}

Hmm. The reason we have the code at all, as the comment says, is because
the guest could have added more buffers between the time we read the
last index and the time we enabled notification. So if we just break
like this the race still exists. We could remember the last head value
we observed, and have
Re: Setting nx bit in virtual CPU
Avi Kivity wrote: On 04/07/2010 11:38 PM, Richard Simpson wrote: On 07/04/10 13:23, Avi Kivity wrote: Run as root, please. And check first that you have a file named /dev/cpu/0/msr. Doh! gordon Code # ./check-nx nx: enabled gordon Code # OK, seems to be enabled just fine. Any other ideas? I am beginning to get that horrible feeling that there isn't a real problem and it is just me being dumb! I really hope so, because I am out of ideas... :) Can you verify check-nx returns disabled on the guest? Does /proc/cpuinfo show nx in the guest? Can you try to boot the attached multiboot kernel, which just outputs a brief CPUID dump? $ qemu-kvm -kernel cpuid_mb -vnc :0 (Unfortunately I have no serial console support in there yet, so you either have to write the values down or screenshot it.) In the 4th line from the bottom it should print NX (after SYSCALL). Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany Tel: +49 351 448-3567-12
Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com --- Michael, This is a small patch for the write logging issue with the async queue. I have made a __vhost_get_vq_desc() func which may compute the log info with any valid buffer index. The __vhost_get_vq_desc() is coming from the code in vhost_get_vq_desc(). And I use it to recompute the log info when logging is enabled. Thanks Xiaohui

 drivers/vhost/net.c   |   27 ---
 drivers/vhost/vhost.c |  115 -
 drivers/vhost/vhost.h |    5 ++
 3 files changed, 90 insertions(+), 57 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2aafd90..00a45ef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -115,7 +115,8 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 	struct kiocb *iocb = NULL;
 	struct vhost_log *vq_log = NULL;
 	int rx_total_len = 0;
-	int log, size;
+	unsigned int head, log, in, out;
+	int size;
 
 	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
 		return;
@@ -130,14 +131,25 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 			iocb->ki_pos, iocb->ki_nbytes);
 		log = (int)iocb->ki_user_data;
 		size = iocb->ki_nbytes;
+		head = iocb->ki_pos;
 		rx_total_len += iocb->ki_nbytes;
 
 		if (iocb->ki_dtor)
 			iocb->ki_dtor(iocb);
 		kmem_cache_free(net->cache, iocb);
-		if (unlikely(vq_log))
+		/* when log is enabled, recomputing the log info is needed,
+		 * since these buffers are in the async queue, and may not get
+		 * the log info before.
+		 */
+		if (unlikely(vq_log)) {
+			if (!log)
+				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
+						    ARRAY_SIZE(vq->iov),
+						    &out, &in, vq_log,
+						    &log, head);
 			vhost_log_write(vq, vq_log, log, size);
+		}
 		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
@@ -313,14 +325,13 @@ static void handle_rx(struct vhost_net *net)
 	vhost_disable_notify(vq);
 	hdr_size = vq->hdr_size;
 
-	/* In async cases, for write logging, the simple way is to get
-	 * the log info always, and really logging is decided later.
-	 * Thus, when logging enabled, we can get log, and when logging
-	 * disabled, we can get log disabled accordingly.
+	/* In async cases, when write log is enabled, in case the submitted
+	 * buffers did not get log info before the log enabling, so we'd
+	 * better recompute the log info when needed. We do this in
+	 * handle_async_rx_events_notify().
 	 */
-	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
-		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
 	handle_async_rx_events_notify(net, vq);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 97233d5..53dab80 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -715,66 +715,21 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 			   struct iovec iov[], unsigned int iov_size,
 			   unsigned int *out_num, unsigned int *in_num,
-			   struct vhost_log *log, unsigned int *log_num)
+			   struct vhost_log *log, unsigned int *log_num,
+			   unsigned int head)
 {
 	struct vring_desc desc;
-	unsigned int i, head, found = 0;
-	u16 last_avail_idx;
+	unsigned int i = head, found = 0;
 	int ret;
 
-	/* Check it isn't doing very strange things with descriptor numbers. */
-	last_avail_idx = vq->last_avail_idx;
-	if (get_user(vq->avail_idx, &vq->avail->idx)) {
-		vq_err(vq, "Failed to access avail idx at %p\n",
-		       &vq->avail->idx);
-		return
Re: Question on skip_emulated_instructions()
Avi Kivity wrote: On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote: If I transferred a VM after I/O operations, let's say the VM sent a TCP ACK to the client, and if a hardware failure occurred to the primary during the VM transfer *but the client received the TCP ACK*, the secondary will resume from the previous state, and it may need to receive some data from the client. However, because the client has already received the TCP ACK, it won't resend the data to the secondary. It looks like this data is going to be dropped. Am I missing some point here? I think you should block I/O not at the cpu/device boundary (that's inefficient as many cpu I/O instructions don't necessarily cause externally visible I/O) but at the device level. Whenever the network device wants to send out a packet, halt the guest (letting any I/O instructions complete), synchronize the secondary, and then release the pending I/O. This ensures that the secondary has all of the data prior to the ack being sent out. Although I was thinking to clean up my current code, maybe I should post the current status for explanation now. As you mentioned, I'm capturing I/O at the device level, by inserting a hook inside the PIO/MMIO handlers in the virtio-blk, virtio-net and e1000 emulators. Since it's implemented naively, it'll stop (meaning I/O instructions will be delayed) until transferring the VM is done. So what I can do here is: 1. Let I/O instructions complete both at qemu and kvm. 2. Transfer the guest state. # VCPU and device model think I/O emulation is already done. 3. Finally release the pending output to the real world. If the responses to the mmio or pio request are exactly the same, then the replay will happen exactly the same. I agree. What I'm wondering is how we can guarantee that the responses are the same... I don't think you can in the general case. But if you gate output at the device level, instead of the instruction level, the problem goes away, no? Yes, it should.
To implement this, we need to make No. 3 be called asynchronously. If qemu is already handling I/O asynchronously, it would be relatively easy to make this work. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Shouldn't cache=none be the default for drives?
On Thu, Apr 08, 2010 at 10:05:09AM +0400, Michael Tokarev wrote: LVM volumes. This is because with cache=none, the virtual disk image is opened with O_DIRECT flag, which means all I/O bypasses host scheduler and buffer cache. O_DIRECT does not bypass the I/O scheduler, only the page cache.
Re: Question on skip_emulated_instructions()
On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote: I don't think you can in the general case. But if you gate output at the device level, instead of the instruction level, the problem goes away, no? Yes, it should. To implement this, we need to make No.3 to be called asynchronously. If qemu is already handling I/O asynchronously, it would be relatively easy to make this. Yes, you can release the I/O from the iothread instead of the vcpu thread. You can make virtio_net_handle_tx() disable virtio notifications and initiate state sync and return, when state sync continues you can call the original virtio_net_handle_tx(). If the secondary takes over, it needs to call the original virtio_net_handle_tx() as well. -- error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
2010/4/8 Avi Kivity a...@redhat.com: On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote: I don't think you can in the general case. But if you gate output at the device level, instead of the instruction level, the problem goes away, no? Yes, it should. To implement this, we need to make No.3 to be called asynchronously. If qemu is already handling I/O asynchronously, it would be relatively easy to make this. Yes, you can release the I/O from the iothread instead of the vcpu thread. You can make virtio_net_handle_tx() disable virtio notifications and initiate state sync and return, when state sync continues you can call the original virtio_net_handle_tx(). If the secondary takes over, it needs to call the original virtio_net_handle_tx() as well. Agreed. Let me try it. Meanwhile, I'll post what I have done including the hack preventing rip to proceed. I would appreciate if you could comment on that too, to keep things in a good direction.
Re: Question on skip_emulated_instructions()
On 04/08/2010 04:42 PM, Yoshiaki Tamura wrote: Yes, you can release the I/O from the iothread instead of the vcpu thread. You can make virtio_net_handle_tx() disable virtio notifications and initiate state sync and return, when state sync continues you can call the original virtio_net_handle_tx(). If the secondary takes over, it needs to call the original virtio_net_handle_tx() as well. Agreed. Let me try it. Meanwhile, I'll post what I have done including the hack preventing rip to proceed. I would appreciate if you could comment on that too, to keep things in a good direction. Certainly. -- error compiling committee.c: too many arguments to function
Re: [PATCH 2/3] KVM Test: Add function run_autotest_background and wait_autotest_background.
On 04/07/2010 11:49 AM, Feng Yang wrote: Add function run_autotest_background and wait_autotest_background to kvm_test_utils.py. These two functions are used in the ioquit test script. Signed-off-by: Feng Yang fy...@redhat.com ---

 client/tests/kvm/kvm_test_utils.py |   68 +++-
 1 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py
index f512044..2a1054e 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -21,7 +21,7 @@ More specifically:
 @copyright: 2008-2009 Red Hat Inc.
 
-import time, os, logging, re, commands
+import time, os, logging, re, commands, sys
 from autotest_lib.client.common_lib import error
 from autotest_lib.client.bin import utils
 import kvm_utils, kvm_vm, kvm_subprocess, scan_results
@@ -402,3 +402,69 @@ def run_autotest(vm, session, control_path, timeout, test_name, outputdir):
         result = bad_results[0]
         raise error.TestFail("Test '%s' ended with %s (reason: '%s')"
                              % (result[0], result[1], result[3]))
+
+
+def run_autotest_background(vm, session, control_path, timeout, test_name,
+                            outputdir):
+    """
+    Wrapper of run_autotest() that makes it run in the background through
+    fork(), letting it run in the child process.
+    1) Flush the stdio.
+    2) Build test params which are received from arguments and used by
+       run_autotest()
+    3) Fork the process and let run_autotest() run in the child
+    4) Catch any exception raised by run_autotest() and exit the child with
+       a non-zero return code.
+    5) If no exception is caught, return 0
+
+    @param vm: VM object.
+    @param session: A shell session on the VM provided.
+    @param control: An autotest control file.
+    @param timeout: Timeout under which the autotest test must complete.
+    @param test_name: Autotest client test name.
+    @param outputdir: Path on host where we should copy the guest autotest
+                      results to.
+    """
+    def flush():
+        sys.stdout.flush()
+        sys.stderr.flush()
+
+    logging.info("Running autotest background ...")
+    flush()
+    pid = os.fork()
+    if pid:
+        # Parent process
+        return pid
+
+    try:
+        # Launch autotest
+        logging.info("child process of run_autotest_background")
+        run_autotest(vm, session, control_path, timeout, test_name, outputdir)
+    except error.TestFail, message_fail:
+        logging.info("[Autotest Background FAIL] %s" % message_fail)
+        os._exit(1)
+    except error.TestError, message_error:
+        logging.info("[Autotest Background ERROR] %s" % message_error)
+        os._exit(2)
+    except:
+        os._exit(3)
+
+    logging.info("[Autotest Background GOOD]")
+    os._exit(0)
+
+
+def wait_autotest_background(pid):
+    """
+    Wait for background autotest to finish.
+
+    @param pid: Pid of the child process executing background autotest
+    """
+    logging.info("Waiting for background autotest to finish ...")
+
+    (pid, s) = os.waitpid(pid, 0)
+    status = os.WEXITSTATUS(s)
+    if status != 0:
+        return False
+    return True

I think these functions are unnecessary. IMO forking is not the clean way of running autotest in the background. The kvm_shell_session object, used to run autotest in the guest, by default runs things in the background (e.g. session.sendline() returns immediately). run_autotest(), which uses kvm_shell_session, blocks until the autotest test is done. So in order to run autotest in the background, we should modify run_autotest(), or break it up into smaller parts, to make it nonblocking. There's no need to implement yet another wrapper.
Re: [RFC] Unify KVM kernel-space and user-space code into a single project
Avi Kivity wrote: On 03/24/2010 06:40 PM, Joerg Roedel wrote: Looks trivial to find a guest, less so with enumerating (still doable). Not so trivial and even more likely to break. Even if perf has the pid of the process and wants to find the directory, it has to do: 1. Get the uid of the process 2. Find the username for the uid 3. Use the username to find the home directory Steps 2 and 3 need nsswitch and/or pam access to get this information from whatever source the admin has configured. And depending on what the source is, it may be temporarily unavailable, causing nasty timeouts. In short, there are many weak parts in that chain, making it more likely to break. It's true. If the kernel provides something, there are fewer things that can break. But if your system is so broken that you can't resolve uids, fix that before running perf. Must we design perf for that case? uid to username can fail when using chroots, or worse, point to an incorrect location (and yes, I do use this). Sorry if this has been covered / discussion has moved on. Just catching up with the 500+ messages in my inbox.. Antoine After all, 'ls -l' will break under the same circumstances. It's hard to imagine doing useful work when that doesn't work. A kernel-based approach with /proc/pid/kvm does not have those issues (and to repeat myself, it is independent from the userspace being used). It has other issues, which are IMO more problematic.
Re: [PATCH 3/3] KVM Test: Add ioquit test case
On 04/07/2010 11:49 AM, Feng Yang wrote: Signed-off-by: Feng Yang fy...@redhat.com ---

 client/tests/kvm/tests/ioquit.py       |   54 
 client/tests/kvm/tests_base.cfg.sample |    4 ++
 2 files changed, 58 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/ioquit.py

diff --git a/client/tests/kvm/tests/ioquit.py b/client/tests/kvm/tests/ioquit.py
new file mode 100644
index 000..c75a0e3
--- /dev/null
+++ b/client/tests/kvm/tests/ioquit.py
@@ -0,0 +1,54 @@
+import logging, time, random, signal, os
+from autotest_lib.client.common_lib import error
+import kvm_test_utils, kvm_utils
+
+
+def run_ioquit(test, params, env):
+    """
+    Emulate the poweroff under IO workload (dbench so far) using monitor
+    command 'quit'.
+
+    @param test: Kvm test object
+    @param params: Dictionary with the test parameters.
+    @param env: Dictionary with test environment.
+    """

- Can you explain the goal of this test? Why quit a VM under IO workload? What results do you expect? How can the test ever fail? - Why is dbench any better for this purpose than dd or some other simple command? Using dbench isn't necessarily bad, I'm just curious.

+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm,
+                  timeout=int(params.get("login_timeout", 360)))
+    session2 = kvm_test_utils.wait_for_login(vm,
+                  timeout=int(params.get("login_timeout", 360)))
+    def is_autotest_launched():
+        if session.get_command_status("pgrep autotest") != 0:
+            logging.debug("Autotest process not found")
+            return False
+        return True
+
+    test_name = params.get("background_test", "dbench")
+    control_file = params.get("control_file", "dbench.control")
+    timeout = int(params.get("test_timeout", 300))
+    control_path = os.path.join(test.bindir, "autotest_control",
+                                control_file)
+    outputdir = test.outputdir
+
+    pid = kvm_test_utils.run_autotest_background(vm, session2, control_path,
+                                                 timeout, test_name,
+                                                 outputdir)

As mentioned in the other message, I don't think it's necessary to fork and use run_autotest() in a separate process. Instead we should modify run_autotest() to support non-blocking operation (if we need that at all).

+    if pid < 0:
+        raise error.TestError("Could not create child process to execute "
+                              "autotest background")
+
+    if kvm_utils.wait_for(is_autotest_launched, 240, 0, 2):
+        logging.debug("Background autotest launched successfully")
+    else:
+        logging.debug("Background autotest failed, start the test anyway")
+
+    time.sleep(100 + random.randrange(0, 100))
+    logging.info("Kill the virtual machine")
+    vm.process.close()

This will do a 'kill -9' on the qemu process. Didn't you intend to use a 'quit'? To do that, you should use vm.destroy(gracefully=False).

+    logging.info("Kill the tracking process")
+    kvm_utils.safe_kill(pid, signal.SIGKILL)
+    kvm_test_utils.wait_autotest_background(pid)
+    session.close()
+    session2.close()
+
diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample
index 9b12fc2..d8530f6 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -305,6 +305,10 @@ variants:
     - ksm_parallel:
         ksm_mode = parallel
+    - ioquit:
+        type = ioquit
+        control_file = dbench.control.200
+        background_test = dbench

You should probably add extra_params += " -snapshot" because this test can break the filesystem.

 # system_powerdown, system_reset and shutdown *must* be the last ones
 # defined (in this order), since the effect of such tests can leave
 # the VM on a bad state.
Re: VMX and save/restore guest in virtual-8086 mode
On Thu, Apr 08, 2010 at 11:05:56AM +0300, Avi Kivity wrote: On 04/08/2010 10:54 AM, Jan Kiszka wrote: Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)? Just like we manipulate the flags for guest debugging in the set/get_rflags vendor handlers, the same should happen for IOPL and VM. This is no business of enter_pmode/rmode. This is vendor specific code, and it isn't manipulating guest values, only host values (->set_rflags() is called when the guest value changes, which isn't happening here). Of course some refactoring will be helpful here. Actually, the bug is that enter_pmode/rmode update save_iopl (and that no one saves the VM bit). That should happen in vmx_set_rflags to also keep track of changes _while_ we are in rmode. Exactly - that's what I suggested above. And a new ioctl to save/restore save_iopl/save_vm. enter_rmode/pmode should just trigger a set_rflags to update things. Not what I had in mind, but a valid implementation. And vmx_get_rflags must properly inject the saved flags instead of masking them out. Yes. No one ever bothers to play with iopl in real mode, so we never noticed this. We do this for cr0 for example. It's a bugfix that can go into -stable and supported distribution kernels. Well, I would be happy to throw out tons of workarounds based on this approach. :) Do you mean you'd be interested in writing the patch? Sure, go ahead, let me know otherwise. And I'll be happy to apply such patches. Just ensure that 2.6.32.y and above have the fixes so we don't introduce regressions (I think most workarounds are a lot older). -- error compiling committee.c: too many arguments to function
Re: VMX and save/restore guest in virtual-8086 mode
On 04/08/2010 05:16 PM, Marcelo Tosatti wrote: On Thu, Apr 08, 2010 at 11:05:56AM +0300, Avi Kivity wrote: On 04/08/2010 10:54 AM, Jan Kiszka wrote: Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)? Just like we manipulate the flags for guest debugging in the set/get_rflags vendor handlers, the same should happen for IOPL and VM. This is no business of enter_pmode/rmode. This is vendor specific code, and it isn't manipulating guest values, only host values (->set_rflags() is called when the guest value changes, which isn't happening here). Of course some refactoring will be helpful here. Actually, the bug is that enter_pmode/rmode update save_iopl (and that no one saves the VM bit). That should happen in vmx_set_rflags to also keep track of changes _while_ we are in rmode. Exactly - that's what I suggested above. And a new ioctl to save/restore save_iopl/save_vm. That ioctl already exists, KVM_{GET,SET}_REGS. We're writing via KVM_SET_SREGS eflags.vm=1 and eflags.iopl=3 while cr0.pe=0. vmx_set_rflags() notices this and sets rmode.save_vm=1 and rmode.save_iopl=3. Next we write via KVM_SET_SREGS cr0.pe=1. So we call enter_pmode(), and recover eflags.vm and eflags.iopl from rmode.vm and rmode.iopl. Win! It's similar to how we handle cr0.ts: sometimes the host owns it so we keep it in a shadow register, sometimes the guest owns it so we keep it in cr0. It's a bugfix that can go into -stable and supported distribution kernels. Well, I would be happy to throw out tons of workarounds based on this approach. :) Do you mean you'd be interested in writing the patch? Sure, go ahead, let me know otherwise. I took it to mean he wants to kill the other qemu workarounds for kernel bugs. -- error compiling committee.c: too many arguments to function
Re: VMX and save/restore guest in virtual-8086 mode
On Thu, Apr 08, 2010 at 09:54:35AM +0200, Jan Kiszka wrote: The following patch fixes it, but it has some drawbacks: - cpu_synchronize_state+writeback is noticeably slow with tpr patching, this makes it slower. Isn't it a very rare event? It has to be - otherwise the decision to go for full sync and individual get/set IOCTLs would have been wrong. What happens during tpr patching? tpr patching listens for instructions which access the tpr and patches them to a call instruction (targeting some hacky code in the bios). Since there are a limited number of such instructions (20-30 IIRC) you expect tpr patching to happen very rarely. Then I wonder why it is noticeable. While switching kvm-tpr-opt.c from explicit {get,put}_{s}regs to cpu_synchronize_state+writeback I noticed WinXP.32 boot became visually slower. For some reason, the delay introduced by cpu_synchronize_state+writeback forbids patching certain instructions for longer periods, or somehow allows Windows to use unpatched instructions more often (a guess). End result was 4x more patching (from 700 to 4000, roughly). Confirmed it was a timing issue by introducing delays to the original {get,put}_{s}regs version. The particular tpr case is no big deal since as mentioned it's a short-lived period, but for things like Kemari this might be an issue. But this is another discussion.
Re: VMX and save/restore guest in virtual-8086 mode
Marcelo Tosatti wrote: On Thu, Apr 08, 2010 at 11:05:56AM +0300, Avi Kivity wrote: On 04/08/2010 10:54 AM, Jan Kiszka wrote: Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)? Just like we manipulate the flags for guest debugging in the set/get_rflags vendor handlers, the same should happen for IOPL and VM. This is no business of enter_pmode/rmode. This is vendor specific code, and it isn't manipulating guest values, only host values (->set_rflags() is called when the guest value changes, which isn't happening here). Of course some refactoring will be helpful here. Actually, the bug is that enter_pmode/rmode update save_iopl (and that no one saves the VM bit). That should happen in vmx_set_rflags to also keep track of changes _while_ we are in rmode. Exactly - that's what I suggested above. And a new ioctl to save/restore save_iopl/save_vm. No need. The information will all be contained in eflags and cr0 as returned to userspace. The bug is that the wrong information is currently returned, thus saved/restored. enter_rmode/pmode should just trigger a set_rflags to update things. Not what I had in mind, but a valid implementation. And vmx_get_rflags must properly inject the saved flags instead of masking them out. Yes. No one ever bothers to play with iopl in real mode, so we never noticed this. We do this for cr0 for example. It's a bugfix that can go into -stable and supported distribution kernels. Well, I would be happy to throw out tons of workarounds based on this approach. :) Do you mean you'd be interested in writing the patch? Sure, go ahead, let me know otherwise. ATM, I'm fighting against too many customer bugs. And you have the test case for this particular issue, I assume. So don't wait for me here.
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
[GSoC 2010] Pass-through filesystem support.
Hi, Now that Cam is almost done with his ivshmem patches, I was thinking of another idea for GSoC which is improving the pass-through filesystems. I've got some questions on that: 1- What does the community prefer to use and improve? CIFS, 9p, or both? And which is better taken up for GSoC. 2- With respect to CIFS, I wonder how the shares are supposed to be exposed to the guest. Should the Samba server be modified to be able to use unix domain sockets instead of TCP ports, with QEMU communicating on these sockets? With that approach, how should the guest be able to see the exposed share? And what is the problem of using Samba with TCP ports? 3- In addition, I see the idea mentions that some Windows code needs to be written to use network shares on a special interface. What's that interface? And what's the nature of that Windows code? (a driver a la guest additions?) Regards, Mohammed
Re: [GSoC 2010] Pass-through filesystem support.
On Thu, Apr 8, 2010 at 6:01 PM, Mohammed Gamal m.gamal...@gmail.com wrote: Hi, Now that Cam is almost done with his ivshmem patches, I was thinking of another idea for GSoC which is improving the pass-through filesystems. I've got some questions on that: 1- What does the community prefer to use and improve? CIFS, 9p, or both? And which is better taken up for GSoC. 2- With respect to CIFS, I wonder how the shares are supposed to be exposed to the guest. Should the Samba server be modified to be able to use unix domain sockets instead of TCP ports, with QEMU communicating on these sockets? With that approach, how should the guest be able to see the exposed share? And what is the problem of using Samba with TCP ports? 3- In addition, I see the idea mentions that some Windows code needs to be written to use network shares on a special interface. What's that interface? And what's the nature of that Windows code? (a driver a la guest additions?) Regards, Mohammed P.S.: A gentle reminder. The proposal submission deadline is tomorrow, so I'd appreciate responses as soon as possible. Regards, Mohammed
Re: [RFC] vhost-blk implementation
On Fri, Mar 26, 2010 at 6:53 PM, Eran Rom er...@il.ibm.com wrote: Christoph Hellwig hch at infradead.org writes: Ok. cache=writeback performance is something I haven't bothered looking at at all. For cache=none any streaming write or random workload with large enough record sizes got basically the same performance as native using kernel aio, and same for write but slightly degraded for reads using the thread pool. See my attached JLS presentation for some numbers. Looks like the presentation did not make it... I am interested in the JLS presentation too. Here is what I found, hope it's the one you meant, Christoph: http://events.linuxfoundation.org/images/stories/slides/jls09/jls09_hellwig.odp Stefan
Re: [GSoC 2010] Pass-through filesystem support.
On Thu, Apr 8, 2010 at 5:02 PM, Mohammed Gamal m.gamal...@gmail.com wrote: On Thu, Apr 8, 2010 at 6:01 PM, Mohammed Gamal m.gamal...@gmail.com wrote: 1- What does the community prefer to use and improve? CIFS, 9p, or both? And which is better taken up for GSoC. There have been recent patches for filesystem passthrough using 9P: http://www.mail-archive.com/qemu-de...@nongnu.org/msg28100.html You might want to consider them if you haven't seen them already. Stefan
Problem with KVM guest switching to x86 long mode
Hi! I am working on a light-weight KVM userspace launcher for Linux and am a bit stuck with a guest Linux kernel restarting when it tries to enter long mode. The register dump looks like this:

penb...@tiger:~/vm$ ./kvm bzImage
KVM exit reason: 8 (KVM_EXIT_SHUTDOWN)
Registers:
 rip: 001000ed  rsp: 005d54b8  flags: 00010046
 rax: 8001  rbx: 01f2c000  rcx: c080  rdx:
 rsi: 00013670  rdi: 02408000  rbp: 0010
 r8:   r9:   r10:  r11:  r12:  r13:  r14:  r15:
 cr0: 8011  cr2: 001000ed  cr3: 02402000  cr4: 0020  cr8:
Segment registers:
 register selector base limit type p dpl db s l g avl
 cs   0010  0b1 0 1 1 0 1 0
 ss   0018  031 0 1 1 0 1 0
 ds   0018  031 0 1 1 0 1 0
 es   0018  031 0 1 1 0 1 0
 fs   0018  031 0 1 1 0 1 0
 gs   0018  031 0 1 1 0 1 0
 tr   0020  1000 0067 0b1 0 0 0 0 0 0
 ldt  000 0 0 0 0 0 0
 [ efer: 0500  apic base:  nmi: disabled ]
Interrupt bitmap:
Code: 08 49 75 f3 8d 83 00 60 4d 00 0f 22 d8 b9 80 00 00 c0 0f 32 0f ba e8 08 0f 30 6a 10 8d 85 00 02 00 00 50 b8 01 00 00 80 0f 22 c0 cb f4 eb fd 9c 6a 00 9d 9c 58 89 c3 35 00 00 20 00 50 9d 9c 58

Using Linux 'scripts/decodecode', we can see that we are at startup_32() of arch/x86/boot/compressed/head_64.S:

All code
  0:  08 49 75              or     %cl,0x75(%rcx)
  3:  f3 8d 83 00 60 4d 00  repz lea 0x4d6000(%rbx),%eax
  a:  0f 22 d8              mov    %rax,%cr3
  d:  b9 80 00 00 c0        mov    $0xc080,%ecx
 12:  0f 32                 rdmsr
 14:  0f ba e8 08           bts    $0x8,%eax
 18:  0f 30                 wrmsr
 1a:  6a 10                 pushq  $0x10
 1c:  8d 85 00 02 00 00     lea    0x200(%rbp),%eax
 22:  50                    push   %rax
 23:  b8 01 00 00 80        mov    $0x8001,%eax
 28:  0f 22 c0              mov    %rax,%cr0
 2b:* cb                    lret   <-- trapping instruction
 2c:  f4                    hlt
 2d:  eb fd                 jmp    0x2c
 2f:  9c                    pushfq
 30:  6a 00                 pushq  $0x0
 32:  9d                    popfq
 33:  9c                    pushfq
 34:  58                    pop    %rax
 35:  89 c3                 mov    %eax,%ebx
 37:  35 00 00 20 00        xor    $0x20,%eax
 3c:  50                    push   %rax
 3d:  9d                    popfq
 3e:  9c                    pushfq
 3f:  58                    pop    %rax

I already asked Avi in private about this and he suggested I'd post a register dump to the list. Please note that I am in no way ruling out a bug in our fakebios emulation but my gut feeling is that I'm just missing something obvious in the KVM setup.
For those who might be interested, the source code for the launcher is available here:

 git clone git://github.com/penberg/vm.git

Launching a Linux kernel is as simple as: make ; ./kvm bzImage

Pekka
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with KVM guest switching to x86 long mode
On 04/08/2010 09:26 PM, Pekka Enberg wrote:
[]
Segment registers:
 register selector base limit type p dpl db s l g avl
 cs  0010 0b1 0 1 1 0 1 0
 ss  0018 031 0 1 1 0 1 0
 ds  0018 031 0 1 1 0 1 0
 es  0018 031 0 1 1 0 1 0
 fs  0018 031 0 1 1 0 1 0
 gs  0018 031 0 1 1 0 1 0
 tr  0020 1000 0067 0b1 0 0 0 0 0 0
 ldt 000 0 0 0 0 0 0

These all look reasonable. Please add a gdtr dump and an idtr dump.

 2b:* cb  lret  -- trapping instruction

Post the two u32s at ss:rsp - ss:rsp+8. That will tell us where the guest is trying to return. Actually, from the dump:

 1a: 6a 10              pushq $0x10
 1c: 8d 85 00 02 00 00  lea   0x200(%rbp),%eax
 22: 50                 push  %rax

it looks like you're returning to segment 0x10; this should be the word at ss:rsp+4. So if you dump the 2 u32s at gdtr.base+0x10..gdtr.base+0x18, we'll see if there's anything wrong with the segment descriptor.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: Problem with KVM guest switching to x86 long mode
Avi Kivity wrote:

These all look reasonable. Please add a gdtr dump and an idtr dump.

Done.

[]
So if you dump the 2 u32s at gdtr.base+0x10..gdtr.base+0x18 we'll see if there's anything wrong with the segment descriptor.

Here you go:

penb...@tiger:~/vm$ ./kvm bzImage
KVM exit reason: 8 (KVM_EXIT_SHUTDOWN)
Registers:
[]
 gdt 005ca458 0030
 idt
[ efer: 0500 apic base: nmi: disabled ]
Interrupt bitmap:
Code: 08 49 75 f3 8d 83 00 60 4d 00 0f 22 d8 b9 80 00 00 c0 0f 32 0f ba e8 08 0f 30 6a 10 8d 85 00 02 00 00 50 b8 01 00 00 80 0f 22 c0 cb f4 eb fd 9c 6a 00 9d 9c 58 89 c3 35 00 00 20 00 50 9d 9c 58
Stack:
 0x005d54b8: 00 02 10 00 10 00 00 00 -- return value
 0x005d54c0: 00 00 00 00 00 00 00 00
 0x005d54c8: 00 00 00 00 00 00 00 00
 0x005d54d0: 00 00 00 00 00 00 00 00
GDT:
 0x005ca458: 30 00 58 a4 5c 00 00 00
 0x005ca460: 00 00 00 00 00 00 00 00
 0x005ca468: ff ff 00 00 00 9a af 00 -- gdtr.base + 0x10
 0x005ca470: ff ff 00 00 00 92 cf 00
 0x005ca478: 00 00 00 00 00 89 80 00
 0x005ca480: 00 00 00 00 00 00 00 00

Pekka
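As an aside (my own sketch, not part of the original mail): the eight bytes Pekka dumped at gdtr.base+0x10 can be decoded by hand using the standard x86 segment-descriptor layout, and they come out as a valid 64-bit code segment, so the GDT entry itself looks sane:

```python
def decode_descriptor(raw):
    """Decode an 8-byte x86 segment descriptor (list of bytes, little-endian)."""
    limit  = raw[0] | (raw[1] << 8) | ((raw[6] & 0x0f) << 16)
    base   = raw[2] | (raw[3] << 8) | (raw[4] << 16) | (raw[7] << 24)
    access = raw[5]                 # P / DPL / S / type byte
    l_bit  = (raw[6] >> 5) & 1      # 64-bit code segment flag
    d_bit  = (raw[6] >> 6) & 1      # default operand size (must be 0 when L=1)
    g_bit  = (raw[6] >> 7) & 1      # granularity
    return dict(limit=limit, base=base, access=access, L=l_bit, D=d_bit, G=g_bit)

# bytes at gdtr.base + 0x10 from the dump: ff ff 00 00 00 9a af 00
d = decode_descriptor([0xff, 0xff, 0x00, 0x00, 0x00, 0x9a, 0xaf, 0x00])
print(d)  # access 0x9a = present, DPL 0, code segment; L=1, D=0: a proper long-mode CS
```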
hugetlbfs and KSM
Hi,

running Debian Squeeze with a 2.6.32-3-amd64 kernel and qemu-kvm 0.12.3, I enabled hugetlbfs today on a rather small box with about five similar VMs (all Debian Squeeze amd64, but running different services).

Pro:
* system load on the host has gone way down (by about 50%)

Contra:
* KSM seems to be largely ineffective (from about 100MB saved down to 1.3MB saved)

Am I doing something wrong? Is this a bug? Is this generally impossible with large pages (which might explain the lower load on the host, if large pages are not scanned)? Or is it just way less likely to have identical pages at that size?

Bernhard
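For scale, a back-of-the-envelope calculation (mine, not from the thread) shows how coarse the merge granularity would become even if huge pages were scanned:

```python
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024

saved_before = 100 * 1024 * 1024  # ~100MB reportedly saved with 4KiB pages

# Number of individual identical pages KSM merged to save that much memory
pages_4k = saved_before // PAGE_4K
print(pages_4k)  # 25600 separate 4KiB merge opportunities

# One 2MiB huge page spans 512 small pages, so two guests would have to
# agree on an entire 2MiB region byte-for-byte for a single merge
print(PAGE_2M // PAGE_4K)  # 512
```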
Re: Problem with KVM guest switching to x86 long mode
On 04/08/2010 09:59 PM, Pekka Enberg wrote:
[]
Here you go:

I was asking for the wrong things.

penb...@tiger:~/vm$ ./kvm bzImage
KVM exit reason: 8 (KVM_EXIT_SHUTDOWN)
Registers:
 rip: 001000ed rsp: 005d54b8 flags: 00010046
[]
 cr0: 8011 cr2: 001000ed cr3: 02402000

cr2 points at rip. So it isn't lret not executing correctly, it's the cpu not being able to fetch lret at all. The code again:

 23: b8 01 00 00 80  mov $0x8001,%eax
 28: 0f 22 c0        mov %rax,%cr0
 2b:* cb             lret -- trapping instruction

The instruction at 0x28 is enabling paging, and the next insn fetch faults, so the paging structures must be incorrect. Questions:

- what is the u64 at cr3? (call it pte4)
- what is the u64 at (pte4 & ~0xfff)? (call it pte3)
- what is the u64 at (pte3 & ~0xfff)? (pte2)
- what is the u64 at ((pte2 & ~0xfff) + 2048)? (pte1)

Note: if bit 7 of pte2 is set, then pte1 is unneeded.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
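Avi's four questions correspond to a 4-level x86-64 page-table walk for the faulting rip 0x1000ed. A quick sketch (my own, not from the thread) of how those byte offsets fall out of the address bits:

```python
def walk_offsets(vaddr):
    """Byte offsets of the PML4E/PDPTE/PDE/PTE entries within their tables
    for a 48-bit virtual address, assuming 4KiB pages at every level."""
    pml4 = (vaddr >> 39) & 0x1ff   # bits 47..39 index the PML4
    pdpt = (vaddr >> 30) & 0x1ff   # bits 38..30 index the PDPT
    pd   = (vaddr >> 21) & 0x1ff   # bits 29..21 index the page directory
    pt   = (vaddr >> 12) & 0x1ff   # bits 20..12 index the page table
    return [i * 8 for i in (pml4, pdpt, pd, pt)]  # 8 bytes per entry

# rip where the instruction fetch faulted
print(walk_offsets(0x1000ed))  # [0, 0, 0, 2048]
```

The first three indices are zero and the last is 2048, which is exactly why pte4..pte2 sit at the start of their tables while pte1 is at "(pte2 & ~0xfff) + 2048".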
Re: hugetlbfs and KSM
I asked this question quite a while ago; it seems huge pages do not get scanned for merging.

David Martin

- Bernhard Schmidt be...@birkenwald.de wrote:
[]
Re: [PATCH UNTESTED] KVM: VMX: Save/restore rflags.vm correctly in real mode
On Thu, Apr 08, 2010 at 06:19:35PM +0300, Avi Kivity wrote:

Currently we set eflags.vm unconditionally when entering real mode emulation through virtual-8086 mode, and clear it unconditionally when we enter protected mode. This means that the following sequence

 KVM_SET_REGS  (rflags.vm=1)
 KVM_SET_SREGS (cr0.pe=1)

ends up with rflags.vm clear due to KVM_SET_SREGS triggering enter_pmode(). Fix by shadowing rflags.vm (and rflags.iopl) correctly while in real mode: reads and writes to those bits access a shadow register instead of the actual register.

Signed-off-by: Avi Kivity a...@redhat.com

Tested and applied, thanks.
kvm autotest, how to disable address cache
Is there any way to disable this? I'm running a guest on -net user networking with no interaction with the host network, yet during the test I get tons of:

15:50:48 DEBUG| (address cache) Adding cache entry: 00:1a:64:39:04:91 --- 10.0.253.16
15:50:49 DEBUG| (address cache) Adding cache entry: e4:1f:13:2c:e5:04 --- 10.0.253.132

many times for the same mapping. If I'm not using tap networking on a public bridge, what's this address cache doing for me? And how the heck do I turn this off?

--
Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx ry...@us.ibm.com
Re: Setting nx bit in virtual CPU
On 08/04/10 09:52, Andre Przywara wrote:

Can you try to boot the attached multiboot kernel, which just outputs a brief CPUID dump?

 $ qemu-kvm -kernel cpuid_mb -vnc :0

(Unfortunately I have no serial console support in there yet, so you either have to write the values down or take a screenshot.) In the 4th line from the bottom it should print NX (after SYSCALL).

OK, that was fun! The resulting screen shots are attached.

...default.png: with the command line above.
...cpu_host.png: with the -cpu host option added.
...no_kvm.png: with the -no-kvm option added.

I hope that helps!

Richard

attachment: cpuid_mb_screendump_cpu_host.png
attachment: cpuid_mb_screendump_default.png
attachment: cpuid_mb_screendump_no_kvm.png
Re: raw disks no longer work in latest kvm (kvm-88 was fine)
Antoine Martin wrote:
On 03/08/2010 02:35 AM, Avi Kivity wrote:
On 03/07/2010 09:25 PM, Antoine Martin wrote:
On 03/08/2010 02:17 AM, Avi Kivity wrote:
On 03/07/2010 09:13 PM, Antoine Martin wrote:

What version of glibc do you have installed?

Latest stable:
 sys-devel/gcc-4.3.4
 sys-libs/glibc-2.10.1-r1

$ git show glibc-2.10~108 | head
commit e109c6124fe121618e42ba882e2a0af6e97b8efc
Author: Ulrich Drepper drep...@redhat.com
Date: Fri Apr 3 19:57:16 2009 +

 * misc/Makefile (routines): Add preadv, preadv64, pwritev, pwritev64.
 * misc/Versions: Export preadv, preadv64, pwritev, pwritev64 for GLIBC_2.10.
 * misc/sys/uio.h: Declare preadv, preadv64, pwritev, pwritev64.
 * sysdeps/unix/sysv/linux/kernel-features.h: Add entries for preadv

You might get away with rebuilding glibc against the 2.6.33 headers.

The latest kernel headers available in gentoo (and they're masked unstable): sys-kernel/linux-headers-2.6.32. So I think I will just keep using Christoph's patch until .33 hits portage. Unless there's any reason not to? I would rather keep my system clean. I can try it though, if that helps you clear things up?

preadv/pwritev was actually introduced in 2.6.30. Perhaps you last built glibc before that? If so, a rebuild may be all that's necessary.

To be certain, I've rebuilt qemu-kvm against: linux-headers-2.6.33 + glibc-2.10.1-r1 (both freshly built). And still no go! I'm still having to use the patch which disables preadv unconditionally...
Better late than never, here's the relevant part of the strace (for the unpatched case where it fails):

stat(./fs, {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 41), ...}) = 0
open(./fs, O_RDWR|O_DIRECT|O_CLOEXEC) = 12
lseek(12, 0, SEEK_END) = 1321851815424
[pid 31266] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31266] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31266] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31266] lseek(12, 0, SEEK_SET) = 0
[pid 31266] read(12, \240\246E\32\r\21\367c\212\316Xn\177e'\310}\234\1\273`\371\266\247\r\1nj\332\32\221\26..., 512) = 512
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, iQ\35 \271O\203vj\ve[Ni}\355\263\272\4#yMo\266.\341\21\340Y5\204\20..., 4096, 1321851805696) = 4096
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31273] pread(12, unfinished ...
[pid 31267] lseek(12, 0, SEEK_END) = 1321851815424
[pid 31271] pread(12, unfinished ...
[pid 31267] lseek(12,
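Since the thread hinges on whether preadv is usable at build time, here is a minimal sketch (mine, not from the thread) of the syscall's semantics, using Python's os.preadv wrapper (available on Linux since Python 3.7) rather than the glibc C interface qemu actually uses: one positional read fills several buffers in a single call, without moving the file offset.

```python
import os
import tempfile

# create a small file to read from
fd, path = tempfile.mkstemp()
os.write(fd, b"abcdefgh")

# preadv: read from offset 4 into two 2-byte buffers in one syscall
buf1, buf2 = bytearray(2), bytearray(2)
n = os.preadv(fd, [buf1, buf2], 4)
print(n, bytes(buf1), bytes(buf2))  # 4 b'ef' b'gh'

os.close(fd)
os.remove(path)
```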
Re: Setting nx bit in virtual CPU
On 08/04/10 08:23, Avi Kivity wrote:

Strange. Can you hack qemu-kvm's cpuid code where it issues the ioctl KVM_SET_CPUID2 to show what the data is? I'm not sure where that code is in your version of qemu-kvm.

Gad, the last time I tried to mess around with this sort of low level code was many years ago, when I was a keen young bachelor burning the midnight oil trying to get the weird IDE controller on my Alpha to work properly! Anyway, I have tried to give it a go. I found a file called qemu-kvm-x86.c. It contained a function called kvm_setup_cpuid2, which I modified as follows:

int kvm_setup_cpuid2(CPUState *env, int nent,
                     struct kvm_cpuid_entry2 *entries)
{
    struct kvm_cpuid2 *cpuid;
    int r, i;

    fprintf(stderr, "cpuid=nent %d\n", nent);
    for (i = 0; i < nent; i++) {
        fprintf(stderr, "%x %x %x %x %x %x %x\n",
                entries[i].function, entries[i].index, entries[i].flags,
                entries[i].eax, entries[i].ebx,
                entries[i].ecx, entries[i].edx);
    }
    cpuid = qemu_malloc(sizeof(*cpuid) + nent * sizeof(*entries));
    cpuid->nent = nent;
    memcpy(cpuid->entries, entries, nent * sizeof(*entries));
    r = kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid);
    free(cpuid);
    return r;
}

So, basically I go round a loop and print out the contents of each kvm_cpuid_entry2 structure. Results below, using Andre Przywara's handy nano-kernel. I do hope that some of this makes some kind of sense!
qemu-kvm -kernel cpuid_mb -vnc :0

cpuid=nent 21 (one entry per line: function index flags eax ebx ecx edx)
4000 0 0 0 4b4d564b 564b4d56 4d
4001 0 0 7 0 0 0
0 0 0 4 68747541 444d4163 69746e65
1 0 0 623 800 80002001 78bfbfd
2 0 0 1 0 0 2c307d
3 0 0 0 0 0 0
4 0 1 121 1c0003f 3f 1
4 1 1 122 1c0003f 3f 1
4 2 1 143 3c0003f fff 1
4 3 1 0 0 0 0
8000 0 0 800a 68747541 444d4163 69746e65
8001 0 0 623 0 1 2181abfd
8002 0 0 554d4551 72695620 6c617574 55504320
8003 0 0 72657620 6e6f6973 312e3020 332e32
8004 0 0 0 0 0 0
8005 0 0 1ff01ff 1ff01ff 40020140 40020140
8006 0 0 0 42004200 2008140 0
8007 0 0 0 0 0 0
8008 0 0 3028 0 0 0
8009 0 0 0 0 0 0
800a 0 0 1 10 0 0

qemu-kvm -kernel cpuid_mb -cpu host -vnc :0

cpuid=nent 29 (one entry per line: function index flags eax ebx ecx edx)
4000 0 0 0 4b4d564b 564b4d56 4d
4001 0 0 7 0 0 0
0 0 0 1 68747541 444d4163 69746e65
1 0 0 40ff2 800 80002001 78bfbff
8000 0 0 8018 68747541 444d4163 69746e65
8001 0 0 40ff2 0 1 23c3fbff
8002 0 0 20444d41 6c687441 74286e6f 3620296d
8003 0 0 72502034 7365636f 20726f73 30303233
8004 0 0 2b 0 0 0
8005 0 0 1ff01ff 1ff01ff 40020140 40020140
8006 0 0 0 42004200 2008140 0
8007 0 0 0 0 0 0
8008 0 0 3028 0 0 0
8009 0 0 0 0 0 0
800a 0 0 1 10 0 0
800b 0 0 0 0 0 0
800c 0 0 0 0 0 0
800d 0 0 0 0 0 0
800e 0 0 0 0 0 0
800f 0 0 0 0 0 0
8010 0 0 0 0 0 0
8011 0 0 0 0 0 0
8012 0 0 0 0 0 0
8013 0 0 0 0 0 0
8014 0 0 0 0 0 0
8015 0 0 0 0 0 0
8016 0 0 0 0 0 0
8017 0 0 0 0 0 0
8018 0 0 0 0 0 0

If I try with -no-kvm then nothing gets printed, presumably because this is a kvm-specific function and doesn't get called in that case.
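As a sanity check on the dumps above (my own quick calculation, not part of the mail): the NX flag is advertised in bit 20 of EDX for CPUID leaf 0x80000001, and in both runs' 8001 entries that bit is clear, which is consistent with the guest not seeing NX.

```python
NX_BIT = 20  # EDX bit 20 of CPUID leaf 0x80000001 is the NX flag

def has_nx(edx):
    """Return True if the NX bit is set in a leaf-0x80000001 EDX value."""
    return bool((edx >> NX_BIT) & 1)

# EDX values for function 8001 from the two dumps above
print(has_nx(0x2181abfd))  # default run   -> False
print(has_nx(0x23c3fbff))  # -cpu host run -> False
```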
Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
On Mon, 2010-04-05 at 10:35 -0700, Sridhar Samudrala wrote:
On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:

Make vhost scalable by creating a separate vhost thread per vhost device. This provides better scaling across multiple guests and with multiple interfaces in a guest.

Thanks for looking into this. An alternative approach is to simply replace create_singlethread_workqueue with create_workqueue, which would get us a thread per host CPU. It seems that in theory this should be the optimal approach wrt CPU locality; however, in practice a single thread seems to get better numbers. I have a TODO to investigate this. Could you try looking into this?

Yes. I tried using create_workqueue(), but the results were not good, at least when the number of guest interfaces is less than the number of CPUs. I didn't try more than 8 guests. Creating a separate thread per guest interface seems to be more scalable based on the testing I have done so far. I will try some more tests and get some numbers to compare the following 3 options:

- single vhost thread
- vhost thread per cpu
- vhost thread per guest virtio interface

Here are the results with netperf TCP_STREAM 64K guest-to-host on an 8-cpu Nehalem system. It shows cumulative bandwidth in Mbps and host CPU utilization.
Current default single vhost thread
---
1 guest:  12500 37%
2 guests: 12800 46%
3 guests: 12600 47%
4 guests: 12200 47%
5 guests: 12000 47%
6 guests: 11700 47%
7 guests: 11340 47%
8 guests: 11200 48%

vhost thread per cpu
1 guest:   4900 25%
2 guests: 10800 49%
3 guests: 17100 67%
4 guests: 20400 84%
5 guests: 21000 90%
6 guests: 22500 92%
7 guests: 23500 96%
8 guests: 24500 99%

vhost thread per guest interface
1 guest:  12500 37%
2 guests: 21000 72%
3 guests: 21600 79%
4 guests: 21600 85%
5 guests: 22500 89%
6 guests: 22800 94%
7 guests: 24500 98%
8 guests: 26400 99%

Thanks
Sridhar
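One derived metric worth pulling out of the table (my calculation, not Sridhar's): bandwidth per percentage point of host CPU at 8 guests, which shows the per-device threading wins on efficiency as well as raw throughput.

```python
# (cumulative Mbps, host CPU %) at 8 guests, taken from the table above
configs = {
    "single thread":     (11200, 48),
    "thread per cpu":    (24500, 99),
    "thread per device": (26400, 99),
}

for name, (mbps, cpu) in configs.items():
    # throughput normalized by host CPU consumption
    print(f"{name}: {mbps / cpu:.1f} Mbps per % CPU")
```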
Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
Here are the results with netperf TCP_STREAM 64K guest to host on a 8-cpu Nehalem system.

I presume you mean 8-core Nehalem-EP, or did you mean 8-processor Nehalem-EX? Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it a lot! :-) But I find it incomplete, and I also like to run things like single-instance TCP_RR and multiple-instance, multiple-transaction (./configure --enable-burst) TCP_RR tests, particularly when concerned with scaling issues.

happy benchmarking,

rick jones

It shows cumulative bandwidth in Mbps and host CPU utilization.
[]
Re: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
On Tue, 6 Apr 2010 14:26:29 +0800 Xin, Xiaohui xiaohui@intel.com wrote:

How do you deal with the DoS problem of a hostile user space app posting a huge number of receives and never getting anything?

That's a problem we are trying to deal with. It's critical for the long term. Currently, we tried to limit the pages it can pin, but we're not sure how much is reasonable. For now, the buffers submitted come from the guest virtio-net driver, so it's safe to some extent just for now.

It is critical even now. Once you get past toy benchmarks you will see things like Java processes with 1000 threads all reading at once.