Re: Shouldn't cache=none be the default for drives?

2010-04-08 Thread Michael Tokarev

08.04.2010 09:07, Thomas Mueller wrote:
[]

This helped a lot:

I enabled the deadline block scheduler instead of the default cfq on the
host system. Tested with: host Debian with the deadline scheduler, guest
Win2008 with virtio and cache=none (a boost from 26 MB/s to 50 MB/s was measured).
Maybe this is also true for Linux/Linux.

I expect that the noop scheduler would be good for Linux guests.


Hmm.   I wonder why it helped.  In theory, the host scheduler should not
change anything for the cache=none case, at least for raw partitions or
LVM volumes.  This is because with cache=none, the virtual disk
image is opened with the O_DIRECT flag, which means all I/O bypasses the
host scheduler and buffer cache.
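For illustration, a minimal sketch of what cache=none implies on the host
side -- the device path and block size here are only examples:

    /* cache=none makes qemu open the image with O_DIRECT, so I/O skips the
     * host page cache and must use suitably aligned buffers. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("/dev/vg0/guest", O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
            return 1;
        pread(fd, buf, 4096, 0);   /* bypasses the page cache entirely */
        free(buf);
        close(fd);
        return 0;
    }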

I tried a few quick tests here -- with LVM volumes it makes no
measurable difference.  But if the guest disk images are
plain files (also raw), the scheduler makes some difference, and
indeed deadline works better.  Maybe you were testing with
plain files instead of block devices?

Thanks!

/mjt


Re: Shouldn't cache=none be the default for drives?

2010-04-08 Thread Thomas Mueller
On Thu, 08 Apr 2010 10:05:09 +0400, Michael Tokarev wrote:

 08.04.2010 09:07, Thomas Mueller wrote: []
 This helped a lot:

 I enabled deadline block scheduler instead of the default cfq on
 the host system. tested with: Host Debian with scheduler deadline,
 Guest Win2008 with Virtio and cache=none. (26MB/s to 50MB/s boost
 measured) Maybe this is also true for Linux/Linux.

 I expect that the noop scheduler would be good for Linux guests.
 
 Hmm.   I wonder why it helped.  In theory, host scheduler should not
 change anything for the cache=none case, at least for raw partitions or LVM
 volumes.  This is because with cache=none, the virtual disk image is
 opened with O_DIRECT flag, which means all I/O bypasses host scheduler
 and buffer cache.
 
 I tried a few quick tests here, -- with LVM volumes it makes no
 measurable difference.  But if the guest disk images are on plain files
 (also raw), scheduler makes some difference, and indeed deadline works
 better.  Maybe you were testing with plain files instead of block
 devices?

Ah yes, qcow2 images.

- Thomas



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura

Gleb Natapov wrote:

On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:

Avi Kivity wrote:

On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:


The problem here is that I needed to transfer the VM state from
just *before* the output to the devices. Otherwise, the VM state has
already advanced, and after failover, some I/O didn't work as I
expected.
I tracked down this issue and figured out that rip had already advanced
in KVM, so transferring this VCPU state was meaningless.

I'm planning to post the patch set of Kemari soon, but I would like to
solve this rip issue before that. If there is no drawback, I'm happy to
work on it and post a patch.


vcpu state is undefined when an mmio operation is pending,
Documentation/kvm/api.txt says the following:


NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
operations are complete (and guest state is consistent) only after
userspace
has re-entered the kernel with KVM_RUN. The kernel side will first finish
incomplete operations and then check for pending signals. Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.
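To make the rule concrete, here is a rough sketch of the userspace loop the
documentation is describing (error handling omitted; only fields actually
defined by the KVM API are used):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* 'run' is the mmap()ed kvm_run structure for this vcpu. */
    void vcpu_loop(int vcpu_fd, struct kvm_run *run)
    {
        for (;;) {
            ioctl(vcpu_fd, KVM_RUN, 0);     /* finishes any pending operation first */
            switch (run->exit_reason) {
            case KVM_EXIT_IO:
                /* emulate the port access using run->io.* and the bytes at
                 * (char *)run + run->io.data_offset; registers read here are
                 * not guaranteed consistent until the next KVM_RUN */
                break;
            case KVM_EXIT_MMIO:
                /* the same rule applies for run->mmio.* */
                break;
            }
        }
    }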


Thanks for the information.

So the point is that the vcpu state that can be observed from qemu upon
KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used,
because it's not complete/consistent?


Definitely. The VCPU is in the middle of an instruction's execution, so the
state is undefined. One instruction may generate more than one IO exit
during its execution, BTW.


Regarding the multiple IO exits, we're paying attention to that too.  Although it 
depends on the guest behavior, if we limit the device model, one IO exit per 
instruction may be practical at the beginning.  But thanks for pointing it out.


To solve the undefined VCPU state, how about keeping a copy of the initial state 
upon VMEXIT?  I guess there already is a similar shadow state in KVM.  If 
possible we can allocate another one for this purpose.



Re: Shouldn't cache=none be the default for drives?

2010-04-08 Thread Thomas Mueller
On Thu, 08 Apr 2010 06:09:05 +0000, Thomas Mueller wrote:

 On Thu, 08 Apr 2010 10:05:09 +0400, Michael Tokarev wrote:
 
 08.04.2010 09:07, Thomas Mueller wrote: []
 This helped a lot:

 I enabled deadline block scheduler instead of the default cfq on
 the host system. tested with: Host Debian with scheduler deadline,
 Guest Win2008 with Virtio and cache=none. (26MB/s to 50MB/s boost
 measured) Maybe this is also true for Linux/Linux.

 I expect that the noop scheduler would be good for Linux guests.
 
 Hmm.   I wonder why it helped.  In theory, host scheduler should not
 change anything for the cache=none case, at least for raw partitions or LVM
 volumes.  This is because with cache=none, the virtual disk image is
 opened with O_DIRECT flag, which means all I/O bypasses host scheduler
 and buffer cache.
 
 I tried a few quick tests here, -- with LVM volumes it makes no
 measurable difference.  But if the guest disk images are on plain files
 (also raw), scheduler makes some difference, and indeed deadline works
 better.  Maybe you were testing with plain files instead of block
 devices?
 
 ah yes, qcow2 images.

... but does the scheduler really know about O_DIRECT? Isn't O_DIRECT 
meant to bypass only buffers (i.e. the write doesn't return before it has 
really hit the disk)? My understanding is that the scheduler is a layer 
further down the stack. But I'm only guessing - I'm not a kernel hacker. :)

- Thomas  




Re: Question on skip_emulated_instructions()

2010-04-08 Thread Gleb Natapov
On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
 Currently we complete instructions for output operations and leave them
 incomplete for input operations. Deferring completion for output
 operations should work, except it may break the vmware backdoor port
 (see hw/vmport.c), which changes register state following an output
 instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
 following a write instruction.
 
 Do you really need to transfer the vcpu state before the instruction, or
 do you just need a consistent state? If the latter, then you can get
 away by posting a signal and re-entering the guest. kvm will complete
 the instruction and exit immediately, and you will have fully consistent
 state.
 
 The requirement is that the guest must always be able to replay at
 least the instruction which triggered the synchronization on the
 primary.  From that point of view, I think I need to transfer the
 vcpu state before the instruction.  If I post a signal and let the
 guest or emulator proceed, I'm not sure whether the guest on the
 secondary can replay as expected.  Please point it out if I'm
 misunderstanding.
All you need is some consistent state to restart the VM from, no? So if you
transfer the VM state after the instruction that caused the IO is completed, you
can restart the VM on the secondary from that state in case the primary fails. I
guess my question is: can you make the synchronization point be immediately after
the IO instruction instead of before it?
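Either way, the "post a signal and re-enter" mechanism quoted above is what
would bring the vcpu to such a consistent state. A rough sketch, where
SIG_SYNC is a hypothetical signal that is blocked normally and unmasked only
around KVM_RUN (e.g. via KVM_SET_SIGNAL_MASK), as qemu does for its own IPI:

    #include <signal.h>
    #include <pthread.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #define SIG_SYNC SIGUSR1   /* stand-in; any otherwise unused signal works */

    /* Issued by the thread that wants a synchronization point. */
    void request_sync(pthread_t vcpu_thread)
    {
        pthread_kill(vcpu_thread, SIG_SYNC);
    }

    /* Called on the vcpu thread: the kernel first completes any pending
     * mmio/pio, then sees the pending signal and exits immediately with
     * run->exit_reason == KVM_EXIT_INTR, leaving fully consistent state. */
    void reenter_once(int vcpu_fd)
    {
        ioctl(vcpu_fd, KVM_RUN, 0);
    }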

--
Gleb.


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity

On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:


The requirement is that the guest must always be able to replay at 
least the instruction which triggered the synchronization on the primary.



You have two choices:

 - complete execution of the instruction in both the kernel and the 
device model


This is what live migration does currently.  Any mmio and pio requests 
are completed, the last instruction is finalized, and state is saved.


 - complete execution of the instruction in the kernel, but queue 
execution of mmio/pio requests


This is more in line with what you describe.  The vcpu state will be after 
the instruction, and the device model state will be before instruction 
completion; when you replay the queue, the device model state will become 
consistent with the vcpu state.
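A rough sketch of what the queueing in the second option could look like (all
names here are illustrative, not existing qemu/kvm interfaces):

    #include <stdbool.h>
    #include <stdint.h>

    struct pending_io {
        uint64_t addr;          /* port number or physical address */
        uint8_t  data[8];
        unsigned len;
        bool     is_mmio;
    };

    static struct pending_io queue[64];
    static unsigned queued;

    /* vcpu completed the instruction, but the device model hasn't seen it yet */
    static void defer_io(const struct pending_io *io)
    {
        queue[queued++] = *io;
    }

    /* after the synchronization point, let the device model catch up */
    static void replay_io(void (*apply)(const struct pending_io *))
    {
        for (unsigned i = 0; i < queued; i++)
            apply(&queue[i]);
        queued = 0;
    }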


  From that point of view, I think I need to transfer the vcpu state 
before the instruction.  If I post a signal and let the guest or 
emulator proceed, I'm not sure whether the guest on the secondary can 
replay as expected.  Please point it out if I'm misunderstanding.


If the responses to the mmio or pio request are exactly the same, then 
the replay will happen exactly the same.


--
error compiling committee.c: too many arguments to function



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Gleb Natapov
On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote:
 On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:
 
 The requirement is that the guest must always be able to replay at
 least the instruction which triggered the synchronization on the
 primary.
 
 
 You have two choices:
 
  - complete execution of the instruction in both the kernel and the
 device model
 
 This is what live migration does currently.  Any mmio and pio
 requests are completed, the last instruction is finalized, and state
 is saved.
 
  - complete execution of the instruction in the kernel, but queue
 execution of mmio/pio requests
 
 This is more in line with what you describe.  vcpu state will be
 after the instruction, device model state will be before instruction
 completion, when you replay the queue, the device model state will
 be consistent with the vcpu state.
 
For an 'in' or mmio read you can't complete the instruction without doing the
actual IO.

   From that point of view, I think I need to transfer the vcpu
 state before the instruction.  If I post a signal and let the
 guest or emulator proceed, I'm not sure whether the guest on the
 secondary can replay as expected.  Please point it out if I'm
 misunderstanding.
 
 If the responses to the mmio or pio request are exactly the same,
 then the replay will happen exactly the same.
 
 -- 
 error compiling committee.c: too many arguments to function

--
Gleb.


Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Jan Kiszka
Avi Kivity wrote:
 On 04/07/2010 11:24 PM, Marcelo Tosatti wrote:
 During initialization, WinXP.32 switches to virtual-8086 mode, with
 paging enabled, to use VGABIOS functions.

 Since enter_pmode unconditionally clears IOPL and VM bits in RFLAGS

  flags = vmcs_readl(GUEST_RFLAGS);
  flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
  flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
  vmcs_writel(GUEST_RFLAGS, flags);


 
 
 Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?

Just like we manipulate the flags for guest debugging in the
set/get_rflags vendor handlers, the same should happen for IOPL and VM.
This is no business of enter_pmode/rmode.

 
 I think we have a small related bug in realmode emulation - we run the 
 guest with iopl=3.  This means the guest can use pushfl and see the host 
 iopl instead of the guest iopl.  We should run with iopl=0, which causes 
 pushfl/popfl to #GP, where we can emulate the flags correctly (by 
 updating rmode.save_iopl and rmode.save_vm).  That has lots of 
 implications however...
 
 
 And the order of loading state is set_regs (rflags) followed by
 set_sregs (cr0), these bits are lost across save/restore:

 savevm 1
 kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=33286
 system_reset
 loadvm 1
 kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=10286
 cont
 kvm: unhandled exit 8021
 kvm_run returned -22

 The following patch fixes it, but it has some drawbacks:

 - cpu_synchronize_state+writeback is noticeably slow with tpr patching,
this makes it slower.

 
 Isn't it a very rare event?

It has to be - otherwise the decision to go for full sync and individual
get/set IOCTL would have been wrong. What happens during tpr patching?

 
 - Should be conditional on VMX !unrestricted guest.

 
 Userspace should know nothing of this mess.
 
 - It's a fugly workaround.

 
 True.
 

Still likely the way to go for old kernels.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Setting nx bit in virtual CPU

2010-04-08 Thread Avi Kivity

On 04/08/2010 02:13 AM, Richard Simpson wrote:
   

gordon Code # ./check-nx
nx: enabled
gordon Code #

OK, seems to be enabled just fine.  Any other ideas?  I am beginning to
get that horrible feeling that there isn't a real problem and it is just
me being dumb!

   

I really hope so, because I am out of ideas... :)

Can you verify check-nx returns disabled on the guest?
Does /proc/cpuinfo show nx in the guest?

 

OK, time for a summary:

Host:  /proc/cpuinfo shows 'nx' and check-nx shows 'enabled'

Guest: /proc/cpuinfo doesn't show nx and check-nx shows 'disabled'
   


Strange.  Can you hack qemu-kvm's cpuid code where it issues the ioctl 
KVM_SET_CPUID2 to show what the data is?  I'm not sure where that code is in 
your version of qemu-kvm.


--
error compiling committee.c: too many arguments to function



Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Avi Kivity

On 04/08/2010 10:22 AM, Jan Kiszka wrote:

Avi Kivity wrote:
   

On 04/07/2010 11:24 PM, Marcelo Tosatti wrote:
 

During initialization, WinXP.32 switches to virtual-8086 mode, with
paging enabled, to use VGABIOS functions.

Since enter_pmode unconditionally clears IOPL and VM bits in RFLAGS

  flags = vmcs_readl(GUEST_RFLAGS);
  flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
  flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
  vmcs_writel(GUEST_RFLAGS, flags);


   


Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?
 

Just like we manipulate the flags for guest debugging in the
set/get_rflags vendor handlers, the same should happen for IOPL and VM.
This is no business of enter_pmode/rmode.
   


This is vendor specific code, and it isn't manipulating guest values, 
only host values (->set_rflags() is called when the guest value changes, 
which isn't happening here).  Of course some refactoring will be helpful 
here.



The following patch fixes it, but it has some drawbacks:

- cpu_synchronize_state+writeback is noticeably slow with tpr patching,
this makes it slower.

   

Isn't it a very rare event?
 

It has to be - otherwise the decision to go for full sync and individual
get/set IOCTL would have been wrong. What happens during tpr patching?

   


tpr patching listens for instructions which access the tpr and patches 
them to a call instruction (targeting some hacky code in the bios).  
Since there are a limited number of such instructions (20-30 IIRC) you 
expect tpr patching to happen very rarely.



- It's a fugly workaround.

   

True.

 

Still likely the way to go for old kernels.

   


It's a bugfix that can go into -stable and supported distribution kernels.

--
error compiling committee.c: too many arguments to function



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura

Gleb Natapov wrote:

On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:

Currently we complete instructions for output operations and leave them
incomplete for input operations. Deferring completion for output
operations should work, except it may break the vmware backdoor port
(see hw/vmport.c), which changes register state following an output
instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
following a write instruction.

Do you really need to transfer the vcpu state before the instruction, or
do you just need a consistent state? If the latter, then you can get
away by posting a signal and re-entering the guest. kvm will complete
the instruction and exit immediately, and you will have fully consistent
state.


The requirement is that the guest must always be able to replay at
least the instruction which triggered the synchronization on the
primary.  From that point of view, I think I need to transfer the
vcpu state before the instruction.  If I post a signal and let the
guest or emulator proceed, I'm not sure whether the guest on the
secondary can replay as expected.  Please point it out if I'm
misunderstanding.

All you need is some consistent state to restart the VM from, no? So if you
transfer the VM state after the instruction that caused the IO is completed, you
can restart the VM on the secondary from that state in case the primary fails. I
guess my question is: can you make the synchronization point be immediately after
the IO instruction instead of before it?


To answer your question, it should be possible to implement.
The downside is that after going into KVM to make the guest state 
consistent, we need to go back to qemu to actually transfer the guest state, and this 
bounce would introduce another overhead, if I'm understanding correctly.


And yes, all I need is some consistent state to resume the VM from, which must be 
able to continue I/O operations, like writing to disks and sending acks over the 
network.  If I can guarantee this, sending the VM state after completing output 
is acceptable.




Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity

On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote:


To answer your question, it should be possible to implement.
The downside is that after going into KVM to make the guest state 
consistent, we need to go back to qemu to actually transfer the guest state, 
and this bounce would introduce another overhead if I'm understanding 
correctly.


Yes.  It should be around a microsecond or so; given that you will issue I/O 
after this, I don't think it will affect performance.


And yes, all I need is some consistent state to resume the VM from, which 
must be able to continue I/O operations, like writing to disks and 
sending acks over the network. If I can guarantee this, sending the VM 
state after completing output is acceptable.




I suggest you start with this.  If it turns out performance is severely 
impacted, we can revisit instruction completion.  If performance is 
satisfactory, then we'll be able to run Kemari with older kernels.


--
error compiling committee.c: too many arguments to function



Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Jan Kiszka
Avi Kivity wrote:
 On 04/08/2010 10:22 AM, Jan Kiszka wrote:
 Avi Kivity wrote:

 On 04/07/2010 11:24 PM, Marcelo Tosatti wrote:
  
 During initialization, WinXP.32 switches to virtual-8086 mode, with
 paging enabled, to use VGABIOS functions.

 Since enter_pmode unconditionally clears IOPL and VM bits in RFLAGS

    flags = vmcs_readl(GUEST_RFLAGS);
    flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
    flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
    vmcs_writel(GUEST_RFLAGS, flags);



 Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?
  
 Just like we manipulate the flags for guest debugging in the
 set/get_rflags vendor handlers, the same should happen for IOPL and VM.
 This is no business of enter_pmode/rmode.

 
 This is vendor specific code, and it isn't manipulating guest values, 
  only host values (->set_rflags() is called when the guest value changes, 
 which isn't happening here).  Of course some refactoring will be helpful 
 here.

Actually, the bug is that enter_pmode/rmode update save_iopl (and that
no one saves the VM bit). That should happen in vmx_set_rflags to also
keep track of changes _while_ we are in rmode. enter_rmode/pmode should
just trigger a set_rflags to update things. And vmx_get_rflags must
properly inject the saved flags instead of masking them out.
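A sketch of that direction (field names follow the existing rmode.save_iopl
convention; save_vm and the exact real-mode condition are assumptions here,
not the final patch):

    static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
    {
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        if (vmx->rmode.vm86_active) {
            /* remember what the guest thinks it has ... */
            vmx->rmode.save_iopl = (rflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT;
            vmx->rmode.save_vm   = !!(rflags & X86_EFLAGS_VM);
            /* ... and run with what the real-mode emulation needs */
            rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM;
        }
        vmcs_writel(GUEST_RFLAGS, rflags);
    }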

 
 The following patch fixes it, but it has some drawbacks:

 - cpu_synchronize_state+writeback is noticeably slow with tpr patching,
 this makes it slower.


 Isn't it a very rare event?
  
 It has to be - otherwise the decision to go for full sync and individual
 get/set IOCTL would have been wrong. What happens during tpr patching?


 
 tpr patching listens for instructions which access the tpr and patches 
 them to a call instruction (targeting some hacky code in the bios).  
 Since there are a limited number of such instructions (20-30 IIRC) you 
 expect tpr patching to happen very rarely.

Then I wonder why it is noticeable.

 
  - It's a fugly workaround.


 True.

  
 Still likely the way to go for old kernels.


 
 It's a bugfix that can go into -stable and supported distribution kernels.

Well, I would be happy to throw out tons of workarounds based on this
approach. :)

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Avi Kivity

On 04/08/2010 10:54 AM, Jan Kiszka wrote:


   

Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?

 

Just like we manipulate the flags for guest debugging in the
set/get_rflags vendor handlers, the same should happen for IOPL and VM.
This is no business of enter_pmode/rmode.

   

This is vendor specific code, and it isn't manipulating guest values,
only host values (->set_rflags() is called when the guest value changes,
which isn't happening here).  Of course some refactoring will be helpful
here.
 

Actually, the bug is that enter_pmode/rmode update save_iopl (and that
no one saves the VM bit). That should happen in vmx_set_rflags to also
keep track of changes _while_ we are in rmode.


Exactly - that's what I suggested above.


enter_rmode/pmode should
just trigger a set_rflags to update things.


Not what I had in mind, but a valid implementation.


And vmx_get_rflags must
properly inject the saved flags instead of masking them out.
   


Yes.  No one ever bothers to play with iopl in real mode, so we never 
noticed this.  We do this for cr0 for example.



It's a bugfix that can go into -stable and supported distribution kernels.
 

Well, I would be happy to throw out tons of workarounds based on this
approach. :)
   


And I'll be happy to apply such patches.  Just ensure that 2.6.32.y and 
above have the fixes so we don't introduce regressions (I think most 
workarounds are a lot older).


--
error compiling committee.c: too many arguments to function



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura

Gleb Natapov wrote:

On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote:

On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:


The requirement is that the guest must always be able to replay at
least the instruction which triggered the synchronization on the
primary.



You have two choices:

  - complete execution of the instruction in both the kernel and the
device model

This is what live migration does currently.  Any mmio and pio
requests are completed, the last instruction is finalized, and state
is saved.

  - complete execution of the instruction in the kernel, but queue
execution of mmio/pio requests

This is more in line with what you describe.  vcpu state will be
after the instruction, device model state will be before instruction
completion, when you replay the queue, the device model state will
be consistent with the vcpu state.


For in or mmio read you can't complete instruction without doing
actual IO.


So, if the mmio/pio requests in the queue are only 'out' or mmio writes, Avi's 
suggestion No. 2 would work.  But if 'in' or mmio reads are mixed in with these, 
the story gets complicated.  (We don't have to consider a queue filled with only 
'in' or mmio reads, because we currently transfer only in the case of an 'out' 
or mmio write.)


   From that point of view, I think I need to transfer the vcpu
state before the instruction.  If I post a signal and let the
guest or emulator proceed, I'm not sure whether the guest on the
secondary can replay as expected.  Please point it out if I'm
misunderstanding.


If the responses to the mmio or pio request are exactly the same,
then the replay will happen exactly the same.


I agree.  What I'm wondering is how we can guarantee that the responses are the 
same...



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura

Avi Kivity wrote:

On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote:


To answer your question, it should be possible to implement.
The downside is that after going into KVM to make the guest state
consistent, we need to go back to qemu to actually transfer the guest state,
and this bounce would introduce another overhead if I'm understanding
correctly.


Yes. It should be around a microsecond or so; given that you will issue I/O
after this, I don't think it will affect performance.


That is good news.


And yes, all I need is some consistent state to resume the VM from, which
must be able to continue I/O operations, like writing to disks and
sending acks over the network. If I can guarantee this, sending the VM
state after completing output is acceptable.



I suggest you start with this. If it turns out performance is severely
impacted, we can revisit instruction completion. If performance is
satisfactory, then we'll be able to run Kemari with older kernels.


I was about to say yes here, but let me ask one more question.
BTW, thank you both for taking the time for this discussion, which isn't strictly a 
topic on KVM itself.


If I transferred a VM after I/O operations, let's say the VM sent a TCP ACK to 
the client, and if a hardware failure occurred on the primary during the VM 
transfer *but the client received the TCP ACK*, the secondary will resume 
from the previous state, and it may need to receive some data from the client. 
However, because the client has already received the TCP ACK, it won't resend the 
data to the secondary.  It looks like this data is going to be dropped.  Am I missing 
some point here?




Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity

On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote:


If I transferred a VM after I/O operations, let's say the VM sent a 
TCP ACK to the client, and if a hardware failure occurred on the 
primary during the VM transfer *but the client received the TCP 
ACK*, the secondary will resume from the previous state, and it may 
need to receive some data from the client. However, because the client 
has already received the TCP ACK, it won't resend the data to the 
secondary.  It looks like this data is going to be dropped.  Am I missing 
some point here?




I think you should block I/O not at the cpu/device boundary (that's 
inefficient as many cpu I/O instructions don't necessarily cause 
externally visible I/O) but at the device level.  Whenever the network 
device wants to send out a packet, halt the guest (letting any I/O 
instructions complete), synchronize the secondary, and then release the 
pending I/O.  This ensures that the secondary has all of the data prior 
to the ack being sent out.


--
error compiling committee.c: too many arguments to function



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity

On 04/08/2010 11:10 AM, Yoshiaki Tamura wrote:

If the responses to the mmio or pio request are exactly the same,
then the replay will happen exactly the same.



I agree.  What I'm wondering is how can we guarantee that the 
responses are the same...


I don't think you can in the general case.  But if you gate output at 
the device level, instead of the instruction level, the problem goes 
away, no?


--
error compiling committee.c: too many arguments to function



Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-08 Thread Michael S. Tsirkin
On Wed, Apr 07, 2010 at 02:07:18PM -0700, David Stevens wrote:
 kvm-ow...@vger.kernel.org wrote on 04/07/2010 11:09:30 AM:
 
  On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote:

Thanks!
There's some whitespace damage, are you sending with your new
sendmail setup? It seems to have worked for qemu patches ...
   
   Yes, I saw some line wraps in what I received, but I checked
   the original draft to be sure and they weren't there. Possibly from
   the relay; Sigh.
   
   
 @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
/* TODO: Check specific error and bomb out unless ENOBUFS? 
 */
   err = sock->ops->sendmsg(NULL, sock, &msg, len);
   if (unlikely(err < 0)) {
 - vhost_discard_vq_desc(vq);
 - tx_poll_start(net, sock);
 + if (err == -EAGAIN) {
 +vhost_discard_desc(vq, 1);
 +tx_poll_start(net, sock);
 + } else {
 +vq_err(vq, "sendmsg: errno %d\n", -err);
 +/* drop packet; do not discard/resend */
 +vhost_add_used_and_signal(&net->dev, vq, head,
 +   0);

vhost does not currently have a consistent error handling strategy:
if we drop packets, we need to think about which other errors should cause
packet drops.  I prefer to just call vq_err for now,
and have us look at handling segfaults etc. in a consistent way
separately.
   
   I had to add this to avoid an infinite loop when I wrote a bad
   packet on the socket. I agree error handling needs a better look,
   but retrying a bad packet continuously while dumping in the log
   is what it was doing when I hit an error before this code. Isn't
   this better, at least until a second look?
   
  
  Hmm, what do you mean 'continuously'? Don't we only try again
  on next kick?
 
 If the packet is corrupt (in my case, a missing vnet header
 during testing), every send will fail and we never make progress.
 I had thousands of error messages in the log (for the same packet)
 before I added this code.

Hmm, we do not want a buggy guest to be able to fill
host logs. This is only if debugging is enabled though, right?
We can also rate-limit the errors.
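Such rate limiting could take roughly this shape (sketch only, reusing the
calls already present in the hunk above; net_ratelimit() is the usual kernel
helper for bounding log output):

    if (unlikely(err < 0)) {
            /* drop the packet, but bound how often the error hits the log */
            if (net_ratelimit())
                    vq_err(vq, "sendmsg: errno %d\n", -err);
            vhost_add_used_and_signal(&net->dev, vq, head, 0);
    }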

 If the problem is with the packet,
 retrying the same one as the original code does will never recover.
 This isn't required for mergeable rx buffer support, so I
 can certainly remove it from this patch, but I think the original
 error handling doesn't handle a single corrupted packet very
 gracefully.
 

An approach I considered was to have qemu poll vq_err fd
and stop the device when an error is seen. My concern with
dropping a tx packet is that it would make debugging
very difficult.


 @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
  vq_log = unlikely(vhost_has_feature(&net->dev, 
  VHOST_F_LOG_ALL)) ?
   vq->log : NULL;
 
 -   for (;;) {
 -  head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -ARRAY_SIZE(vq->iov),
 -&out, &in,
 -vq_log, &log);
 +   while ((datalen = vhost_head_len(sock->sk))) {
 +  headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
 +vq_log, &log);

This looks like a bug, I think we need to pass
datalen + header size to vhost_get_desc_n.
Not sure how we know the header size that backend will use though.
Maybe just look at our features.
   
   Yes; we have hdr_size, so I can add it here. It'll be 0 for
   the cases where the backend and guest both have vnet header (either
   the regular or larger mergeable buffers one), but should be added
   in for future raw socket support.
  
  So hdr_size is the wrong thing to add then.
  We need to add non-zero value for tap now.
 
 datalen includes the vnet_hdr in the tap case, so we don't need
 a non-zero hdr_size. The socket data has the entire packet and vnet_hdr
 and that length is what we're getting from vhost_head_len().

I only see vhost_head_len returning skb->len. You are sure skb->len
includes vnet_hdr for tap rx?

  

/* OK, now we need to know about added descriptors. */
 -  if (head == vq->num) {
 - if (unlikely(vhost_enable_notify(vq))) {
 +  if (!headcount) {
 + if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
  /* They have slipped one in as we were
   * doing that: check again. */
  vhost_disable_notify(vq);
 +retries++;
  continue;
   }

Hmm. The reason we have the code at all, as the comment says, is because
the guest could have added more buffers between the time we read the last
index and the time we enabled notification. So if we just break like this
the race still exists. We could remember the
last head value we observed, and have 

Re: Setting nx bit in virtual CPU

2010-04-08 Thread Andre Przywara

Avi Kivity wrote:

On 04/07/2010 11:38 PM, Richard Simpson wrote:

On 07/04/10 13:23, Avi Kivity wrote:

Run as root, please.  And check first that you have a file named
/dev/cpu/0/msr.
 

Doh!

gordon Code # ./check-nx
nx: enabled
gordon Code #

OK, seems to be enabled just fine.  Any other ideas?  I am beginning to
get that horrible feeling that there isn't a real problem and it is just
me being dumb!
   


I really hope so, because I am out of ideas... :)

Can you verify check-nx returns disabled on the guest?
Does /proc/cpuinfo show nx in the guest?


Can you try to boot the attached multiboot kernel, which just outputs 
a brief CPUID dump?

$ qemu-kvm -kernel cpuid_mb -vnc :0
(Unfortunately I have no serial console support in there yet, so you 
either have to write the values down or screenshot it).

In the 4th line from the bottom it should print NX (after SYSCALL).
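For reference, the equivalent check from a Linux guest or host is just CPUID
leaf 0x80000001, EDX bit 20 -- e.g. this quick userspace sketch:

    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        __asm__ volatile("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(0x80000001));
        printf("NX: %s\n", (edx & (1u << 20)) ? "present" : "absent");
        return 0;
    }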

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12


cpuid_mb
Description: Binary data


Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-08 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

---
Michael,
This is a small patch for the write logging issue with the async queue.
I have made a __vhost_get_vq_desc() function which can compute the log
info with any valid buffer index. The __vhost_get_vq_desc() code is 
taken from vhost_get_vq_desc().
And I use it to recompute the log info when logging is enabled.

Thanks
Xiaohui

 drivers/vhost/net.c   |   27 ---
 drivers/vhost/vhost.c |  115 -
 drivers/vhost/vhost.h |5 ++
 3 files changed, 90 insertions(+), 57 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2aafd90..00a45ef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -115,7 +115,8 @@ static void handle_async_rx_events_notify(struct vhost_net 
*net,
struct kiocb *iocb = NULL;
struct vhost_log *vq_log = NULL;
int rx_total_len = 0;
-   int log, size;
+   unsigned int head, log, in, out;
+   int size;
 
 if (vq->link_state != VHOST_VQ_LINK_ASYNC)
 return;
 @@ -130,14 +131,25 @@ static void handle_async_rx_events_notify(struct 
 vhost_net *net,
 iocb->ki_pos, iocb->ki_nbytes);
 log = (int)iocb->ki_user_data;
 size = iocb->ki_nbytes;
 +   head = iocb->ki_pos;
 rx_total_len += iocb->ki_nbytes;
 
 if (iocb->ki_dtor)
 iocb->ki_dtor(iocb);
 kmem_cache_free(net->cache, iocb);
 
-   if (unlikely(vq_log))
+   /* when log is enabled, recomputing the log info is needed,
+* since these buffers are in async queue, and may not get
+* the log info before.
+*/
 +   if (unlikely(vq_log)) {
 +   if (!log)
 +   __vhost_get_vq_desc(&net->dev, vq, vq->iov,
 +   ARRAY_SIZE(vq->iov),
 +   &out, &in, vq_log,
 +   &log, head);
 vhost_log_write(vq, vq_log, log, size);
 +   }
 if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
 vhost_poll_queue(&vq->poll);
 break;
 @@ -313,14 +325,13 @@ static void handle_rx(struct vhost_net *net)
 vhost_disable_notify(vq);
 hdr_size = vq->hdr_size;
 
-   /* In async cases, for write logging, the simple way is to get
-* the log info always, and really logging is decided later.
-* Thus, when logging enabled, we can get log, and when logging
-* disabled, we can get log disabled accordingly.
+   /* In async cases, when write log is enabled, in case the submitted
+* buffers did not get log info before the log enabling, so we'd
+* better recompute the log info when needed. We do this in
+* handle_async_rx_events_notify().
 */
 
 -   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
 -   (vq->link_state == VHOST_VQ_LINK_ASYNC) ?
 +   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 vq->log : NULL;
 
handle_async_rx_events_notify(net, vq);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 97233d5..53dab80 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -715,66 +715,21 @@ static unsigned get_indirect(struct vhost_dev *dev, 
struct vhost_virtqueue *vq,
return 0;
 }
 
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access.  Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
   struct iovec iov[], unsigned int iov_size,
   unsigned int *out_num, unsigned int *in_num,
-  struct vhost_log *log, unsigned int *log_num)
+  struct vhost_log *log, unsigned int *log_num,
+  unsigned int head)
 {
struct vring_desc desc;
-   unsigned int i, head, found = 0;
-   u16 last_avail_idx;
+   unsigned int i = head, found = 0;
int ret;
 
-   /* Check it isn't doing very strange things with descriptor numbers. */
 -   last_avail_idx = vq->last_avail_idx;
 -   if (get_user(vq->avail_idx, &vq->avail->idx)) {
 -   vq_err(vq, "Failed to access avail idx at %p\n",
 -  &vq->avail->idx);
-   return 

Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura

Avi Kivity wrote:
 On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote:

 If I transferred a VM after I/O operations, let's say the VM sent a
 TCP ACK to the client, and if a hardware failure occurred on the
 primary during the VM transfer *but the client received the TCP
 ACK*, the secondary will resume from the previous state, and it may
 need to receive some data from the client. However, because the client
 has already received the TCP ACK, it won't resend the data to the
 secondary. It looks like this data is going to be dropped. Am I missing
 some point here?


 I think you should block I/O not at the cpu/device boundary (that's
 inefficient as many cpu I/O instructions don't necessarily cause
 externally visible I/O) but at the device level. Whenever the network
 device wants to send out a packet, halt the guest (letting any I/O
 instructions complete), synchronize the secondary, and then release the
 pending I/O. This ensures that the secondary has all of the data prior
 to the ack being sent out.

Although I was thinking of cleaning up my current code, maybe I should post the 
current status for explanation now.  As you mentioned, I'm capturing I/O at the 
device level, by inserting a hook inside the PIO/MMIO handlers of the virtio-blk, 
virtio-net and e1000 emulation.  Since it's implemented naively, it'll stop 
(meaning I/O instructions will be delayed) until transferring the VM is done.

So what I can do here is:

1. Let I/O instructions complete both in qemu and kvm.
2. Transfer the guest state.
# The VCPU and device model think the I/O emulation is already done.
3. Finally, release the pending output to the real world.


If the responses to the mmio or pio request are exactly the same,
then the replay will happen exactly the same.



I agree. What I'm wondering is how can we guarantee that the responses
are the same...


I don't think you can in the general case. But if you gate output at the
device level, instead of the instruction level, the problem goes away, no?


Yes, it should.
To implement this, we need to make No. 3 be called asynchronously.  If qemu is 
already handling I/O asynchronously, it would be relatively easy to do.



Re: Shouldn't cache=none be the default for drives?

2010-04-08 Thread Christoph Hellwig
On Thu, Apr 08, 2010 at 10:05:09AM +0400, Michael Tokarev wrote:
 LVM volumes.  This is because with cache=none, the virtual disk
 image is opened with O_DIRECT flag, which means all I/O bypasses
 host scheduler and buffer cache.

O_DIRECT does not bypass the I/O scheduler, only the page cache.
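So the elevator choice still matters for O_DIRECT traffic; switching it per
device is just a sysfs write (sda here is only an example device name):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");

        if (!f)
            return 1;
        fputs("deadline\n", f);  /* same effect as: echo deadline > .../scheduler */
        fclose(f);
        return 0;
    }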



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity

On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote:

I don't think you can in the general case. But if you gate output at the
device level, instead of the instruction level, the problem goes 
away, no?


Yes, it should.
To implement this, we need to make No. 3 be called asynchronously.  
If qemu is already handling I/O asynchronously, it would be relatively 
easy to do.


Yes, you can release the I/O from the iothread instead of the vcpu 
thread.  You can make virtio_net_handle_tx() disable virtio 
notifications and initiate state sync and return, when state sync 
continues you can call the original virtio_net_handle_tx().  If the 
secondary takes over, it needs to call the original 
virtio_net_handle_tx() as well.
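A rough sketch of that flow (the kemari_* names are hypothetical;
virtio_net_handle_tx() and virtio_queue_set_notification() are the existing
qemu hooks being referred to):

    /* Kemari-side wrapper: gate the outgoing packet on a completed sync. */
    static void kemari_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
    {
        virtio_queue_set_notification(vq, 0);   /* stop further tx kicks       */
        kemari_start_sync(vdev);                /* transfer state to secondary */
    }

    /* Called from the iothread when the sync (or failover) has finished. */
    static void kemari_sync_done(VirtIODevice *vdev, VirtQueue *vq)
    {
        virtio_net_handle_tx(vdev, vq);         /* release the pending output  */
        virtio_queue_set_notification(vq, 1);
    }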


--
error compiling committee.c: too many arguments to function



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura
2010/4/8 Avi Kivity a...@redhat.com:
 On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote:

 I don't think you can in the general case. But if you gate output at the
 device level, instead of the instruction level, the problem goes away,
 no?

 Yes, it should.
 To implement this, we need to make No. 3 be called asynchronously.  If
 qemu is already handling I/O asynchronously, it would be relatively easy to
 do.

 Yes, you can release the I/O from the iothread instead of the vcpu thread.
  You can make virtio_net_handle_tx() disable virtio notifications and
 initiate state sync and return, when state sync continues you can call the
 original virtio_net_handle_tx().  If the secondary takes over, it needs to
 call the original virtio_net_handle_tx() as well.

Agreed.  Let me try it.
Meanwhile, I'll post what I have done, including the hack preventing
rip from proceeding.
I would appreciate it if you could comment on that too, to keep things going in
a good direction.


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity

On 04/08/2010 04:42 PM, Yoshiaki Tamura wrote:



Yes, you can release the I/O from the iothread instead of the vcpu thread.
  You can make virtio_net_handle_tx() disable virtio notifications and
initiate state sync and return, when state sync continues you can call the
original virtio_net_handle_tx().  If the secondary takes over, it needs to
call the original virtio_net_handle_tx() as well.
 

Agreed.  Let me try it.
Meanwhile, I'll post what I have done, including the hack preventing
rip from proceeding.
I would appreciate it if you could comment on that too, to keep things going in
a good direction.
   


Certainly.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/3] KVM Test: Add function run_autotest_background and wait_autotest_background.

2010-04-08 Thread Michael Goldish
On 04/07/2010 11:49 AM, Feng Yang wrote:
 Add the functions run_autotest_background and wait_autotest_background to
 kvm_test_utils.py.  These two functions are used in the ioquit test script.
 
 Signed-off-by: Feng Yang fy...@redhat.com
 ---
  client/tests/kvm/kvm_test_utils.py |   68 
 +++-
  1 files changed, 67 insertions(+), 1 deletions(-)
 
 diff --git a/client/tests/kvm/kvm_test_utils.py 
 b/client/tests/kvm/kvm_test_utils.py
 index f512044..2a1054e 100644
 --- a/client/tests/kvm/kvm_test_utils.py
 +++ b/client/tests/kvm/kvm_test_utils.py
 @@ -21,7 +21,7 @@ More specifically:
  @copyright: 2008-2009 Red Hat Inc.
  
  
 -import time, os, logging, re, commands
 +import time, os, logging, re, commands, sys
  from autotest_lib.client.common_lib import error
  from autotest_lib.client.bin import utils
  import kvm_utils, kvm_vm, kvm_subprocess, scan_results
 @@ -402,3 +402,69 @@ def run_autotest(vm, session, control_path, timeout, 
 test_name, outputdir):
  result = bad_results[0]
  raise error.TestFail(Test '%s' ended with %s (reason: '%s')
   % (result[0], result[1], result[3]))
 +
 +
 +def run_autotest_background(vm, session, control_path, timeout, test_name,
 +outputdir):
 +
  +Wrapper around run_autotest() that makes it run in the background through 
  fork()
  +and lets it run in the child process.
  +1) Flush the stdio.
  +2) Build test params which are received from arguments and used by
  +   run_autotest()
  +3) Fork the process and let run_autotest() run in the child
  +4) Catch the exceptions raised by run_autotest() and exit the child with a
  +   non-zero return code.
  +5) If no exception is caught, return 0
 +
 +@param vm: VM object.
 +@param session: A shell session on the VM provided.
 +@param control: An autotest control file.
 +@param timeout: Timeout under which the autotest test must complete.
 +@param test_name: Autotest client test name.
 +@param outputdir: Path on host where we should copy the guest autotest
 +results to.
 +
 +
 +def flush():
 +sys.stdout.flush()
 +sys.stderr.flush()
 +
  +logging.info("Running autotest background ...")
 +flush()
 +pid = os.fork()
 +if pid:
 +# Parent process
 +return pid
 +
 +try:
 +# Launch autotest
  +logging.info("child process of run_autotest_background")
 +run_autotest(vm, session, control_path, timeout, test_name, 
 outputdir)
 +except error.TestFail, message_fail:
  +logging.info("[Autotest Background FAIL] %s" % message_fail)
  +os._exit(1)
  +except error.TestError, message_error:
  +logging.info("[Autotest Background ERROR] %s" % message_error)
  +os._exit(2)
  +except:
  +os._exit(3)
  +
  +logging.info("[Autotest Background GOOD]")
 +os._exit(0)
 +
 +
 +def wait_autotest_background(pid):
 +
  +Wait for the background autotest to finish.
  +
  +@param pid: Pid of the child process executing the background autotest
  +
  +logging.info("Waiting for background autotest to finish ...")
 +
 +(pid, s) = os.waitpid(pid,0)
 +status = os.WEXITSTATUS(s)
 +if status != 0:
 +return False
 +return True
 +

I think these functions are unnecessary.  IMO forking is not the clean
way of running autotest in the background.  The kvm_shell_session
object, used to run autotest in the guest, by default runs things in the
background (e.g. session.sendline() returns immediately).
run_autotest(), which uses kvm_shell_session, blocks until the autotest
test is done.  So in order to run autotest in the background, we should
modify run_autotest(), or break it up into smaller parts, to make it
nonblocking.  There's no need to implement yet another wrapper.


Re: [RFC] Unify KVM kernel-space and user-space code into a single project

2010-04-08 Thread Antoine Martin
Avi Kivity wrote:
 On 03/24/2010 06:40 PM, Joerg Roedel wrote:

 Looks trivial to find a guest, less so with enumerating (still doable).
  
 Not so trivial and even more likely to break. Even if perf has the pid of
 the process and wants to find the directory, it has to:

 1. Get the uid of the process
 2. Find the username for the uid
 3. Use the username to find the home-directory

 Steps 2. and 3. need nsswitch and/or pam access to get this information
 from whatever source the admin has configured. And depending on what the
 source is it may be temporarily unavailable causing nasty timeouts. In
 short, there are many weak parts in that chain making it more likely to
 break.

 
 It's true.  If the kernel provides something, there are fewer things
 that can break.  But if your system is so broken that you can't resolve
 uids, fix that before running perf.  Must we design perf for that case?
uid to username can fail when using chroots, or worse, point to an
incorrect location (and yes, I do use this)

Sorry if this has been covered / discussion has moved on. Just catching
up with the 500+ messages in my inbox..

Antoine


 
 After all, 'ls -l' will break under the same circumstances.  It's hard
 to imagine doing useful work when that doesn't work.
 
 A kernel-based approach with /proc/pid/kvm does not have those issues
 (and to repeat myself, it is independent from the userspace being used).

 
 It has other issues, which are IMO more problematic.
 


Re: [PATCH 3/3] KVM Test: Add ioquit test case

2010-04-08 Thread Michael Goldish
On 04/07/2010 11:49 AM, Feng Yang wrote:
 Signed-off-by: Feng Yang fy...@redhat.com
 ---
  client/tests/kvm/tests/ioquit.py   |   54 
 
  client/tests/kvm/tests_base.cfg.sample |4 ++
  2 files changed, 58 insertions(+), 0 deletions(-)
  create mode 100644 client/tests/kvm/tests/ioquit.py
 
 diff --git a/client/tests/kvm/tests/ioquit.py 
 b/client/tests/kvm/tests/ioquit.py
 new file mode 100644
 index 000..c75a0e3
 --- /dev/null
 +++ b/client/tests/kvm/tests/ioquit.py
 @@ -0,0 +1,54 @@
 +import logging, time, random, signal, os
 +from autotest_lib.client.common_lib import error
 +import kvm_test_utils, kvm_utils
 +
 +
 +def run_ioquit(test, params, env):
 +
  +Emulate the poweroff under IO workload (dbench so far) using the monitor
 +command 'quit'.
 +
 +@param test: Kvm test object
 +@param params: Dictionary with the test parameters.
 +@param env: Dictionary with test environment.
 +

- Can you explain the goal of this test?  Why quit a VM under IO
workload?  What results do you expect?  How can the test ever fail?

- Why is dbench any better for this purpose than dd or some other simple
command?  Using dbench isn't necessarily bad, I'm just curious.

  +vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
  +session = kvm_test_utils.wait_for_login(vm,
  +  timeout=int(params.get("login_timeout", 360)))
  +session2 = kvm_test_utils.wait_for_login(vm,
  +  timeout=int(params.get("login_timeout", 360)))
 +def is_autotest_launched():
  +if session.get_command_status("pgrep autotest") != 0:
  +logging.debug("Autotest process not found")
 +return False
 +return True
 +
  +test_name = params.get("background_test", "dbench")
  +control_file = params.get("control_file", "dbench.control")
  +timeout = int(params.get("test_timeout", 300))
  +control_path = os.path.join(test.bindir, "autotest_control",
  +control_file)
 +outputdir = test.outputdir
 +
 +pid = kvm_test_utils.run_autotest_background(vm, session2, control_path,
 + timeout, test_name,
 + outputdir)

As mentioned in the other message, I don't think it's necessary to fork
and use run_autotest() in a separate process.  Instead we should modify
run_autotest() to support non-blocking operation (if we need that at all).

  +if pid < 0:
  +raise error.TestError("Could not create child process to execute "
  +  "autotest background")
 +
 +if kvm_utils.wait_for(is_autotest_launched, 240, 0, 2):
  +logging.debug("Background autotest started successfully")
  +else:
  +logging.debug("Background autotest failed, starting the test anyway")
 +
 +time.sleep(100 + random.randrange(0,100))
  +logging.info("Kill the virtual machine")
 +vm.process.close()

This will do a 'kill -9' on the qemu process.  Didn't you intend to use
a 'quit'?  To do that, you should use vm.destroy(gracefully=False).

  +logging.info("Kill the tracking process")
 +kvm_utils.safe_kill(pid, signal.SIGKILL)
 +kvm_test_utils.wait_autotest_background(pid)
 +session.close()
 +session2.close()
 +
 diff --git a/client/tests/kvm/tests_base.cfg.sample 
 b/client/tests/kvm/tests_base.cfg.sample
 index 9b12fc2..d8530f6 100644
 --- a/client/tests/kvm/tests_base.cfg.sample
 +++ b/client/tests/kvm/tests_base.cfg.sample
 @@ -305,6 +305,10 @@ variants:
  - ksm_parallel:
  ksm_mode = parallel
  
 +- ioquit:
 +type = ioquit
 +control_file = dbench.control.200
 +background_test = dbench

You should probably add extra_params += " -snapshot" because this test
can break the filesystem.

  # system_powerdown, system_reset and shutdown *must* be the last ones
  # defined (in this order), since the effect of such tests can leave
  # the VM on a bad state.


Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Marcelo Tosatti
On Thu, Apr 08, 2010 at 11:05:56AM +0300, Avi Kivity wrote:
 On 04/08/2010 10:54 AM, Jan Kiszka wrote:
 
 Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?
 
 Just like we manipulate the flags for guest debugging in the
 set/get_rflags vendor handlers, the same should happen for IOPL and VM.
 This is no business of enter_pmode/rmode.
 
 This is vendor specific code, and it isn't manipulating guest values,
  only host values (->set_rflags() is called when the guest value changes,
 which isn't happening here).  Of course some refactoring will be helpful
 here.
 Actually, the bug is that enter_pmode/rmode update save_iopl (and that
 no one saves the VM bit). That should happen in vmx_set_rflags to also
 keep track of changes _while_ we are in rmode.
 
 Exactly - that's what I suggested above.

And new ioctl to save/restore save_iopl/save_vm.

 enter_rmode/pmode should
 just trigger a set_rflags to update things.
 
 Not what I had in mind, but a valid implementation.
 
 And vmx_get_rflags must
 properly inject the saved flags instead of masking them out.
 
 Yes.  No one ever bothers to play with iopl in real mode, so we
 never noticed this.  We do this for cr0 for example.
 
 It's a bugfix that can go into -stable and supported distribution kernels.
 Well, would be happy to throw out tons of workarounds based on this
 approach. :)

Do you mean you'd be interested in writing the patch? Sure, go ahead,
let me know otherwise.

 And I'll be happy to apply such patches.  Just ensure that 2.6.32.y
 and above have the fixes so we don't introduce regressions (I think
 most workarounds are a lot older).
 
 -- 
 error compiling committee.c: too many arguments to function


Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Avi Kivity

On 04/08/2010 05:16 PM, Marcelo Tosatti wrote:

On Thu, Apr 08, 2010 at 11:05:56AM +0300, Avi Kivity wrote:
   

On 04/08/2010 10:54 AM, Jan Kiszka wrote:
 
   

Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?

 

Just like we manipulate the flags for guest debugging in the
set/get_rflags vendor handlers, the same should happen for IOPL and VM.
This is no business of enter_pmode/rmode.

   

This is vendor specific code, and it isn't manipulating guest values,
only host values (->set_rflags() is called when the guest value changes,
which isn't happening here).  Of course some refactoring will be helpful
here.
 

Actually, the bug is that enter_pmode/rmode update save_iopl (and that
no one saves the VM bit). That should happen in vmx_set_rflags to also
keep track of changes _while_ we are in rmode.
   

Exactly - that's what I suggested above.
 

And new ioctl to save/restore save_iopl/save_vm.
   


That ioctl already exists, KVM_{GET,SET}_REGS.

We're writing via KVM_SET_SREGS eflags.vm=1 and eflags.iopl=3 while 
cr0.pe=0.  vmx_set_rflags() notices this and sets rmode.save_vm=1 and 
rmode.save_iopl=3.  Next we write via KVM_SET_SREGS cr0.pe=1.  So we 
call enter_pmode(), and recover eflags.vm and eflags.iopl from rmode.vm 
and rmode.iopl.  Win!


It's similar to how we handle cr0.ts, sometimes the host owns it so we 
keep it in a shadow register, sometimes the guest owns it so we keep it 
in cr0.
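
(Editor's sketch of that shadowing, for readers following along: the field name
rmode.save_rflags below is illustrative, standing in for the existing save_iopl
plus a saved VM bit, and this is not the exact patch that was later merged.)

static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
{
        unsigned long rflags = vmcs_readl(GUEST_RFLAGS);

        if (to_vmx(vcpu)->rmode.vm86_active)
                /* report the bits the guest/userspace asked for, not the
                 * VM/IOPL we forced in order to run real mode as vm86 */
                rflags = (rflags & ~(X86_EFLAGS_VM | X86_EFLAGS_IOPL)) |
                         (to_vmx(vcpu)->rmode.save_rflags &
                          (X86_EFLAGS_VM | X86_EFLAGS_IOPL));
        return rflags;
}

static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
{
        if (to_vmx(vcpu)->rmode.vm86_active) {
                to_vmx(vcpu)->rmode.save_rflags = rflags;  /* shadow copy */
                rflags |= X86_EFLAGS_VM | X86_EFLAGS_IOPL; /* run as vm86 */
        }
        vmcs_writel(GUEST_RFLAGS, rflags);
}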



It's a bugfix that can go into -stable and supported distribution kernels.
 

Well, would be happy to throw out tons of workarounds based on this
approach. :)
   

Do you mean you'd be interested in writing the patch? Sure, go ahead,
let me know otherwise.
   


I took it to mean he wants to kill the other qemu workarounds for kernel 
bugs.


--
error compiling committee.c: too many arguments to function



Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Marcelo Tosatti
On Thu, Apr 08, 2010 at 09:54:35AM +0200, Jan Kiszka wrote:
  The following patch fixes it, but it has some drawbacks:
 
  - cpu_synchronize_state+writeback is noticeably slow with tpr patching,
  this makes it slower.
 
 
  Isn't it a very rare event?
   
  It has to be - otherwise the decision to go for full sync and individual
  get/set IOCTL would have been wrong. What happens during tpr patching?
 
 
  
  tpr patching listens for instructions which access the tpr and patches 
  them to a call instruction (targeting some hacky code in the bios).  
  Since there are a limited number of such instructions (20-30 IIRC) you 
  expect tpr patching to happen very rarely.
 
 Then I wonder why it is noticeable.

While switching kvm-tpr-opt.c from explicit {get,put}_{s}regs
to cpu_synchronize_state+writeback I noticed WinXP.32 boot
became visually slower. For some reason, the delay introduced by
cpu_synchronize_state+writeback forbids patching certain instructions
for longer periods, or somehow allows Windows to use unpatched
instructions more often /guess. End result was 4x more patching (from
700 to 4000, roughly). Confirmed it was a timing issue by introducing
delays to original {get,put}_{s}regs version.

The particular tpr case is no big deal since as mentioned its a short
lived period, but for things like Kemari this might be an issue. But 
this is another discussion.



Re: VMX and save/restore guest in virtual-8086 mode

2010-04-08 Thread Jan Kiszka
Marcelo Tosatti wrote:
 On Thu, Apr 08, 2010 at 11:05:56AM +0300, Avi Kivity wrote:
 On 04/08/2010 10:54 AM, Jan Kiszka wrote:
 Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?

 Just like we manipulate the flags for guest debugging in the
 set/get_rflags vendor handlers, the same should happen for IOPL and VM.
 This is no business of enter_pmode/rmode.

 This is vendor specific code, and it isn't manipulating guest values,
 only host values (->set_rflags() is called when the guest value changes,
 which isn't happening here).  Of course some refactoring will be helpful
 here.
 Actually, the bug is that enter_pmode/rmode update save_iopl (and that
 no one saves the VM bit). That should happen in vmx_set_rflags to also
 keep track of changes _while_ we are in rmode.
 Exactly - that's what I suggested above.
 
 And new ioctl to save/restore save_iopl/save_vm.

Not need. The information will all be contained in eflags and cr0 as
returned to userspace. The bug is that the wrong information is
currently returned, thus saved/restored.

 
 enter_rmode/pmode should
 just trigger a set_rflags to update things.
 Not what I had in mind, but a valid implementation.

 And vmx_get_rflags must
 properly inject the saved flags instead of masking them out.
 Yes.  No one ever bothers to play with iopl in real mode, so we
 never noticed this.  We do this for cr0 for example.

 It's a bugfix that can go into -stable and supported distribution kernels.
 Well, would be happy to throw out tons of workarounds based on this
 approach. :)
 
 Do you mean you'd be interested in writing the patch? Sure, go ahead,
 let me know otherwise.

ATM, I'm fighting against too many customer bugs. And you have the test
case for this particular issue, I assume. So don't wait for me here.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


[GSoC 2010] Pass-through filesystem support.

2010-04-08 Thread Mohammed Gamal
Hi,
Now that Cam is almost done with his ivshmem patches, I was thinking
of another idea for GSoC which is improving the pass-though
filesystems.
I've got some questions on that:

1- What does the community prefer to use and improve? CIFS, 9p, or
both? And which is better taken up for GSoC.

2- With respect to CIFS. I wonder how the shares are supposed to be
exposed to the guest. Should the Samba server be modified to be able
to use unix domain sockets instead of TCP ports and then QEMU
communicating on these sockets. With that approach, how should the
guest be able to see the exposed share? And what is the problem of
using Samba with TCP ports?

3- In addition, I see the idea mentions that some Windows code needs
to be written to use network shares on a special interface. What's
that interface? And what's the nature of that Windows code? (a driver
a la guest additions?)

Regards,
Mohammed


Re: [GSoC 2010] Pass-through filesystem support.

2010-04-08 Thread Mohammed Gamal
On Thu, Apr 8, 2010 at 6:01 PM, Mohammed Gamal m.gamal...@gmail.com wrote:
 Hi,
 Now that Cam is almost done with his ivshmem patches, I was thinking
 of another idea for GSoC which is improving the pass-though
 filesystems.
 I've got some questions on that:

 1- What does the community prefer to use and improve? CIFS, 9p, or
 both? And which is better taken up for GSoC.

 2- With respect to CIFS. I wonder how the shares are supposed to be
 exposed to the guest. Should the Samba server be modified to be able
 to use unix domain sockets instead of TCP ports and then QEMU
 communicating on these sockets. With that approach, how should the
 guest be able to see the exposed share? And what is the problem of
 using Samba with TCP ports?

 3- In addition, I see the idea mentions that some Windows code needs
 to be written to use network shares on a special interface. What's
 that interface? And what's the nature of that Windows code? (a driver
 a la guest additions?)

 Regards,
 Mohammed


P.S.: A gentle reminder. The proposal submission deadline is tomorrow,
so I'd appreciate responses as soon as possible.

Regards,
Mohammed


Re: [RFC] vhost-blk implementation

2010-04-08 Thread Stefan Hajnoczi
On Fri, Mar 26, 2010 at 6:53 PM, Eran Rom er...@il.ibm.com wrote:
 Christoph Hellwig hch at infradead.org writes:


 Ok.  cache=writeback performance is something I haven't bothered looking
 at at all.  For cache=none any streaming write or random workload with
 large enough record sizes got basically the same performance as native
 using kernel aio, and same for write but slightly degraded for reads
 using the thread pool.  See my attached JLS presentation for some
 numbers.

 Looks like the presentation did not make it...

I am interested in the JLS presentation too.  Here is what I found,
hope it's the one you meant, Christoph:

http://events.linuxfoundation.org/images/stories/slides/jls09/jls09_hellwig.odp

Stefan


Re: [GSoC 2010] Pass-through filesystem support.

2010-04-08 Thread Stefan Hajnoczi
On Thu, Apr 8, 2010 at 5:02 PM, Mohammed Gamal m.gamal...@gmail.com wrote:
 On Thu, Apr 8, 2010 at 6:01 PM, Mohammed Gamal m.gamal...@gmail.com wrote:
 1- What does the community prefer to use and improve? CIFS, 9p, or
 both? And which is better taken up for GSoC.

There have been recent patches for filesystem passthrough using 9P:

http://www.mail-archive.com/qemu-de...@nongnu.org/msg28100.html

You might want to consider them if you haven't seen them already.

Stefan


Problem with KVM guest switching to x86 long mode

2010-04-08 Thread Pekka Enberg
Hi!

I am working on a light-weight KVM userspace launcher for Linux and am
a bit stuck with a guest Linux kernel restarting when it tries to enter
long mode.

The register dump looks like this:

penb...@tiger:~/vm$ ./kvm bzImage
KVM exit reason: 8 (KVM_EXIT_SHUTDOWN)
Registers:
 rip: 001000ed   rsp: 005d54b8 flags: 00010046
 rax: 8001   rbx: 01f2c000   rcx: c080
 rdx:    rsi: 00013670   rdi: 02408000
 rbp: 0010   r8:     r9:  
 r10:    r11:    r12: 
 r13:    r14:    r15: 
 cr0: 8011   cr2: 001000ed   cr3: 02402000
 cr4: 0020   cr8: 
Segment registers:
 register  selector  base  limit type  p dpl db s l g avl
 cs0010      0b1 0   1  1 0 1 0
 ss0018      031 0   1  1 0 1 0
 ds0018      031 0   1  1 0 1 0
 es0018      031 0   1  1 0 1 0
 fs0018      031 0   1  1 0 1 0
 gs0018      031 0   1  1 0 1 0
 tr0020  1000  0067  0b1 0   0  0 0 0 0
 ldt         000 0   0  0 0 0 0
 [ efer: 0500  apic base:   nmi: disabled ]
Interrupt bitmap:
    
Code: 08 49 75 f3 8d 83 00 60 4d 00 0f 22 d8 b9 80 00 00 c0 0f 32 0f
ba e8 08 0f 30 6a 10 8d 85 00 02 00 00 50 b8 01 00 00 80 0f 22 c0 cb
f4 eb fd 9c 6a 00 9d 9c 58 89 c3 35 00 00 20 00 50 9d 9c 58

Using Linux 'scripts/decodecode', we can see that we are at
startup_32() of arch/x86/boot/compressed/head_64.S:

All code
========
   0:   08 49 75                or     %cl,0x75(%rcx)
   3:   f3 8d 83 00 60 4d 00    repz lea 0x4d6000(%rbx),%eax
   a:   0f 22 d8                mov    %rax,%cr3
   d:   b9 80 00 00 c0          mov    $0xc0000080,%ecx
  12:   0f 32                   rdmsr
  14:   0f ba e8 08             bts    $0x8,%eax
  18:   0f 30                   wrmsr
  1a:   6a 10                   pushq  $0x10
  1c:   8d 85 00 02 00 00       lea    0x200(%rbp),%eax
  22:   50                      push   %rax
  23:   b8 01 00 00 80          mov    $0x80000001,%eax
  28:   0f 22 c0                mov    %rax,%cr0
  2b:*  cb                      lret    <-- trapping instruction
  2c:   f4                      hlt
  2d:   eb fd                   jmp    0x2c
  2f:   9c                      pushfq
  30:   6a 00                   pushq  $0x0
  32:   9d                      popfq
  33:   9c                      pushfq
  34:   58                      pop    %rax
  35:   89 c3                   mov    %eax,%ebx
  37:   35 00 00 20 00          xor    $0x200000,%eax
  3c:   50                      push   %rax
  3d:   9d                      popfq
  3e:   9c                      pushfq
  3f:   58                      pop    %rax

I already asked Avi in private about this and he suggested I'd post a
register dump to the list. Please note that I am in no way ruling out
a bug in our fakebios emulation but my gut feeling is that I'm just
missing something obvious in the KVM setup.

For those that might be interested, source code to the launcher is
available here:

  git clone git://github.com/penberg/vm.git

Launching a Linux kernel is as simple as:

  make ; ./kvm bzImage
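
(For readers who have not used the raw KVM ioctl interface: a minimal sketch of
the flow such a launcher goes through, with made-up sizes and all error handling
and guest setup omitted; this is not Pekka's actual code.)

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR);
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

        /* back 64 MB of guest physical memory with an anonymous mapping */
        void *ram = mmap(NULL, 64 << 20, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
                .slot            = 0,
                .guest_phys_addr = 0,
                .memory_size     = 64 << 20,
                .userspace_addr  = (unsigned long)ram,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        /* ... load the bzImage into ram and set up registers/segments ... */

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        for (;;) {
                ioctl(vcpu, KVM_RUN, 0);
                if (run->exit_reason == KVM_EXIT_SHUTDOWN) {
                        printf("KVM exit reason: %u (KVM_EXIT_SHUTDOWN)\n",
                               run->exit_reason);
                        break;
                }
                /* handle KVM_EXIT_IO, KVM_EXIT_MMIO, ... */
        }
        return 0;
}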

Pekka


Re: Problem with KVM guest switching to x86 long mode

2010-04-08 Thread Avi Kivity

On 04/08/2010 09:26 PM, Pekka Enberg wrote:

Hi!

I am working on a light-weight KVM userspace launcher for Linux and am
a bit stuck with a guest Linux kernel restarting when it tries to enter
long mode.

The register dump looks like this:

penb...@tiger:~/vm$ ./kvm bzImage
KVM exit reason: 8 (KVM_EXIT_SHUTDOWN)
Registers:
  rip: 001000ed   rsp: 005d54b8 flags: 00010046
  rax: 8001   rbx: 01f2c000   rcx: c080
  rdx:    rsi: 00013670   rdi: 02408000
  rbp: 0010   r8:     r9:  
  r10:    r11:    r12: 
  r13:    r14:    r15: 
  cr0: 8011   cr2: 001000ed   cr3: 02402000
  cr4: 0020   cr8: 
Segment registers:
  register  selector  base  limit type  p dpl db s l g avl
  cs0010      0b1 0   1  1 0 1 0
  ss0018      031 0   1  1 0 1 0
  ds0018      031 0   1  1 0 1 0
  es0018      031 0   1  1 0 1 0
  fs0018      031 0   1  1 0 1 0
  gs0018      031 0   1  1 0 1 0
  tr0020  1000  0067  0b1 0   0  0 0 0 0
  ldt         000 0   0  0 0 0 0
   


These all look reasonable.  Please add a gdtr dump and an idtr dump.


   2b:*  cb                      lret    <-- trapping instruction
   


Post the two u32s at ss:rsp - ss:rsp+8.  That will tell us where the 
guest is trying to return.  Actually, from the dump:


  1a:   6a 10                   pushq  $0x10
  1c:   8d 85 00 02 00 00       lea    0x200(%rbp),%eax
  22:   50                      push   %rax

it looks like you're returning to segment 0x10, this should be the word 
at ss:rsp+4.  So if you dump the 2 u32s at 
gdtr.base+0x10..gdtr.base+0x18 we'll see if there's anything wrong with 
the segment descriptor.
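
(A sketch of how a launcher that keeps all guest RAM in one flat mapping could
produce those dumps; guest_ram and the addresses are placeholders taken from
the register dump, not Pekka's actual code.)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* print 'count' little-endian u32s starting at guest-physical address gpa */
static void dump_u32s(const uint8_t *guest_ram, uint64_t gpa, int count,
                      const char *what)
{
        printf("%s:\n", what);
        for (int i = 0; i < count; i++) {
                uint32_t v;
                memcpy(&v, guest_ram + gpa + 4 * i, sizeof(v));
                printf("  0x%08" PRIx64 ": %08" PRIx32 "\n",
                       gpa + 4 * i, v);
        }
}

/* usage, with values from the dump above:
 *   dump_u32s(guest_ram, 0x005d54b8, 2, "ss:rsp");             stack words
 *   dump_u32s(guest_ram, gdt_base + 0x10, 2, "cs=0x10 descriptor");
 */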


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Problem with KVM guest switching to x86 long mode

2010-04-08 Thread Pekka Enberg

Avi Kivity wrote:

These all look reasonable.  Please add a gdtr dump and an idtr dump.


Done.


   2b:*  cb                      lret    <-- trapping instruction
   


Post the two u32s at ss:rsp - ss:rsp+8.  That will tell us where the 
guest is trying to return.  Actually, from the dump:


  1a:   6a 10                   pushq  $0x10
  1c:   8d 85 00 02 00 00       lea    0x200(%rbp),%eax
  22:   50                      push   %rax

it looks like you're returning to segment 0x10, this should be the word 
at ss:rsp+4.  So if you dump the 2 u32s at 
gdtr.base+0x10..gdtr.base+0x18 we'll see if there's anything wrong with 
the segment descriptor.


Here you go:

penb...@tiger:~/vm$ ./kvm bzImage
KVM exit reason: 8 (KVM_EXIT_SHUTDOWN)
Registers:
 rip: 001000ed   rsp: 005d54b8 flags: 00010046
 rax: 8001   rbx: 01f2c000   rcx: c080
 rdx:    rsi: 00013670   rdi: 02408000
 rbp: 0010   r8:     r9:  
 r10:    r11:    r12: 
 r13:    r14:    r15: 
 cr0: 8011   cr2: 001000ed   cr3: 02402000
 cr4: 0020   cr8: 
Segment registers:
 register  selector  base  limit type  p dpl db s l g avl
 cs0010      0b1 0   1  1 0 1 0
 ss0018      031 0   1  1 0 1 0
 ds0018      031 0   1  1 0 1 0
 es0018      031 0   1  1 0 1 0
 fs0018      031 0   1  1 0 1 0
 gs0018      031 0   1  1 0 1 0
 tr0020  1000  0067  0b1 0   0  0 0 0 0
 ldt         000 0   0  0 0 0 0
 gdt 005ca458  0030
 idt   
 [ efer: 0500  apic base:   nmi: disabled ]
Interrupt bitmap:
    
Code: 08 49 75 f3 8d 83 00 60 4d 00 0f 22 d8 b9 80 00 00 c0 0f 32 0f ba 
e8 08 0f 30 6a 10 8d 85 00 02 00 00 50 b8 01 00 00 80 0f 22 c0 cb f4 
eb fd 9c 6a 00 9d 9c 58 89 c3 35 00 00 20 00 50 9d 9c 58

Stack:
  0x005d54b8: 00 02 10 00  10 00 00 00 -- return value
  0x005d54c0: 00 00 00 00  00 00 00 00
  0x005d54c8: 00 00 00 00  00 00 00 00
  0x005d54d0: 00 00 00 00  00 00 00 00
GDT:
  0x005ca458: 30 00 58 a4  5c 00 00 00
  0x005ca460: 00 00 00 00  00 00 00 00
  0x005ca468: ff ff 00 00  00 9a af 00 -- gtr.base + 0x10
  0x005ca470: ff ff 00 00  00 92 cf 00
  0x005ca478: 00 00 00 00  00 89 80 00
  0x005ca480: 00 00 00 00  00 00 00 00

Pekka


hugetlbfs and KSM

2010-04-08 Thread Bernhard Schmidt
Hi,

running Debian Squeeze with a 2.6.32-3-amd64 kernel and qemu-kvm 0.12.3
I enabled hugetlbfs on a rather small box with about five similar VMs
today (all Debian Squeeze amd64, but different services)

Pro:
* system load on the host has gone way down (by about 50%)

Contra:
* KSM seems to be largely ineffective (100MB saved -> 1.3MB saved)

Am I doing something wrong? Is this a bug? Is this generally impossible
with large pages (which might explain the lower load on the host, if
large pages are not scanned)? Or is it just way less likely to have
identical pages at that size?

Bernhard



Re: Problem with KVM guest switching to x86 long mode

2010-04-08 Thread Avi Kivity

On 04/08/2010 09:59 PM, Pekka Enberg wrote:



   2b:*  cb                      lret    <-- trapping instruction


Post the two u32s at ss:rsp - ss:rsp+8.  That will tell us where the 
guest is trying to return.  Actually, from the dump:


  1a:   6a 10                   pushq  $0x10
  1c:   8d 85 00 02 00 00       lea    0x200(%rbp),%eax
  22:   50                      push   %rax

it looks like you're returning to segment 0x10, this should be the 
word at ss:rsp+4.  So if you dump the 2 u32s at 
gdtr.base+0x10..gdtr.base+0x18 we'll see if there's anything wrong 
with the segment descriptor.


Here you go:


I was asking for the wrong things.



penb...@tiger:~/vm$ ./kvm bzImage
KVM exit reason: 8 (KVM_EXIT_SHUTDOWN)
Registers:
 rip: 001000ed   rsp: 005d54b8 flags: 00010046
 rax: 8001   rbx: 01f2c000   rcx: c080
 rdx:    rsi: 00013670   rdi: 02408000
 rbp: 0010   r8:     r9:  
 r10:    r11:    r12: 
 r13:    r14:    r15: 
 cr0: 8011   cr2: 001000ed   cr3: 02402000


cr2 points at rip.  So it isn't lret not executing correctly, it's the 
cpu not able to fetch lret at all.


The code again:


   23:   b8 01 00 00 80          mov    $0x80000001,%eax
   28:   0f 22 c0                mov    %rax,%cr0
   2b:*  cb                      lret    <-- trapping instruction
   


The instruction at 0x28 is enabling paging, next insn fetch faults, so 
the paging structures must be incorrect.


Questions:
- what is the u64 at cr3? (call it pte4)
- what is the u64 at (pte4 & ~0xfff)?  (call it pte3)
- what is the u64 at (pte3 & ~0xfff)? (pte2)
- what is the u64 at ((pte2 & ~0xfff) + 2048)? (pte1)

Note if bit 7 of pte2 is set, then pte1 is unneeded.
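
(In launcher terms, the walk being asked for looks roughly like this, again
assuming flat guest RAM mapped at guest_ram and with the indices hard-coded for
the faulting address 0x001000ed; a sketch only.)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t read_u64(const uint8_t *guest_ram, uint64_t gpa)
{
        uint64_t v;
        memcpy(&v, guest_ram + gpa, sizeof(v));
        return v;
}

static void walk_for_rip(const uint8_t *guest_ram, uint64_t cr3)
{
        /* rip = 0x001000ed: PML4/PDPT/PD indices are 0, PT index is 0x100 */
        uint64_t pte4 = read_u64(guest_ram, cr3);               /* PML4E */
        uint64_t pte3 = read_u64(guest_ram, pte4 & ~0xfffULL);  /* PDPTE */
        uint64_t pte2 = read_u64(guest_ram, pte3 & ~0xfffULL);  /* PDE   */

        printf("pte4=%016" PRIx64 " pte3=%016" PRIx64 " pte2=%016" PRIx64 "\n",
               pte4, pte3, pte2);
        if (!(pte2 & (1ULL << 7))) {            /* bit 7 set = 2MB page */
                uint64_t pte1 = read_u64(guest_ram,
                                         (pte2 & ~0xfffULL) + 0x100 * 8);
                printf("pte1=%016" PRIx64 "\n", pte1);
        }
}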

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: hugetlbfs and KSM

2010-04-08 Thread David Martin
I asked this question quite a while ago, it seems huge pages do not get scanned 
for merging.

David Martin

- Bernhard Schmidt be...@birkenwald.de wrote:

 Hi,
 
 running Debian Squeeze with a 2.6.32-3-amd64 kernel and qemu-kvm
 0.12.3
 I enabled hugetlbfs on a rather small box with about five similar VMs
 today (all Debian Squeeze amd64, but different services)
 
 Pro:
 * system load on the host has gone way down (by about 50%)
 
 Contra:
 * KSM seems to be largely ineffective (100MB saved -> 1.3MB saved)
 
 Am I doing something wrong? Is this a bug? Is this generally
 impossible
 with large pages (which might explain the lower load on the host, if
 large pages are not scanned)? Or is it just way less likely to have
 identical pages at that size?
 
 Bernhard
 


Re: [PATCH UNTESTED] KVM: VMX: Save/restore rflags.vm correctly in real mode

2010-04-08 Thread Marcelo Tosatti
On Thu, Apr 08, 2010 at 06:19:35PM +0300, Avi Kivity wrote:
 Currently we set eflags.vm unconditionally when entering real mode emulation
 through virtual-8086 mode, and clear it unconditionally when we enter 
 protected
 mode.  The means that the following sequence
 
   KVM_SET_REGS  (rflags.vm=1)
   KVM_SET_SREGS (cr0.pe=1)
 
 Ends up with rflags.vm clear due to KVM_SET_SREGS triggering enter_pmode().
 
 Fix by shadowing rflags.vm (and rflags.iopl) correctly while in real mode:
 reads and writes to those bits access a shadow register instead of the actual
 register.
 
 Signed-off-by: Avi Kivity a...@redhat.com

Tested and applied, thanks.



kvm autotest, how to disable address cache

2010-04-08 Thread Ryan Harper
Is there any way to disable this?  I'm running a guest on -net user
networking, no interaction with the host network, yet, during the test,
I get tons of:

15:50:48 DEBUG| (address cache) Adding cache entry: 00:1a:64:39:04:91 ---> 
10.0.253.16
15:50:49 DEBUG| (address cache) Adding cache entry: e4:1f:13:2c:e5:04 ---> 
10.0.253.132

many times for the same mapping.  If I'm not using tap networking on a
public bridge, what's this address cache doing for me? And, how the heck
do turn this off?


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


Re: Setting nx bit in virtual CPU

2010-04-08 Thread Richard Simpson
On 08/04/10 09:52, Andre Przywara wrote:

 Can you try to boot the attached multiboot kernel, which just outputs
 a brief CPUID dump?
 $ qemu-kvm -kernel cpuid_mb -vnc :0
 (Unfortunately I have no serial console support in there yet, so you
 either have to write the values down or screenshot it).
 In the 4th line from the bottom it should print NX (after SYSCALL).

OK, that was fun!  Resulting screen shots are attached.

...default.png  With command line above.
...cpu_host.png With -cpu host option added.
...no_kvm.png   With -no-kvm option added.

I hope that helps!

Richard
[attachment: cpuid_mb_screendump_cpu_host.png]
[attachment: cpuid_mb_screendump_default.png]
[attachment: cpuid_mb_screendump_no_kvm.png]

Re: raw disks no longer work in latest kvm (kvm-88 was fine)

2010-04-08 Thread Antoine Martin


Antoine Martin wrote:
 On 03/08/2010 02:35 AM, Avi Kivity wrote:
 On 03/07/2010 09:25 PM, Antoine Martin wrote:
 On 03/08/2010 02:17 AM, Avi Kivity wrote:
 On 03/07/2010 09:13 PM, Antoine Martin wrote:
 What version of glibc do you have installed?

 Latest stable:
 sys-devel/gcc-4.3.4
 sys-libs/glibc-2.10.1-r1


 $ git show glibc-2.10~108 | head
 commit e109c6124fe121618e42ba882e2a0af6e97b8efc
 Author: Ulrich Drepper drep...@redhat.com
 Date:   Fri Apr 3 19:57:16 2009 +

 * misc/Makefile (routines): Add preadv, preadv64, pwritev,
 pwritev64.

 * misc/Versions: Export preadv, preadv64, pwritev, pwritev64
 for
 GLIBC_2.10.
 * misc/sys/uio.h: Declare preadv, preadv64, pwritev, pwritev64.
 * sysdeps/unix/sysv/linux/kernel-features.h: Add entries for
 preadv

 You might get away with rebuilding glibc against the 2.6.33 headers.

 The latest kernel headers available in gentoo (and they're masked
 unstable):
 sys-kernel/linux-headers-2.6.32

 So I think I will just keep using Christoph's patch until .33 hits
 portage.
 Unless there's any reason not to? I would rather keep my system clean.
 I can try it though, if that helps you clear things up?

 preadv/pwritev was actually introduced in 2.6.30.  Perhaps you last
 built glibc before that?  If so, a rebuild may be all that's necessary.

 To be certain, I've rebuilt qemu-kvm against:
 linux-headers-2.6.33 + glibc-2.10.1-r1 (both freshly built)
 And still no go!
 I'm still having to use the patch which disables preadv unconditionally...
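
(A quick standalone way to check what the installed glibc actually exposes,
independent of qemu's configure; this is only a sketch of a configure-style
probe, not qemu's exact test.)

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        char buf[1];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

        /* If this compiles and links, glibc declares and exports preadv().
         * At runtime it attempts a one-byte positioned read from stdin;
         * on a terminal that fails with ESPIPE, which is fine here. */
        ssize_t ret = preadv(0, &iov, 1, 0);
        printf("preadv returned %zd\n", ret);
        return 0;
}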

Better late than never, here's the relevant part of the strace (for the
unpatched case where it fails):

stat(./fs, {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 41), ...}) = 0
open(./fs, O_RDWR|O_DIRECT|O_CLOEXEC) = 12

lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31266] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31266] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31266] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31266] lseek(12, 0, SEEK_SET)  = 0
[pid 31266] read(12,
\240\246E\32\r\21\367c\212\316Xn\177e'\310}\234\1\273`\371\266\247\r\1nj\332\32\221\26...,
512) = 512
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12, iQ\35
\271O\203vj\ve[Ni}\355\263\272\4#yMo\266.\341\21\340Y5\204\20..., 4096,
1321851805696) = 4096
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31273] pread(12,  unfinished ...
[pid 31267] lseek(12, 0, SEEK_END)  = 1321851815424
[pid 31271] pread(12,  unfinished ...
[pid 31267] lseek(12, 

Re: Setting nx bit in virtual CPU

2010-04-08 Thread Richard Simpson
On 08/04/10 08:23, Avi Kivity wrote:

 Strange.  Can you hack qemu-kvm's cpuid code where it issues the ioctl
 KVM_SET_CPUID2 to show what the data is?  I'm not sure where that code is in
 your version of qemu-kvm.

Gad, the last time I tried to mess around with this sort of low level
code was many years ago when I was a keen young bachelor burning the
midnight oil trying to get the weird IDE controller on my Alpha to work
properly!  Anyway, I have tried to give it a go.

I found a file called qemu-kvm-x86.c

It contained a function called kvm_setup_cpuid2 which I modified as follows:

int kvm_setup_cpuid2(CPUState *env, int nent,
                     struct kvm_cpuid_entry2 *entries)
{
    struct kvm_cpuid2 *cpuid;
    int r, i;

    fprintf(stderr, "cpuid=nent %d\n", nent);
    for (i = 0; i < nent; i++) {
        fprintf(stderr, "%x %x %x %x %x %x %x\n",
                entries[i].function, entries[i].index, entries[i].flags,
                entries[i].eax, entries[i].ebx, entries[i].ecx,
                entries[i].edx);
    }
    cpuid = qemu_malloc(sizeof(*cpuid) + nent * sizeof(*entries));

    cpuid->nent = nent;
    memcpy(cpuid->entries, entries, nent * sizeof(*entries));
    r = kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid);
    free(cpuid);
    return r;
}

So, basically I go round a loop and print out the contents of each
kvm_cpuid_entry2 structure.

Results below, using Andre Przywara's handy nano-kernel.  I do hope that
some of this makes some kind of sense!

qemu-kvm -kernel cpuid_mb -vnc :0

cpuid=nent 21
4000 0 0 0 4b4d564b 564b4d56 4d
4001 0 0 7 0 0 0
0 0 0 4 68747541 444d4163 69746e65
1 0 0 623 800 80002001 78bfbfd
2 0 0 1 0 0 2c307d
3 0 0 0 0 0 0
4 0 1 121 1c0003f 3f 1
4 1 1 122 1c0003f 3f 1
4 2 1 143 3c0003f fff 1
4 3 1 0 0 0 0
8000 0 0 800a 68747541 444d4163 69746e65
8001 0 0 623 0 1 2181abfd
8002 0 0 554d4551 72695620 6c617574 55504320
8003 0 0 72657620 6e6f6973 312e3020 332e32
8004 0 0 0 0 0 0
8005 0 0 1ff01ff 1ff01ff 40020140 40020140
8006 0 0 0 42004200 2008140 0
8007 0 0 0 0 0 0
8008 0 0 3028 0 0 0
8009 0 0 0 0 0 0
800a 0 0 1 10 0 0

qemu-kvm -kernel cpuid_mb -cpu host -vnc :0

cpuid=nent 29
4000 0 0 0 4b4d564b 564b4d56 4d
4001 0 0 7 0 0 0
0 0 0 1 68747541 444d4163 69746e65
1 0 0 40ff2 800 80002001 78bfbff
8000 0 0 8018 68747541 444d4163 69746e65
8001 0 0 40ff2 0 1 23c3fbff
8002 0 0 20444d41 6c687441 74286e6f 3620296d
8003 0 0 72502034 7365636f 20726f73 30303233
8004 0 0 2b 0 0 0
8005 0 0 1ff01ff 1ff01ff 40020140 40020140
8006 0 0 0 42004200 2008140 0
8007 0 0 0 0 0 0
8008 0 0 3028 0 0 0
8009 0 0 0 0 0 0
800a 0 0 1 10 0 0
800b 0 0 0 0 0 0
800c 0 0 0 0 0 0
800d 0 0 0 0 0 0
800e 0 0 0 0 0 0
800f 0 0 0 0 0 0
8010 0 0 0 0 0 0
8011 0 0 0 0 0 0
8012 0 0 0 0 0 0
8013 0 0 0 0 0 0
8014 0 0 0 0 0 0
8015 0 0 0 0 0 0
8016 0 0 0 0 0 0
8017 0 0 0 0 0 0
8018 0 0 0 0 0 0

If I try with -no-kvm then nothing gets printed, presumably because this
is a kvm specific function and doesn't get called in that case.


Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-08 Thread Sridhar Samudrala
On Mon, 2010-04-05 at 10:35 -0700, Sridhar Samudrala wrote:
 On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
  On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:
   Make vhost scalable by creating a separate vhost thread per vhost
   device. This provides better scaling across multiple guests and with
   multiple interfaces in a guest.
  
  Thanks for looking into this. An alternative approach is
  to simply replace create_singlethread_workqueue with
  create_workqueue which would get us a thread per host CPU.
  
  It seems that in theory this should be the optimal approach
  wrt CPU locality, however, in practice a single thread
  seems to get better numbers. I have a TODO to investigate this.
  Could you try looking into this?
 
 Yes. I tried using create_workqueue(), but the results were not good
 at least when the number of guest interfaces is less than the number
 of CPUs. I didn't try more than 8 guests.
 Creating a separate thread per guest interface seems to be more
 scalable based on the testing i have done so far.
 
 I will try some more tests and get some numbers to compare the following
 3 options.
 - single vhost thread
 - vhost thread per cpu
 - vhost thread per guest virtio interface
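
(For readers comparing the three options, a schematic of what each amounts to
in code; this is not Sridhar's patch, and my_vhost_dev / vhost_poll_fn / the
worker field are hypothetical names.)

#include <linux/workqueue.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/errno.h>
#include <linux/err.h>

struct my_vhost_dev {
        struct task_struct *worker;     /* only used by model (c) */
};

static struct workqueue_struct *vhost_wq;

static int vhost_pick_threading(struct my_vhost_dev *dev, int model,
                                int (*vhost_poll_fn)(void *))
{
        switch (model) {
        case 0: /* (a) current default: one thread shared by all devices */
                vhost_wq = create_singlethread_workqueue("vhost");
                return vhost_wq ? 0 : -ENOMEM;
        case 1: /* (b) one worker thread per host CPU */
                vhost_wq = create_workqueue("vhost");
                return vhost_wq ? 0 : -ENOMEM;
        default: /* (c) one dedicated kernel thread per vhost device */
                dev->worker = kthread_run(vhost_poll_fn, dev, "vhost-%d",
                                          current->pid);
                return IS_ERR(dev->worker) ? PTR_ERR(dev->worker) : 0;
        }
}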

Here are the results with netperf TCP_STREAM 64K guest to host on an
8-cpu Nehalem system. It shows cumulative bandwidth in Mbps and host 
CPU utilization.

Current default single vhost thread
---
1 guest:  12500  37%
2 guests: 12800  46%
3 guests: 12600  47%
4 guests: 12200  47%
5 guests: 12000  47%
6 guests: 11700  47%
7 guests: 11340  47%
8 guests: 11200  48%

vhost thread per cpu

1 guest:   4900 25%
2 guests: 10800 49%
3 guests: 17100 67%
4 guests: 20400 84%
5 guests: 21000 90%
6 guests: 22500 92%
7 guests: 23500 96%
8 guests: 24500 99%

vhost thread per guest interface

1 guest:  12500 37%
2 guests: 21000 72%
3 guests: 21600 79%
4 guests: 21600 85%
5 guests: 22500 89%
6 guests: 22800 94%
7 guests: 24500 98%
8 guests: 26400 99%

Thanks
Sridhar




Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-08 Thread Rick Jones



Here are the results with netperf TCP_STREAM 64K guest to host on an
8-cpu Nehalem system.


I presume you mean 8 core Nehalem-EP, or did you mean 8 processor Nehalem-EX?

Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it a lot!-) 
but I find it incomplete and also like to run things like single-instance TCP_RR 
and multiple-instance, multiple transaction (./configure --enable-burst) 
TCP_RR tests, particularly when concerned with scaling issues.


happy benchmarking,

rick jones

It shows cumulative bandwidth in Mbps and host 
CPU utilization.


Current default single vhost thread
---
1 guest:  12500  37%
2 guests: 12800  46%

3 guests: 12600  47%
4 guests: 12200  47%
5 guests: 12000  47%
6 guests: 11700  47%
7 guests: 11340  47%
8 guests: 11200  48%

vhost thread per cpu

1 guest:   4900 25%
2 guests: 10800 49%
3 guests: 17100 67%
4 guests: 20400 84%
5 guests: 21000 90%
6 guests: 22500 92%
7 guests: 23500 96%
8 guests: 24500 99%

vhost thread per guest interface

1 guest:  12500 37%
2 guests: 21000 72%
3 guests: 21600 79%
4 guests: 21600 85%
5 guests: 22500 89%
6 guests: 22800 94%
7 guests: 24500 98%
8 guests: 26400 99%

Thanks
Sridhar




Re: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.

2010-04-08 Thread Stephen Hemminger
On Tue, 6 Apr 2010 14:26:29 +0800
Xin, Xiaohui xiaohui@intel.com wrote:

 How do you deal with the DoS problem of hostile user space app posting huge
 number of receives and never getting anything.   
 
 That's a problem we are trying to deal with. It's critical for long term.
 Currently, we tried to limit the pages it can pin, but not sure how much is 
 reasonable.
 For now, the buffers submitted is from guest virtio-net driver, so it's safe 
 in some extent
 just for now.

It is critical even now. Once you get past toy benchmarks you will see things 
like
Java processes with 1000 threads all reading at once. 