Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
On 04/09/2012 04:21, Nicholas A. Bellinger wrote:
 @@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
  struct virtio_scsi_cmd *cmd = buf;
  struct scsi_cmnd *sc = cmd->sc;
  struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
 +struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
 +
 +atomic_dec(&tgt->reqs);
  
 
 As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
 atomic_inc_return(&tgt->reqs) check, it seems like using atomic_dec() w/o
 smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
 accessors properly, no..?

No, only a single thing is being accessed, and there is no need to
order the decrement with respect to preceding or subsequent accesses to
other locations.

In other words, tgt->reqs is already synchronized with itself, and that
is enough.

(Besides, on x86 smp_mb__after_atomic_dec is a nop).
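
A minimal user-space sketch of that argument, using C11 atomics rather than
the kernel's atomic_t (the struct and function names are hypothetical, not
taken from the patch): the increment returns a value and is fully ordered,
as atomic_inc_return() is in the kernel, while the decrement only has to be
atomic with respect to the counter itself, so a relaxed operation is enough.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-target state; only the counter matters. */
struct target_state {
    atomic_int reqs;    /* number of outstanding requests */
};

/* Submission side: returns the new count, like atomic_inc_return().
 * The default (sequentially consistent) ordering models the full barrier
 * that value-returning atomics imply in the kernel. */
static int target_get(struct target_state *tgt)
{
    return atomic_fetch_add(&tgt->reqs, 1) + 1;
}

/* Completion side: like atomic_dec(). The update is atomic, but nothing
 * orders it against accesses to other memory locations. */
static void target_put(struct target_state *tgt)
{
    atomic_fetch_sub_explicit(&tgt->reqs, 1, memory_order_relaxed);
}
```

No increment or decrement can be lost here, which is all the counter needs.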

 +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
 +   struct scsi_cmnd *sc)
 +{
 +struct virtio_scsi *vscsi = shost_priv(sh);
 +struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
 +unsigned long flags;
 +u32 queue_num;
 +
 +/* Using an atomic_t for tgt->reqs lets the virtqueue handler
 + * decrement it without taking the spinlock.
 + */
 +spin_lock_irqsave(&tgt->tgt_lock, flags);
 +if (atomic_inc_return(&tgt->reqs) == 1) {
 +queue_num = smp_processor_id();
 +while (unlikely(queue_num >= vscsi->num_queues))
 +queue_num -= vscsi->num_queues;
 +tgt->req_vq = &vscsi->req_vqs[queue_num];
 +}
 +spin_unlock_irqrestore(&tgt->tgt_lock, flags);
 +return virtscsi_queuecommand(vscsi, tgt, sc);
 +}
 +
 
 The extra memory barriers to get this right for the current approach are
 just going to slow things down even more for virtio-scsi-mq..

virtio-scsi multiqueue has a performance benefit up to 20% (for a single
LUN) or 40% (on overall bandwidth across multiple LUNs).  I doubt that a
single memory barrier can have that much impact. :)

The way to go to improve performance even more is to add new virtio APIs
for finer control of the usage of the ring.  These should let us avoid
copying the sg list and almost get rid of the tgt_lock; even though the
locking is quite efficient in virtio-scsi (see how tgt_lock and vq_lock
are pipelined so as to overlap the preparation of two requests), it
should give a nice improvement and especially avoid a kmalloc with small
requests.  I may have some time for it next month.

 Jens's approach is what we will ultimately need to re-architect in SCSI
 core if we're ever going to move beyond the issues of legacy host_lock,
 so I'm wondering if maybe this is the direction that virtio-scsi-mq
 needs to go in as well..?

We can see after the block layer multiqueue work goes in...  I also need
to look more closely at Jens's changes.

Have you measured the host_lock to be a bottleneck in high-iops
benchmarks, even for a modern driver that does not hold it in
queuecommand?  (Certainly it will become more important as the
virtio-scsi queuecommand becomes thinner and thinner).  If so, we can
start looking at limiting host_lock usage in the fast path.

BTW, supporting this in tcm-vhost should be quite trivial, as all the
request queues are the same and all serialization is done in the
virtio-scsi driver.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/8] KVM paravirt remote flush tlb

2012-09-04 Thread Avi Kivity
On 09/04/2012 04:30 AM, Nikunj A Dadhania wrote:
 On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity a...@redhat.com wrote:
 On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
  
  kernbench(lower is better)
  ==========================
          base  pvflushv4  %improvement
  1VM  48.5800    46.8513       3.55846
  2VM 108.1823   104.6410       3.27346
  3VM 183.2733   163.3547      10.86825
  
  ebizzy(higher is better)
  ========================
           base  pvflushv4  %improvement
  1VM 2414.5000  2089.8750     -13.44481
  2VM 2167.6250  2371.7500       9.41699
  3VM 1600.      2102.5556      31.40060
  
 
 The regression is worrying.  We're improving the contended case at the
 cost of the non-contended case, this is usually the wrong thing to do.
 Do we have any clear idea of the cause of the regression?
 
 Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
 is around 2%, while for the contended case it's around 10%. That is what is
 helping the contended case.

But what is causing the regression for the uncontended case?
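
For reference, the %improvement column in the quoted tables is consistent
with the arithmetic sketched below. The function names are hypothetical,
and the sign convention is inferred from the numbers, not stated in the
thread: for kernbench a drop in the value counts as an improvement, for
ebizzy a rise does.

```c
/* "Lower is better" metric (kernbench): a smaller value is a gain. */
static double improvement_lower_is_better(double base, double patched)
{
    return (base - patched) / base * 100.0;
}

/* "Higher is better" metric (ebizzy): a larger value is a gain. */
static double improvement_higher_is_better(double base, double patched)
{
    return (patched - base) / base * 100.0;
}
```

With the 1VM rows, (48.5800 - 46.8513) / 48.5800 gives about 3.558% and
(2089.8750 - 2414.5000) / 2414.5000 gives about -13.445%, which is the
regression in question.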

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH v4 0/8] KVM paravirt remote flush tlb

2012-09-04 Thread Nikunj A Dadhania
On Tue, 04 Sep 2012 10:51:06 +0300, Avi Kivity a...@redhat.com wrote:
 On 09/04/2012 04:30 AM, Nikunj A Dadhania wrote:
  On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity a...@redhat.com wrote:
  On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
   
   kernbench(lower is better)
   ==========================
           base  pvflushv4  %improvement
   1VM  48.5800    46.8513       3.55846
   2VM 108.1823   104.6410       3.27346
   3VM 183.2733   163.3547      10.86825
   
   ebizzy(higher is better)
   ========================
            base  pvflushv4  %improvement
   1VM 2414.5000  2089.8750     -13.44481
   2VM 2167.6250  2371.7500       9.41699
   3VM 1600.      2102.5556      31.40060
   
  
  The regression is worrying.  We're improving the contended case at the
  cost of the non-contended case, this is usually the wrong thing to do.
  Do we have any clear idea of the cause of the regression?
  
  Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
  is around 2%, while for the contended case it's around 10%. That is what is
  helping the contended case.
 
 But what is causing the regression for the uncontended case?
 
Haven't been able to nail that, any clue on how to profile would help.

Regards
Nikunj



Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread liu ping fan
On Mon, Sep 3, 2012 at 7:48 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 09:44 AM, Juan Quintela wrote:

 Hi

 Please send in any agenda items you are interested in covering.

 - protecting MemoryRegion::opaque during dispatch

 I'm guessing Ping won't make it due to timezone problems.  Jan, if you
 will not participate, please remove the topic from the list (unless
 someone else wants to argue your side).

Is there a log for this topic? Link?

Thanks,
pingfan
 --
 error compiling committee.c: too many arguments to function


Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread Avi Kivity
On 09/04/2012 11:17 AM, liu ping fan wrote:
 On Mon, Sep 3, 2012 at 7:48 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 09:44 AM, Juan Quintela wrote:

 Hi

 Please send in any agenda items you are interested in covering.

 - protecting MemoryRegion::opaque during dispatch

 I'm guessing Ping won't make it due to timezone problems.  Jan, if you
 will not participate, please remove the topic from the list (unless
 someone else wants to argue your side).

 Is there a log for this topic? Link?

The call has not happened yet (it's in 5 hours 40 minutes), but Jan
can't participate, so it will likely be cancelled.  If you can
participate, maybe we can have it anyway.


-- 
error compiling committee.c: too many arguments to function


Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread liu ping fan
On Tue, Sep 4, 2012 at 4:21 PM, Avi Kivity a...@redhat.com wrote:
 On 09/04/2012 11:17 AM, liu ping fan wrote:
 On Mon, Sep 3, 2012 at 7:48 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 09:44 AM, Juan Quintela wrote:

 Hi

 Please send in any agenda items you are interested in covering.

 - protecting MemoryRegion::opaque during dispatch

 I'm guessing Ping won't make it due to timezone problems.  Jan, if you
 will not participate, please remove the topic from the list (unless
 someone else wants to argue your side).

 Is there a log for this topic? Link?

 The call has not happened yet (it's in 5 hours 40 minutes), but Jan
 can't participate, so it will likely be cancelled.  If you can
 participate, maybe we can have it anyway.

Sorry, I can not attend.

 --
 error compiling committee.c: too many arguments to function


Re: [Qemu-devel] [PATCH 4/4] kvm: i386: Add classic PCI device assignment

2012-09-04 Thread Avi Kivity
On 09/03/2012 10:32 PM, Blue Swirl wrote:
 On Mon, Sep 3, 2012 at 4:14 PM, Avi Kivity a...@redhat.com wrote:
 On 08/29/2012 11:27 AM, Markus Armbruster wrote:

 I don't see a point in making contributors avoid non-problems that might
 conceivably become trivial problems some day.  Especially when there's
 no automated help with the avoiding.

 -Wpointer-arith
 
 +1

FWIW, I'm not in favour of enabling it, just pointing out that it
exists.  In general I prefer avoiding unnecessary use of extensions, but
in this case the extension is trivial and improves readability.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 08:46:12AM +0200, Paolo Bonzini wrote:
 On 04/09/2012 04:21, Nicholas A. Bellinger wrote:
  @@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
     struct virtio_scsi_cmd *cmd = buf;
     struct scsi_cmnd *sc = cmd->sc;
     struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
  +  struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
  +
  +  atomic_dec(&tgt->reqs);
   
  
  As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
  atomic_inc_return(&tgt->reqs) check, it seems like using atomic_dec() w/o
  smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
  accessors properly, no..?
 
 No, only a single thing is being accessed, and there is no need to
 order the decrement with respect to preceding or subsequent accesses to
 other locations.

 In other words, tgt->reqs is already synchronized with itself, and that
 is enough.

I think your logic is correct and barrier is not needed,
but this needs better documentation.

 (Besides, on x86 smp_mb__after_atomic_dec is a nop).
  +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
  + struct scsi_cmnd *sc)
  +{
  +  struct virtio_scsi *vscsi = shost_priv(sh);
  +  struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
  +  unsigned long flags;
  +  u32 queue_num;
  +
  +  /* Using an atomic_t for tgt->reqs lets the virtqueue handler
  +   * decrement it without taking the spinlock.
  +   */

Above comment is not really helpful - reader can be safely assumed to
know what atomic_t is.

Please delete, and replace with the text from commit log
that explains the heuristic used to select req_vq.

Also please add a comment near 'reqs' definition.
Something like number of outstanding requests - used to detect idle
target.


  +  spin_lock_irqsave(&tgt->tgt_lock, flags);

Looks like this lock can be removed - req_vq is only
modified when target is idle and only used when it is
not idle.

  +  if (atomic_inc_return(&tgt->reqs) == 1) {
  +  queue_num = smp_processor_id();
  +  while (unlikely(queue_num >= vscsi->num_queues))
  +  queue_num -= vscsi->num_queues;
  +  tgt->req_vq = &vscsi->req_vqs[queue_num];
  +  }
  +  spin_unlock_irqrestore(&tgt->tgt_lock, flags);
  +  return virtscsi_queuecommand(vscsi, tgt, sc);
  +}
  +
  +

.

  +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
  +   struct scsi_cmnd *sc)
  +{
  +   struct virtio_scsi *vscsi = shost_priv(sh);
  +   struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
  +
  +   atomic_inc(&tgt->reqs);
  +   return virtscsi_queuecommand(vscsi, tgt, sc);
  +}
  +

Here, reqs is unused - why bother incrementing it?
A branch on completion would be cheaper IMHO.


virtio-scsi multiqueue has a performance benefit up to 20%

To be fair, you could be running in single queue mode.
In that case extra atomics and indirection that this code
brings will just add overhead without benefits.
I don't know how significant would that be.

-- 
MST


NFS over RDMA small block DIRECT_IO bug

2012-09-04 Thread Andrew Holway
Hello.

# Avi Kivity avi(a)redhat recommended I copy kvm in on this. It would also seem 
relevant to libvirt. #

I have a Centos 6.2 server and Centos 6.2 client.

[root@store ~]# cat /etc/exports 
/dev/shm 10.149.0.0/16(rw,fsid=1,no_root_squash,insecure)

(I have tried with non-tmpfs targets also)


[root@node001 ~]# cat /etc/fstab 
store.ibnet:/dev/shm /mnt nfs rdma,port=2050,defaults 0 0


I wrote a little for loop one liner that dd'd the centos net install image to a 
file called 'hello' then checksummed that file. Each iteration uses a different 
block size.

Non-DIRECT_IO seems to work fine. With DIRECT_IO, the 512-byte, 1K and 2K block 
sizes produce corrupted files.

I want to run my KVM guests on top of NFS over RDMA. My guests cannot create 
filesystems.

Thanks,

Andrew.

bug report: https://bugzilla.linux-nfs.org/show_bug.cgi?id=228

[root@node001 mnt]# for f in 512 1024 2048 4096 8192 16384 32768 65536 131072; 
do dd bs=$f if=CentOS-6.3-x86_64-netinstall.iso of=hello iflag=direct 
oflag=direct && md5sum hello && rm -f hello; done

409600+0 records in
409600+0 records out
209715200 bytes (210 MB) copied, 62.3649 s, 3.4 MB/s
aadd0ffe3c9dfa35d8354e99ecac9276  hello <-- 512 byte block

204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 41.3876 s, 5.1 MB/s
336f6da78f93dab591edc18da81f002e  hello <-- 1K block

102400+0 records in
102400+0 records out
209715200 bytes (210 MB) copied, 21.1712 s, 9.9 MB/s
f4cefe0a05c9b47ba68effdb17dc95d6  hello <-- 2k block

51200+0 records in
51200+0 records out
209715200 bytes (210 MB) copied, 10.9631 s, 19.1 MB/s
690138908de516b6e5d7d180d085c3f3  hello <-- 4k block

25600+0 records in
25600+0 records out
209715200 bytes (210 MB) copied, 5.4136 s, 38.7 MB/s
690138908de516b6e5d7d180d085c3f3  hello

12800+0 records in
12800+0 records out
209715200 bytes (210 MB) copied, 3.1448 s, 66.7 MB/s
690138908de516b6e5d7d180d085c3f3  hello

6400+0 records in
6400+0 records out
209715200 bytes (210 MB) copied, 1.77304 s, 118 MB/s
690138908de516b6e5d7d180d085c3f3  hello

3200+0 records in
3200+0 records out
209715200 bytes (210 MB) copied, 1.4331 s, 146 MB/s
690138908de516b6e5d7d180d085c3f3  hello

1600+0 records in
1600+0 records out
209715200 bytes (210 MB) copied, 0.922167 s, 227 MB/s
690138908de516b6e5d7d180d085c3f3  hello




Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
On 04/09/2012 10:46, Michael S. Tsirkin wrote:
 +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
 + struct scsi_cmnd *sc)
 +{
 +  struct virtio_scsi *vscsi = shost_priv(sh);
 +  struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
 +  unsigned long flags;
 +  u32 queue_num;
 +
 +  /* Using an atomic_t for tgt->reqs lets the virtqueue handler
 +   * decrement it without taking the spinlock.
 +   */
 
 Above comment is not really helpful - reader can be safely assumed to
 know what atomic_t is.

Sure, the comment explains that we use an atomic because _elsewhere_ the
tgt_lock is not held while modifying reqs.

 Please delete, and replace with the text from commit log
 that explains the heuristic used to select req_vq.

Ok.

 Also please add a comment near 'reqs' definition.
 Something like number of outstanding requests - used to detect idle
 target.

Ok.

 
 +  spin_lock_irqsave(&tgt->tgt_lock, flags);
 
 Looks like this lock can be removed - req_vq is only
 modified when target is idle and only used when it is
 not idle.

If you have two incoming requests at the same time, req_vq is also
modified when the target is not idle; that's the point of the lock.

Suppose tgt->reqs = 0 initially, and you have two processors/queues.
Initially tgt->req_vq is queue #1.  If you have this:

    queuecommand on CPU #0         queuecommand #2 on CPU #1
  --------------------------------------------------------------
    atomic_inc_return(...) == 1
                                   atomic_inc_return(...) == 2
                                   virtscsi_queuecommand to queue #1
    tgt->req_vq = queue #0
    virtscsi_queuecommand to queue #0

then two requests are issued to different queues without a quiescent
point in the middle.
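
A simplified user-space model of the selection rule, as a sketch only: it
uses C11 atomics with a spin flag standing in for tgt_lock, all names are
hypothetical, and unlike the patch it reads the chosen queue while still
holding the lock. The point it illustrates is that the increment and the
rebinding of the queue happen as one step, so a second submitter can never
observe the old queue after the counter has left zero.

```c
#include <stdatomic.h>

struct target {
    atomic_flag lock;       /* stands in for tgt_lock */
    atomic_int reqs;        /* outstanding requests */
    unsigned int req_vq;    /* index of the queue currently bound */
};

static void target_init(struct target *t)
{
    atomic_flag_clear(&t->lock);
    atomic_init(&t->reqs, 0);
    t->req_vq = 0;
}

/* Returns the queue this request must be submitted to. */
static unsigned int target_pick_queue(struct target *t, unsigned int cpu,
                                      unsigned int num_queues)
{
    unsigned int vq;

    while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
        ;   /* spin until the lock is free */
    if (atomic_fetch_add(&t->reqs, 1) + 1 == 1)
        t->req_vq = cpu % num_queues;   /* target was idle: rebind */
    vq = t->req_vq;
    atomic_flag_clear_explicit(&t->lock, memory_order_release);
    return vq;
}

/* Completion path: no lock needed, only the counter is touched. */
static void target_complete(struct target *t)
{
    atomic_fetch_sub(&t->reqs, 1);
}
```

In this model a request arriving while others are outstanding always
follows them onto the same queue; the queue only changes once the counter
has dropped back to zero.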

 +  if (atomic_inc_return(&tgt->reqs) == 1) {
 +  queue_num = smp_processor_id();
 +  while (unlikely(queue_num >= vscsi->num_queues))
 +  queue_num -= vscsi->num_queues;
 +  tgt->req_vq = &vscsi->req_vqs[queue_num];
 +  }
 +  spin_unlock_irqrestore(&tgt->tgt_lock, flags);
 +  return virtscsi_queuecommand(vscsi, tgt, sc);
 +}
 +
 +
 
 .
 
 +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
 +   struct scsi_cmnd *sc)
 +{
 +   struct virtio_scsi *vscsi = shost_priv(sh);
 +   struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
 +
 +   atomic_inc(&tgt->reqs);
 +   return virtscsi_queuecommand(vscsi, tgt, sc);
 +}
 +
 
 Here, reqs is unused - why bother incrementing it?
 A branch on completion would be cheaper IMHO.

Well, I could also let tgt->reqs go negative, but it would be a bit untidy.

Another alternative is to access the target's target_busy field with
ACCESS_ONCE, and drop reqs altogether.  Too tricky to do this kind of
micro-optimization so early, though.

 virtio-scsi multiqueue has a performance benefit up to 20%
 
 To be fair, you could be running in single queue mode.
 In that case extra atomics and indirection that this code
 brings will just add overhead without benefits.
 I don't know how significant would that be.

Not measurable in my experiments.

Paolo


Re: [PATCH v4 0/8] KVM paravirt remote flush tlb

2012-09-04 Thread Avi Kivity
On 09/04/2012 11:08 AM, Nikunj A Dadhania wrote:
 On Tue, 04 Sep 2012 10:51:06 +0300, Avi Kivity a...@redhat.com wrote:
 On 09/04/2012 04:30 AM, Nikunj A Dadhania wrote:
  On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity a...@redhat.com wrote:
  On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
   
   kernbench(lower is better)
   ==========================
           base  pvflushv4  %improvement
   1VM  48.5800    46.8513       3.55846
   2VM 108.1823   104.6410       3.27346
   3VM 183.2733   163.3547      10.86825
   
   ebizzy(higher is better)
   ========================
            base  pvflushv4  %improvement
   1VM 2414.5000  2089.8750     -13.44481
   2VM 2167.6250  2371.7500       9.41699
   3VM 1600.      2102.5556      31.40060
   
  
  The regression is worrying.  We're improving the contended case at the
  cost of the non-contended case, this is usually the wrong thing to do.
  Do we have any clear idea of the cause of the regression?
  
  Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
  is around 2%, while for the contended case it's around 10%. That is what is
  helping the contended case.
 
 But what is causing the regression for the uncontended case?
 
 Haven't been able to nail that, any clue on how to profile would help.

perf top, perf kvm top, kvm_stat should help.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 12:25:03PM +0200, Paolo Bonzini wrote:
 On 04/09/2012 10:46, Michael S. Tsirkin wrote:
  +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
  +   struct scsi_cmnd *sc)
  +{
  +struct virtio_scsi *vscsi = shost_priv(sh);
  +struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
  +unsigned long flags;
  +u32 queue_num;
  +
  +/* Using an atomic_t for tgt->reqs lets the virtqueue handler
  + * decrement it without taking the spinlock.
  + */
  
  Above comment is not really helpful - reader can be safely assumed to
  know what atomic_t is.
 
 Sure, the comment explains that we use an atomic because _elsewhere_ the
 tgt_lock is not held while modifying reqs.
 
  Please delete, and replace with the text from commit log
  that explains the heuristic used to select req_vq.
 
 Ok.
 
  Also please add a comment near 'reqs' definition.
  Something like number of outstanding requests - used to detect idle
  target.
 
 Ok.
 
  
  +spin_lock_irqsave(&tgt->tgt_lock, flags);
  
  Looks like this lock can be removed - req_vq is only
  modified when target is idle and only used when it is
  not idle.
 
 If you have two incoming requests at the same time, req_vq is also
 modified when the target is not idle; that's the point of the lock.
 
 Suppose tgt->reqs = 0 initially, and you have two processors/queues.
 Initially tgt->req_vq is queue #1.  If you have this:
 
     queuecommand on CPU #0         queuecommand #2 on CPU #1
   --------------------------------------------------------------
     atomic_inc_return(...) == 1
                                    atomic_inc_return(...) == 2
                                    virtscsi_queuecommand to queue #1
     tgt->req_vq = queue #0
     virtscsi_queuecommand to queue #0
 
 then two requests are issued to different queues without a quiescent
 point in the middle.

What happens then? Does this break correctness?

  +if (atomic_inc_return(&tgt->reqs) == 1) {
  +queue_num = smp_processor_id();
  +while (unlikely(queue_num >= vscsi->num_queues))
  +queue_num -= vscsi->num_queues;
  +tgt->req_vq = &vscsi->req_vqs[queue_num];
  +}
  +spin_unlock_irqrestore(&tgt->tgt_lock, flags);
  +return virtscsi_queuecommand(vscsi, tgt, sc);
  +}
  +
  +
  
  .
  
  +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
  +   struct scsi_cmnd *sc)
  +{
  +   struct virtio_scsi *vscsi = shost_priv(sh);
  +   struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
  +
  +   atomic_inc(&tgt->reqs);
  +   return virtscsi_queuecommand(vscsi, tgt, sc);
  +}
  +
  
  Here, reqs is unused - why bother incrementing it?
  A branch on completion would be cheaper IMHO.
 
 Well, I could also let tgt->reqs go negative, but it would be a bit untidy.
 
 Another alternative is to access the target's target_busy field with
 ACCESS_ONCE, and drop reqs altogether.  Too tricky to do this kind of
 micro-optimization so early, though.

So keep it simple and just check a flag.

  virtio-scsi multiqueue has a performance benefit up to 20%
  
  To be fair, you could be running in single queue mode.
  In that case extra atomics and indirection that this code
  brings will just add overhead without benefits.
  I don't know how significant would that be.
 
 Not measurable in my experiments.
 
 Paolo


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
On 04/09/2012 13:09, Michael S. Tsirkin wrote:
      queuecommand on CPU #0         queuecommand #2 on CPU #1
    --------------------------------------------------------------
      atomic_inc_return(...) == 1
                                     atomic_inc_return(...) == 2
                                     virtscsi_queuecommand to queue #1
      tgt->req_vq = queue #0
      virtscsi_queuecommand to queue #0
  
  then two requests are issued to different queues without a quiescent
  point in the middle.
 What happens then? Does this break correctness?

Yes, requests to the same target should be processed in FIFO order, or
you have things like a flush issued before the write it was supposed to
flush.  This is why I can only change the queue when there is no request
pending.

Paolo


Re: [libvirt-users] vm pxe fail

2012-09-04 Thread Alex Jia
- Original Message -
From: Avi Kivity a...@redhat.com
To: Alex Jia a...@redhat.com
Cc: Andrew Holway a.hol...@syseleven.de, kvm@vger.kernel.org
Sent: Monday, September 3, 2012 9:27:08 PM
Subject: Re: [libvirt-users] vm pxe fail

On 08/31/2012 05:37 PM, Alex Jia wrote:
 Hi Andrew,
 Great, BTW, in fact, you may pxe boot via VF of Intel82576, however, 
 Intel82576 SR-IOV network adapters 
 don't provide a ROM BIOS for the cards virtual functions (VF), but an image 
 of such a ROM is available, 
 and with this ROM visible to the guest, it can PXE boot.
 
 In libvirt's xml, you need to configure guest XML like this:
 
   <hostdev mode='subsystem' type='pci' managed='yes'>
     <source>
       <address bus='XX' slot='XX' function='XX'/>
     </source>
     <boot order='1'/>
     <rom bar='on' file='//ipxe-808610ca.rom'/>
   </hostdev>
 
 You need to build a ipxe-808610ca.rom by yourself, if you're interested in 
 this,
 please refer to http://ipxe.org/.

Is there a way to automate this?  Perhaps a database matching PCI IDs
and ipxe .roms, which qemu could consult?

   Hi Avi,
   Good question. I haven't tried this via qemu yet. From the libvirt POV, we can
   filter and parse the output of 'lspci' or 'virsh nodedev-list --tree' to get the
   bus, slot and function numbers, then add them to the guest XML above. As for the
   'ipxe-808610ca.rom' file, we can 'git clone git://git.ipxe.org/ipxe.git', compile
   it, and generate a .rom file such as 82576.rom, or use the vendor+product ID as
   the ROM name if you like.

   Regards,
   Alex

-- 
error compiling committee.c: too many arguments to function


Re: [libvirt-users] vm pxe fail

2012-09-04 Thread Avi Kivity
On 09/04/2012 02:31 PM, Alex Jia wrote:
 - Original Message -
 From: Avi Kivity a...@redhat.com
 To: Alex Jia a...@redhat.com
 Cc: Andrew Holway a.hol...@syseleven.de, kvm@vger.kernel.org
 Sent: Monday, September 3, 2012 9:27:08 PM
 Subject: Re: [libvirt-users] vm pxe fail
 
 On 08/31/2012 05:37 PM, Alex Jia wrote:
 Hi Andrew,
 Great, BTW, in fact, you may pxe boot via VF of Intel82576, however, 
 Intel82576 SR-IOV network adapters 
 don't provide a ROM BIOS for the cards virtual functions (VF), but an image 
 of such a ROM is available, 
 and with this ROM visible to the guest, it can PXE boot.
 
 In libvirt's xml, you need to configure guest XML like this:
 
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address bus='XX' slot='XX' function='XX'/>
      </source>
      <boot order='1'/>
      <rom bar='on' file='//ipxe-808610ca.rom'/>
    </hostdev>
 
 You need to build a ipxe-808610ca.rom by yourself, if you're interested in 
 this,
 please refer to http://ipxe.org/.
 
 Is there a way to automate this?  Perhaps a database matching PCI IDs
 and ipxe .roms, which qemu could consult?
 
Hi Avi,
Good question, I haven't try this via qemu yet, from libvirt POV, 
 basically, we may filter and parse 'lspci'
or 'virsh nodedev-list --tree' output to get a bus, slot and function 
 number then add them into above guest
XML, WRT above 'ipxe-808610ca.rom' file, we may directly 'git clone 
 git://git.ipxe.org/ipxe.git' then compile
it and generate a .rom file such as 82576.rom or use a vendor+product id 
 as a rom name if you like.

We could have qemu autoload /usr/share/qemu/roms/vendor-device.rom, and
symlink /usr/share/qemu/roms to /usr/share/ipxe/roms or something.

-- 
error compiling committee.c: too many arguments to function


Re: [RFC 0/5] Making KVM_GET_ONE_REG/KVM_SET_ONE_REG generic.

2012-09-04 Thread Avi Kivity
On 09/03/2012 03:33 PM, Rusty Russell wrote:
 Avi Kivity a...@redhat.com writes:
 On 09/01/2012 03:35 PM, Rusty Russell wrote:
 Passing an address in a struct is pretty bad, since it involves
 compatibility wrappers.  

 Right, some s390 thing.
 
 Err, no, i386 on x86-64, or ppc32 on ppc64, or arm on arm64
 
 Any time you put a pointer in a structure which is exposed to userspace,
 you have to deal with this.

Not if you pack the pointer in a __u64, which is what we do to preserve
padding.  Then it is only s390 which needs extra love.
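
A small sketch of that packing, with a hypothetical struct that is not meant
to be the actual KVM_GET_ONE_REG/KVM_SET_ONE_REG ABI: because the address is
carried in a fixed-width 64-bit field, the layout is the same for 32-bit and
64-bit userspace, and no compat wrapper has to translate it.

```c
#include <stdint.h>

/* Hypothetical request layout, for illustration only. */
struct one_reg_request {
    uint64_t id;    /* which register */
    uint64_t addr;  /* user-space pointer carried as a 64-bit integer */
};

/* Userspace side: widen the pointer into the fixed-size field. */
static void request_set_addr(struct one_reg_request *req, void *buf)
{
    req->addr = (uint64_t)(uintptr_t)buf;
}

/* Receiving side: narrow the field back into a native pointer. */
static void *request_get_addr(const struct one_reg_request *req)
{
    return (void *)(uintptr_t)req->addr;
}
```

A plain `void *` member in the same place would make the struct 12 bytes on
a typical 32-bit ABI and 16 bytes on a 64-bit one, which is exactly the
mismatch the compatibility wrappers exist to paper over.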

 I don't think that is what makes the API hard
 to use.

 What is it then?  I forgot what the original complaints/complainers were.
 
 I have no idea, since I didn't hear the complaints.  But any non-fixed
 size array has issues in C; there's not much we can do about it.
 
 x86 manages this fine for msrs, and I didn't have a problem using it for
 my test programs.  That's the limit of my experience, however.

Another option is to use the size parameter from the ioctl.  It just
sits there doing nothing.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RESEND] KVM: cleanup pic reset

2012-09-04 Thread Avi Kivity
On 09/03/2012 02:47 PM, Gleb Natapov wrote:
 kvm_pic_reset() is not used anywhere. Move reset logic from
 pic_ioport_write() there.

Applied, thanks.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] KVM: x86: Check INVPCID feature bit in EBX of leaf 7

2012-09-04 Thread Avi Kivity
On 09/01/2012 11:12 AM, Mao, Junjie wrote:
 Checks and operations on the INVPCID feature bit should use EBX of CPUID leaf 7
 instead of ECX.
 
 Signed-off-by: Junjie Mao junjie@intel.com
 ---
  arch/x86/kvm/vmx.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index c00f03d..002b4a5 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -6575,7 +6575,7 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 /* Exposing INVPCID only when PCID is exposed */
 best = kvm_find_cpuid_entry(vcpu, 0x7, 0);
 if (vmx_invpcid_supported() &&
 -   best && (best->ecx & bit(X86_FEATURE_INVPCID)) &&
 +   best && (best->ebx & bit(X86_FEATURE_INVPCID)) &&
 guest_cpuid_has_pcid(vcpu)) {
 exec_control |= SECONDARY_EXEC_ENABLE_INVPCID;
 vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
 @@ -6585,7 +6585,7 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
  exec_control);
 if (best)
 -   best->ecx &= ~bit(X86_FEATURE_INVPCID);
 +   best->ebx &= ~bit(X86_FEATURE_INVPCID);
 }
  }
 

Patch is whitespace damaged, please fix.



-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] Restore cr3 after tests on PCID

2012-09-04 Thread Avi Kivity
On 09/01/2012 11:12 AM, Mao, Junjie wrote:
 The INVPCID enabling test assumes cr3[11:0] is 0. But at present PCID enabling
 test sets cr3[11:0] to 1 for its own purpose and doesn't restore the register,
 which leads to a failure when INVPCID test tries to enable PCIDE.
 
 This patch restores cr3 after PCID enabling test is done so that PCIDE can be
 enabled normally in later tests.

Thanks, applied.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/1] kvm: Use vcpu_id as pivot instead of last boosted vcpu in PLE handler

2012-09-04 Thread Raghavendra K T

On 09/02/2012 09:59 PM, Rik van Riel wrote:

On 09/02/2012 06:12 AM, Gleb Natapov wrote:

On Thu, Aug 30, 2012 at 12:51:01AM +0530, Raghavendra K T wrote:

The idea of starting from the next vcpu (source of yield_to + 1) seems to
work well for overcommitted guests, rather than using the last boosted vcpu.
We can also remove a per-VM variable with this approach.

Iteration for an eligible candidate after this patch starts from vcpu
source+1 and ends at source-1 (after wrapping).

Thanks Nikunj for his quick verification of the patch.

Please let me know if this patch is interesting and makes sense.


This last_boosted_vcpu thing caused us trouble during attempt to
implement vcpu destruction. It is good to see it removed from this POV.


I like this implementation. It should achieve pretty much
the same as my old code, but without the downsides and without
having to keep the same amount of global state.



My theoretical understanding of how it would help is:

  |
  V
T0 --- T1

Suppose 4 vcpus (v1..v4) out of 32/64 vcpus simultaneously
enter the directed yield handler.

If last_boosted_vcpu = i, then v1..v4 will all start from i, and there may
be some unnecessary attempts at directed yields.

We may not see such attempts with the above patch. But again, I agree that
the whole directed_yield logic is very complicated, because each vcpu may be
in a different state (running / pause-loop exited while spinning / eligible)
and because of how they are located w.r.t. each other.
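
To make the iteration order concrete, here is a minimal sketch in plain C. It is not the patch itself: yield_candidates() and its parameters are hypothetical names, and the real handler also checks eligibility of each candidate, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not kernel code): record the order in which
 * candidate vcpus are visited when vcpu `source` wants to yield.
 * The walk starts at source + 1 and wraps around to source - 1,
 * so the source itself is never a candidate.
 * Returns the number of candidates written to `order`.
 */
size_t yield_candidates(size_t source, size_t nr_vcpus, size_t *order)
{
	size_t n = 0;

	assert(nr_vcpus > 0 && source < nr_vcpus);
	for (size_t step = 1; step < nr_vcpus; step++)
		order[n++] = (source + step) % nr_vcpus;
	return n;
}
```

With 4 vcpus and source 2, the candidates are visited as 3, 0, 1. Because each yielding vcpu starts from its own successor, vcpus that enter the handler at the same time do not all begin at the same index.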


Here is the result I got for ebizzy, 32 vcpu guest 32 core PLE machine
for 1x 2x and 3x overcommits.

base = 3.5-rc5 kernel with ple handler improvements patches applied
patched = base + vcpuid patch

      base        stdev      patched     stdev      %improvement
1x    1955.6250    39.8961   1863.3750    37.8302    -4.71716
2x    2475.3750   165.0307   3078.8750   341.9500    24.38014
3x    2071.5556    91.5370   2112.6667    56.6171     1.98455
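
The %improvement column appears consistent with (patched - base) / base * 100. A minimal sketch of that arithmetic, where pct_improvement() is a hypothetical helper and not part of ebizzy or any benchmark tool:

```c
#include <assert.h>

/* Hypothetical helper: relative change of `patched` over `base`, in percent. */
double pct_improvement(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

For the 2x row, pct_improvement(2475.3750, 3078.8750) comes out at roughly 24.38, and for the 1x row the result is roughly -4.72, in line with the table.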

Note:
I have to admit that I am seeing very inconsistent results while
experimenting with the 3.6-rc kernel (not specific to the vcpuid patch, but
as a whole), and I am not sure whether something is wrong in my config or
whether I should spend some time debugging. Has anybody observed the same?




Re: NFS over RDMA small block DIRECT_IO bug

2012-09-04 Thread Myklebust, Trond
On Tue, 2012-09-04 at 11:31 +0200, Andrew Holway wrote:
 Hello.
 
 # Avi Kivity avi(a)redhat recommended I copy kvm in on this. It would also 
 seem relevant to libvirt. #
 
 I have a Centos 6.2 server and Centos 6.2 client.
 
 [root@store ~]# cat /etc/exports 
 /dev/shm  10.149.0.0/16(rw,fsid=1,no_root_squash,insecure)
 (I have tried with non-tmpfs targets also)
 
 
 [root@node001 ~]# cat /etc/fstab 
 store.ibnet:/dev/shm /mnt nfs  rdma,port=2050,defaults 0 0
 
 
 I wrote a little for loop one liner that dd'd the centos net install image to 
 a file called 'hello' then checksummed that file. Each iteration uses a 
 different block size.
 
 Non DIRECT_IO seems to work fine. DIRECT_IO with 512byte, 1K and 2K block 
 sizes get corrupted.


That is expected behaviour. DIRECT_IO over RDMA needs to be page aligned
so that it can use the more efficient RDMA READ and RDMA WRITE memory
semantics (instead of the SEND/RECEIVE channel semantics).
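
For application code that issues the direct I/O itself, the alignment requirement can be sketched as below. This is only an illustration under stated assumptions: a 4 KiB page size (as on x86) is hard-coded, and round_up_to_page(), alloc_dio_buffer() and is_page_aligned() are hypothetical helper names. It shows how to obtain a page-aligned, page-sized buffer; it does not claim to be sufficient for every DIRECT_IO-over-RDMA setup.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ASSUMED_PAGE_SIZE 4096u	/* assumption: 4 KiB pages, as on x86 */

/* Round a transfer length up to a whole number of pages. */
size_t round_up_to_page(size_t len)
{
	return (len + ASSUMED_PAGE_SIZE - 1) / ASSUMED_PAGE_SIZE * ASSUMED_PAGE_SIZE;
}

/*
 * Allocate a buffer whose address and size are both page multiples.
 * Uses C11 aligned_alloc(); the caller releases it with free().
 */
void *alloc_dio_buffer(size_t len)
{
	assert(len > 0);
	return aligned_alloc(ASSUMED_PAGE_SIZE, round_up_to_page(len));
}

/* Check whether a pointer sits on a page boundary. */
int is_page_aligned(const void *p)
{
	return ((uintptr_t)p % ASSUMED_PAGE_SIZE) == 0;
}
```

A 512-byte request would thus be backed by one full 4096-byte page, which matches the observation in the report that block sizes of 4k and above produce a consistent checksum.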

 I want to run my KVM guests on top of NFS over RDMA. My guests cannot create 
 filesystems.
 
 Thanks,
 
 Andrew.
 
 bug report: https://bugzilla.linux-nfs.org/show_bug.cgi?id=228
 
 [root@node001 mnt]# for f in 512 1024 2048 4096 8192 16384 32768 65536 
 131072; do dd bs=$f if=CentOS-6.3-x86_64-netinstall.iso of=hello 
 iflag=direct oflag=direct && md5sum hello && rm -f hello; done
 
 409600+0 records in
 409600+0 records out
 209715200 bytes (210 MB) copied, 62.3649 s, 3.4 MB/s
 aadd0ffe3c9dfa35d8354e99ecac9276  hello -- 512 byte block 
 
 204800+0 records in
 204800+0 records out
 209715200 bytes (210 MB) copied, 41.3876 s, 5.1 MB/s
 336f6da78f93dab591edc18da81f002e  hello -- 1K block
 
 102400+0 records in
 102400+0 records out
 209715200 bytes (210 MB) copied, 21.1712 s, 9.9 MB/s
 f4cefe0a05c9b47ba68effdb17dc95d6  hello -- 2k block
 
 51200+0 records in
 51200+0 records out
 209715200 bytes (210 MB) copied, 10.9631 s, 19.1 MB/s
 690138908de516b6e5d7d180d085c3f3  hello -- 4k block
 
 25600+0 records in
 25600+0 records out
 209715200 bytes (210 MB) copied, 5.4136 s, 38.7 MB/s
 690138908de516b6e5d7d180d085c3f3  hello
 
 12800+0 records in
 12800+0 records out
 209715200 bytes (210 MB) copied, 3.1448 s, 66.7 MB/s
 690138908de516b6e5d7d180d085c3f3  hello
 
 6400+0 records in
 6400+0 records out
 209715200 bytes (210 MB) copied, 1.77304 s, 118 MB/s
 690138908de516b6e5d7d180d085c3f3  hello
 
 3200+0 records in
 3200+0 records out
 209715200 bytes (210 MB) copied, 1.4331 s, 146 MB/s
 690138908de516b6e5d7d180d085c3f3  hello
 
 1600+0 records in
 1600+0 records out
 209715200 bytes (210 MB) copied, 0.922167 s, 227 MB/s
 690138908de516b6e5d7d180d085c3f3  hello
 
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-nfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com



Re: [PATCH 2/8] KVM: x86 emulator: use aligned variants of SSE register ops

2012-09-04 Thread Avi Kivity
On 08/30/2012 02:30 AM, Mathias Krause wrote:
 As the compiler ensures that the memory operand is always aligned
 to a 16 byte memory location, 

I'm not sure it does.  Is V4SI aligned?  Do we use alignof() to
propagate the alignment to the vcpu allocation code?

 use the aligned variant of MOVDQ for
 read_sse_reg() and write_sse_reg().
 
 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index 1451cff..5a0fee1 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -909,23 +909,23 @@ static void read_sse_reg(struct x86_emulate_ctxt *ctxt, 
 sse128_t *data, int reg)
  {
   ctxt->ops->get_fpu(ctxt);
   switch (reg) {
 - case 0: asm("movdqu %%xmm0, %0" : "=m"(*data)); break;
 - case 1: asm("movdqu %%xmm1, %0" : "=m"(*data)); break;
 - case 2: asm("movdqu %%xmm2, %0" : "=m"(*data)); break;
 - case 3: asm("movdqu %%xmm3, %0" : "=m"(*data)); break;
 - case 4: asm("movdqu %%xmm4, %0" : "=m"(*data)); break;
 - case 5: asm("movdqu %%xmm5, %0" : "=m"(*data)); break;
 - case 6: asm("movdqu %%xmm6, %0" : "=m"(*data)); break;
 - case 7: asm("movdqu %%xmm7, %0" : "=m"(*data)); break;
 + case 0: asm("movdqa %%xmm0, %0" : "=m"(*data)); break;
 + case 1: asm("movdqa %%xmm1, %0" : "=m"(*data)); break;
 + case 2: asm("movdqa %%xmm2, %0" : "=m"(*data)); break;
 + case 3: asm("movdqa %%xmm3, %0" : "=m"(*data)); break;
 + case 4: asm("movdqa %%xmm4, %0" : "=m"(*data)); break;
 + case 5: asm("movdqa %%xmm5, %0" : "=m"(*data)); break;
 + case 6: asm("movdqa %%xmm6, %0" : "=m"(*data)); break;
 + case 7: asm("movdqa %%xmm7, %0" : "=m"(*data)); break;
  #ifdef CONFIG_X86_64
 - case 8: asm("movdqu %%xmm8, %0" : "=m"(*data)); break;
 - case 9: asm("movdqu %%xmm9, %0" : "=m"(*data)); break;
 - case 10: asm("movdqu %%xmm10, %0" : "=m"(*data)); break;
 - case 11: asm("movdqu %%xmm11, %0" : "=m"(*data)); break;
 - case 12: asm("movdqu %%xmm12, %0" : "=m"(*data)); break;
 - case 13: asm("movdqu %%xmm13, %0" : "=m"(*data)); break;
 - case 14: asm("movdqu %%xmm14, %0" : "=m"(*data)); break;
 - case 15: asm("movdqu %%xmm15, %0" : "=m"(*data)); break;
 + case 8: asm("movdqa %%xmm8, %0" : "=m"(*data)); break;
 + case 9: asm("movdqa %%xmm9, %0" : "=m"(*data)); break;
 + case 10: asm("movdqa %%xmm10, %0" : "=m"(*data)); break;
 + case 11: asm("movdqa %%xmm11, %0" : "=m"(*data)); break;
 + case 12: asm("movdqa %%xmm12, %0" : "=m"(*data)); break;
 + case 13: asm("movdqa %%xmm13, %0" : "=m"(*data)); break;
 + case 14: asm("movdqa %%xmm14, %0" : "=m"(*data)); break;
 + case 15: asm("movdqa %%xmm15, %0" : "=m"(*data)); break;
  #endif
   default: BUG();


The vmexit cost dominates any win here by several orders of magnitude.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 2/8] KVM: x86 emulator: use aligned variants of SSE register ops

2012-09-04 Thread Avi Kivity
On 09/04/2012 03:09 PM, Avi Kivity wrote:
 On 08/30/2012 02:30 AM, Mathias Krause wrote:
 As the compiler ensures that the memory operand is always aligned
 to a 16 byte memory location, 
 
 I'm not sure it does.  Is V4SI aligned?  Do we use alignof() to
 propagate the alignment to the vcpu allocation code?

We actually do.  But please rebase the series against next, I got some
conflicts while applying.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] KVM: PIC: fix use of uninitialised variable.

2012-09-04 Thread Avi Kivity
On 08/30/2012 01:32 PM, Jamie Iles wrote:
 Commit aea218f3cbbc (KVM: PIC: call ack notifiers for irqs that are
 dropped form irr) used an uninitialised variable to track whether an
 appropriate apic had been found.  This could result in calling the ack
 notifier incorrectly.

Thanks, applied to master for 3.6.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Aug 28, 2012 at 01:54:17PM +0200, Paolo Bonzini wrote:
 This patch adds queue steering to virtio-scsi.  When a target is sent
 multiple requests, we always drive them to the same queue so that FIFO
 processing order is kept.  However, if a target was idle, we can choose
 a queue arbitrarily.  In this case the queue is chosen according to the
 current VCPU, so the driver expects the number of request queues to be
 equal to the number of VCPUs.  This makes it easy and fast to select
 the queue, and also lets the driver optimize the IRQ affinity for the
 virtqueues (each virtqueue's affinity is set to the CPU that owns
 the queue).
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com

I guess an alternative is a per-target vq.
Is the reason you avoid this that you expect more targets
than cpus? If yes this is something you might want to
mention in the log.


Re: [PATCH 2/8] KVM: x86 emulator: use aligned variants of SSE register ops

2012-09-04 Thread Mathias Krause
On Tue, Sep 4, 2012 at 2:13 PM, Avi Kivity a...@redhat.com wrote:
 On 09/04/2012 03:09 PM, Avi Kivity wrote:
 On 08/30/2012 02:30 AM, Mathias Krause wrote:
 As the compiler ensures that the memory operand is always aligned
 to a 16 byte memory location,

 I'm not sure it does.  Is V4SI aligned?  Do we use alignof() to
 propagate the alignment to the vcpu allocation code?

I checked that too, by introducing a dummy char member in struct operand
that would have misaligned vec_val, but indeed the compiler ensured
it's still 16 byte aligned.
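
The kind of check described above can be reproduced outside the kernel with a small sketch. The names here are stand-ins: v4si_example approximates a 16-byte vector type using the GCC/Clang vector_size extension, and struct operand_example is a hypothetical reduction of struct operand, not the real definition.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a 128-bit SSE value type (GCC/Clang vector extension). */
typedef unsigned int v4si_example __attribute__((vector_size(16)));

/* Hypothetical reduction of struct operand with a misaligning member. */
struct operand_example {
	char dummy;		/* would push vec_val to offset 1 without padding */
	v4si_example vec_val;
};

/* The compiler pads after `dummy`, so this holds at build time. */
static_assert(offsetof(struct operand_example, vec_val) % 16 == 0,
	      "vec_val is expected to stay 16-byte aligned");

/* Offset of vec_val inside the struct, for inspection at run time. */
size_t vec_val_offset(void)
{
	return offsetof(struct operand_example, vec_val);
}

/* Alignment the compiler requires for the whole struct. */
size_t operand_alignment(void)
{
	return _Alignof(struct operand_example);
}
```

This only shows that the member offset and the struct's required alignment are multiples of 16. Whether an object is actually placed at such an address still depends on the allocator honouring that alignment, which is the second half of the question raised earlier in the thread.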


 We actually do.  But please rebase the series against next, I got some
 conflicts while applying.

If next means kvm/next
(i.e. git://git.kernel.org/pub/scm/virt/kvm/kvm.git#next) here, the
whole series applies cleanly for me.
HEAD in kvm/next is 9a78197 "KVM: x86: remove unused variable from
kvm_task_switch()" here. Albeit the series was built against kvm/next
at the time when a81aba1 "KVM: VMX: Ignore segment G and D bits when
considering whether we can virtualize" was HEAD in this branch.

Could you please retry and show me the conflicts you get?


Regards,
Mathias


Re: ping latency using vhost_net, macvtap and virtio

2012-09-04 Thread Avi Kivity
On 08/29/2012 11:34 AM, Pozsár Balázs wrote:
 
 Hi all,
 
 I have been testing network throughput and latency and I was wondering
 if my measurements are as expected.
 For the test, I used Fedora 17 for both host and guest, using kernel
 3.5.2-3.fc17.86_64.
 
 Pinging an external server on the LAN from the host, using a gigabit
 interface, the results are:
 # ping -c 10 172.16.1.1
 PING 172.16.1.1 (172.16.1.1) 56(84) bytes of data.
 64 bytes from 172.16.1.1: icmp_req=1 ttl=64 time=0.109 ms
 64 bytes from 172.16.1.1: icmp_req=2 ttl=64 time=0.131 ms
 64 bytes from 172.16.1.1: icmp_req=3 ttl=64 time=0.145 ms
 64 bytes from 172.16.1.1: icmp_req=4 ttl=64 time=0.116 ms
 64 bytes from 172.16.1.1: icmp_req=5 ttl=64 time=0.110 ms
 64 bytes from 172.16.1.1: icmp_req=6 ttl=64 time=0.114 ms
 64 bytes from 172.16.1.1: icmp_req=7 ttl=64 time=0.112 ms
 64 bytes from 172.16.1.1: icmp_req=8 ttl=64 time=0.117 ms
 64 bytes from 172.16.1.1: icmp_req=9 ttl=64 time=0.119 ms
 64 bytes from 172.16.1.1: icmp_req=10 ttl=64 time=0.128 ms
 
 --- 172.16.1.1 ping statistics ---
 10 packets transmitted, 10 received, 0% packet loss, time 8999ms
 rtt min/avg/max/mdev = 0.109/0.120/0.145/0.011 ms
 
 
 Pinging the same external host on the LAN from the guest, the latency
 seems to be much higher:
 # ping -c 10 172.16.1.1
 PING 172.16.1.1 (172.16.1.1) 56(84) bytes of data.
 64 bytes from 172.16.1.1: icmp_req=1 ttl=64 time=0.206 ms
 64 bytes from 172.16.1.1: icmp_req=2 ttl=64 time=0.352 ms
 64 bytes from 172.16.1.1: icmp_req=3 ttl=64 time=0.518 ms
 64 bytes from 172.16.1.1: icmp_req=4 ttl=64 time=0.351 ms
 64 bytes from 172.16.1.1: icmp_req=5 ttl=64 time=0.543 ms
 64 bytes from 172.16.1.1: icmp_req=6 ttl=64 time=0.387 ms
 64 bytes from 172.16.1.1: icmp_req=7 ttl=64 time=0.348 ms
 64 bytes from 172.16.1.1: icmp_req=8 ttl=64 time=0.364 ms
 64 bytes from 172.16.1.1: icmp_req=9 ttl=64 time=0.345 ms
 64 bytes from 172.16.1.1: icmp_req=10 ttl=64 time=0.334 ms
 
 --- 172.16.1.1 ping statistics ---
 10 packets transmitted, 10 received, 0% packet loss, time 8999ms
 rtt min/avg/max/mdev = 0.206/0.374/0.543/0.093 ms
 
 
 The LAN, the host and guest are idle otherwise during the tests.
 There are no iptables rules active.
 The vhost_net and macvtap modules are loaded on the host, and qemu was
 started (by libvirtd) with the -netdev vhost=on option.
 The guest is using the virtio_net driver.
 
 Is this expected and normal, or do others see better latencies? Can I
 try anything to make it better?

We've seen this; in at least one case the problem was due to the extra
threads needed for virtualization: each one of them sits on a core, and
if that core is in a deep C-state it will take quite a while to wake up.

You can verify this by booting the host with idle=poll on the kernel
command line, or simply running some load in the background.


-- 
error compiling committee.c: too many arguments to function


Re: NFS over RDMA small block DIRECT_IO bug

2012-09-04 Thread Andrew Holway
 
 That is expected behaviour. DIRECT_IO over RDMA needs to be page aligned
 so that it can use the more efficient RDMA READ and RDMA WRITE memory
 semantics (instead of the SEND/RECEIVE channel semantics).

Yes, I think I am understanding that now.

I need to find a way of getting around the libvirt issue.

http://lists.gnu.org/archive/html/qemu-devel/2011-12/msg01570.html

Thanks,

Andrew


 


Re: expanding virtual disk based on lvm

2012-09-04 Thread Avi Kivity
On 08/28/2012 11:26 PM, Ross Boylan wrote:
 My vm launches with -hda /dev/turtle/VD0 -hdb /dev/turtle/VD1, where VD0
 and VD1 are lvm logical volumes.  I used lvextend to expand them, but
 the VM, started after the expansion, does not seem to see the extra
 space.
 
 What do I need to do so that the space will be recognized?

IDE (-hda) does not support rechecking the size.  Try booting with
virtio-blk.  Additionally, you may need to request the guest to rescan
the drive (no idea how to do that).  Nor am I sure whether qemu will
emulate the request correctly.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 1/2] kvm tools: Export DISPLAY ENV as our default host ip address

2012-09-04 Thread Avi Kivity
On 08/24/2012 02:29 PM, Asias He wrote:
 It is useful to run an X program in the guest and display it on the host.
 
 1) Make host's x server listen to localhost:6000
host_shell$ socat -d -d TCP-LISTEN:6000,fork,bind=localhost \
UNIX-CONNECT:/tmp/.X11-unix/X0
 
 2) Start the guest and run X program
host_shell$ lkvm run -k /boot/bzImage
   guest_shell$ xlogo
 

Note, this is insecure, don't do this with untrusted guests.


-- 
error compiling committee.c: too many arguments to function


Re: KVM-enabled Linux 3.2 won't boot in kvm

2012-09-04 Thread Avi Kivity
On 08/24/2012 06:14 AM, Neal Murphy wrote:
 On Saturday 18 August 2012 22:04:20 Neal Murphy wrote:
 I've been using KVM for a few years now. I've had little trouble with it.
 But now it's got me treed. I cannot get a KVM-enabled Linux 3.2.27 kernel
 to boot in qemu-kvm unless I specify '-no-kvm'. I've used a
 similarly-built and - configured 2.6.35 kernel without trouble.
 
 ...
 
 This is on Debian Squeeze, either using Debian's kvm package or using a
 freshly built kvm 1.1.1. The kernel does boot using qemu or booting on real
 hardware. And I've no trouble booting Linux 2.6.35.
 
 Using Squebian's 2.6.32-5-686-bigmem, the 3.[024] kernels I built (with KVM) 
 don't boot.
 
 Trying Squebian's 2.6.32-5-amd64. ... 3.0.41 and 3.4.9 now boot 
 using either Squebian's kvm or qemu-kvm v1.1.1.
 
 Trying Squebian's 2.6.32-5-686. ... And the 3.0 and 3.4 kernels boot using 
 either version of qemu-kvm.
 
 Does that help narrow the problem?

No.  Please provide the guest's serial log.  Also run 'top' and
'kvm_stat' on the host to see what the guest is doing.

-- 
error compiling committee.c: too many arguments to function


Re: [RFC 5/5] KVM: ARM: Access all registers via KVM_GET_ONE_REG/KVM_SET_ONE_REG.

2012-09-04 Thread Peter Maydell
On 1 September 2012 20:40, Christoffer Dall
c.d...@virtualopensystems.com wrote:
 On Sep 1, 2012, at 6:25 AM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 1 September 2012 10:16, Avi Kivity a...@redhat.com wrote:
 On 08/29/2012 11:21 AM, Rusty Russell wrote:
 Peter Maydell wrote:
 ...but if we do go this path, you can't use coprocessor 0
 to mean "core register" -- cp0 could be a valid coprocessor
 (the ARM ARM reserves cp0..cp7 for vendor specific features).
 Use something outside 0..15.

 OK, changed that too (16).

 And tomorrow they will add 16.

 Not possible in the instruction encoding :-) We haven't used
 anywhere near all the coprocessors (even given we've let the
 vendors have 0..7, ARM itself uses only 10 and 11 for the FPU,
 14 for debug/perf and 15 for system control, and 14 and 15 still
 have lots of spare space).

 Yeah, but folding core registers under coprocessors feels just
 too fishy, so I think we should have a separate field.

I never really thought of the top half of the index encoding
as being particularly a coprocessor-number specific thing in
the first place. It's just 16 bits of "what is this thing
anyway?", where each coprocessor gets a bit of the space, and
so will the GIC, and the VFP regs, and so on. We just happened
to use 0..15 of the "what is this?" space for cp0..cp15.

(Incidentally, the term "coprocessor" is now basically just a
historical artefact. The bits of the CPU you get at via the
coprocessor registers and instruction encoding space are
not separate functional units, they're part of the core.)

-- PMM


Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread Juan Quintela
Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 04:35 PM, Jan Kiszka wrote:
 On 2012-09-03 13:48, Avi Kivity wrote:
 On 09/03/2012 09:44 AM, Juan Quintela wrote:

 Hi

 Please send in any agenda items you are interested in covering.
 
 - protecting MemoryRegion::opaque during dispatch
 
 I'm guessing Ping won't make it due to timezone problems.  Jan, if you
 will not participate, please remove the topic from the list (unless
 someone else wants to argue your side).
 
 Sorry, I'm blocked right at that time.

 NP, will continue on the list.

My understanding is that this topic has been recalled, so we don't have
any topics for today's agenda.

Avi?

Later, Juan.


Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread Avi Kivity
On 09/04/2012 04:16 PM, Juan Quintela wrote:
 Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 04:35 PM, Jan Kiszka wrote:
 On 2012-09-03 13:48, Avi Kivity wrote:
 On 09/03/2012 09:44 AM, Juan Quintela wrote:

 Hi

 Please send in any agenda items you are interested in covering.
 
 - protecting MemoryRegion::opaque during dispatch
 
 I'm guessing Ping won't make it due to timezone problems.  Jan, if you
 will not participate, please remove the topic from the list (unless
 someone else wants to argue your side).
 
 Sorry, I'm blocked right at that time.

 NP, will continue on the list.
 
 My understanding is that this topic has been recalled, so we don't have
 any topics for today's agenda.
 
 Avi?

Correct.



-- 
error compiling committee.c: too many arguments to function


Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread Juan Quintela
Juan Quintela quint...@redhat.com wrote:
 Hi

 Please send in any agenda items you are interested in covering.

As the memory region protection topic has been recalled, there are no
topics for this week.  The call is cancelled.

Have a nice week, Juan.



Re: [RFC 0/5] Making KVM_GET_ONE_REG/KVM_SET_ONE_REG generic.

2012-09-04 Thread Peter Maydell
On 1 September 2012 13:28, Rusty Russell ru...@rustcorp.com.au wrote:
 Rusty Russell (8):
   KVM: ARM: Fix walk_msrs()
   KVM: Move KVM_SET_ONE_REG/KVM_GET_ONE_REG to generic code.
   KVM: Add KVM_REG_SIZE() helper.
   KVM: ARM: use KVM_SET_ONE_REG/KVM_GET_ONE_REG.
   KVM: Add KVM_VCPU_GET_REG_LIST.
   KVM: ARM: Use KVM_VCPU_GET_REG_LIST.
   KVM: ARM: Access all registers via KVM_GET_ONE_REG/KVM_SET_ONE_REG.
   KVM ARM: Update api.txt

So I was thinking about this, and I remembered that the SET_ONE_REG/
GET_ONE_REG API has userspace pass a pointer to the variable the
kernel should read/write (unlike the _MSR x86 ioctls, where the
actual data value is sent back and forth in the struct). Further,
the kernel only writes a data value of the size of the register
(rather than always reading/writing a uint64_t).

This is a problem because it means userspace needs to know the
size of each register, and the kernel doesn't provide any way
to determine the size. This defeats the idea that userspace should
be able to migrate kernel register state without having to know
the semantics of all the registers involved.

Possible solutions:
 * switch GET/SET_ONE_REG to just passing data, same as the MSR ioctls
 * switch GET/SET_ONE_REG to always writing 64 bits regardless of
   actual guest register width
 * make GET_REG_LIST return register width as well as index

Personally I would really prefer the MSR-style "pass the data" approach.
Otherwise I'm going to end up constructing something like
 uint64_t actual_values[]
 struct kvm_one_reg regs[]

where regs[x].addr = &actual_values[x] for all x. Which seems
like unnecessary indirection really :-)

I could live with "always read/write 64 bits". I definitely don't
want to have to deal with matching up register widths to accesses
in userspace, please.
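
To spell out the indirection being objected to, here is a minimal userspace-only sketch. It makes no ioctl calls; struct one_reg_example is a hypothetical mirror of the (id, addr) pair discussed above, with the pointer carried as a 64-bit integer by assumption, and bind_regs() and read_bound() are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of an (id, addr) register descriptor. */
struct one_reg_example {
	uint64_t id;	/* which register */
	uint64_t addr;	/* userspace pointer to the value, as an integer */
};

/* Point every descriptor at its slot in a parallel array of values. */
void bind_regs(struct one_reg_example *regs, uint64_t *values,
	       const uint64_t *ids, size_t n)
{
	assert(regs && values && ids);
	for (size_t i = 0; i < n; i++) {
		regs[i].id = ids[i];
		regs[i].addr = (uint64_t)(uintptr_t)&values[i];
	}
}

/* Follow the pointer back, assuming a full 64-bit value was stored. */
uint64_t read_bound(const struct one_reg_example *reg)
{
	return *(const uint64_t *)(uintptr_t)reg->addr;
}
```

Note that read_bound() is only correct if every slot really holds 64 bits. If the producer writes just the register's native width, the reader has to know that width per register, which is exactly the problem described in this message.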

thanks
-- PMM


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 01:18:31PM +0200, Paolo Bonzini wrote:
 Il 04/09/2012 13:09, Michael S. Tsirkin ha scritto:
   queuecommand on CPU #0          queuecommand #2 on CPU #1
  --------------------------------------------------------------
   atomic_inc_return(...) == 1
                                   atomic_inc_return(...) == 2
                                   virtscsi_queuecommand to queue #1
   tgt->req_vq = queue #0
   virtscsi_queuecommand to queue #0
   
   then two requests are issued to different queues without a quiescent
   point in the middle.
  What happens then? Does this break correctness?
 
 Yes, requests to the same target should be processed in FIFO order, or
 you have things like a flush issued before the write it was supposed to
 flush.  This is why I can only change the queue when there is no request
 pending.
 
 Paolo

I see.  I guess you can rewrite this as:
	atomic_inc
	if (atomic_read() == 1)
which is a bit cheaper, and makes explicit the fact
that you do not need the increment and the return to be atomic.

Another simple idea: store the last processor id in the target;
if it is unchanged, there is no need to play with req_vq
and take the spinlock.

Also - some kind of comment explaining why a similar race cannot happen
with this lock in place would be nice: I see why this specific race cannot
trigger, but since the lock is dropped later, before you submit the command,
I have a hard time convincing myself what exactly guarantees that the vq is
never switched before or even while the command is submitted.




-- 
MST


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
Il 04/09/2012 15:35, Michael S. Tsirkin ha scritto:
 I see.  I guess you can rewrite this as:
 atomic_inc
 if (atomic_read() == 1)
 which is a bit cheaper, and make the fact
 that you do not need increment and return to be atomic,
 explicit.

It seems more complicated to me for hardly any reason.  (Besides, is it
cheaper?  It has one less memory barrier on some architectures I frankly
do not care much about---not on x86---but it also has two memory
accesses instead of one on all architectures).

 Another simple idea: store last processor id in target,
 if it is unchanged no need to play with req_vq
 and take spinlock.

Not so sure, consider the previous example with last_processor_id equal
to 1.

    queuecommand on CPU #0          queuecommand #2 on CPU #1
  --------------------------------------------------------------
    atomic_inc_return(...) == 1
                                    atomic_inc_return(...) == 2
                                    virtscsi_queuecommand to queue #1
    last_processor_id == 0? no
    spin_lock
    tgt->req_vq = queue #0
    spin_unlock
    virtscsi_queuecommand to queue #0

This is not a network driver, there are still a lot of locks around.
This micro-optimization doesn't pay enough for the pain.

 Also - some kind of comment explaining why a similar race cannot happen
 with this lock in place would be nice: I see why this specific race cannot
 trigger, but since the lock is dropped later, before you submit the command,
 I have a hard time convincing myself what exactly guarantees that the vq is
 never switched before or even while the command is submitted.

Because tgt->reqs will never become zero (which is a necessary condition
for tgt->req_vq to change), as long as one request is executing
virtscsi_queuecommand.

Paolo


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
Il 04/09/2012 14:48, Michael S. Tsirkin ha scritto:
  This patch adds queue steering to virtio-scsi.  When a target is sent
  multiple requests, we always drive them to the same queue so that FIFO
  processing order is kept.  However, if a target was idle, we can choose
  a queue arbitrarily.  In this case the queue is chosen according to the
  current VCPU, so the driver expects the number of request queues to be
  equal to the number of VCPUs.  This makes it easy and fast to select
  the queue, and also lets the driver optimize the IRQ affinity for the
  virtqueues (each virtqueue's affinity is set to the CPU that owns
  the queue).
  
  Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 I guess an alternative is a per-target vq.
 Is the reason you avoid this that you expect more targets
 than cpus? If yes this is something you might want to
 mention in the log.

One reason is that, even though in practice I expect roughly the same
number of targets and VCPUs, hotplug means the number of targets is
difficult to predict and is usually fixed to 256.

The other reason is that per-target vq didn't give any performance
advantage.  The bonus comes from cache locality and less process
migrations, more than from the independent virtqueues.

Paolo


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 03:45:57PM +0200, Paolo Bonzini wrote:
  Also - some kind of comment explaining why a similar race cannot happen
  with this lock in place would be nice: I see why this specific race cannot
  trigger, but since the lock is dropped later, before you submit the command, I
  have a hard time convincing myself what exactly guarantees that the vq is never
  switched before or even while the command is submitted.
 
 Because tgt->reqs will never become zero (which is a necessary condition
 for tgt->req_vq to change), as long as one request is executing
 virtscsi_queuecommand.
 
 Paolo

Yes but this logic would apparently imply the lock is not necessary, and
it actually is. I am not saying anything is wrong just that it
looks scary.

-- 
MST


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 03:49:42PM +0200, Paolo Bonzini wrote:
 Il 04/09/2012 14:48, Michael S. Tsirkin ha scritto:
   This patch adds queue steering to virtio-scsi.  When a target is sent
   multiple requests, we always drive them to the same queue so that FIFO
   processing order is kept.  However, if a target was idle, we can choose
   a queue arbitrarily.  In this case the queue is chosen according to the
   current VCPU, so the driver expects the number of request queues to be
   equal to the number of VCPUs.  This makes it easy and fast to select
   the queue, and also lets the driver optimize the IRQ affinity for the
   virtqueues (each virtqueue's affinity is set to the CPU that owns
   the queue).
   
   Signed-off-by: Paolo Bonzini pbonz...@redhat.com
  I guess an alternative is a per-target vq.
  Is the reason you avoid this that you expect more targets
  than cpus? If yes this is something you might want to
  mention in the log.
 
 One reason is that, even though in practice I expect roughly the same
 number of targets and VCPUs, hotplug means the number of targets is
 difficult to predict and is usually fixed to 256.
 
 The other reason is that per-target vq didn't give any performance
 advantage.  The bonus comes from cache locality and less process
 migrations, more than from the independent virtqueues.
 
 Paolo

Okay, and why is per-target worse for cache locality?

-- 
MST


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
Il 04/09/2012 16:19, Michael S. Tsirkin ha scritto:
   Also - some kind of comment explaining why a similar race cannot happen
   with this lock in place would be nice: I see why this specific race cannot
   trigger, but since the lock is dropped later, before you submit the command, I
   have a hard time convincing myself what exactly guarantees that the vq is never
   switched before or even while the command is submitted.
  
  Because tgt->reqs will never become zero (which is a necessary condition
  for tgt->req_vq to change), as long as one request is executing
  virtscsi_queuecommand.
 
 Yes but this logic would apparently imply the lock is not necessary, and
 it actually is. I am not saying anything is wrong just that it
 looks scary.

Ok, I get the misunderstanding.  For the logic to hold, you need a
serialization point after which tgt->req_vq is not changed.  The lock
provides one such serialization point: after you unlock tgt->tgt_lock,
nothing else will change tgt->req_vq until your request completes.

Without the lock, there could always be a thread that is in the "then"
branch but has been scheduled out, and when rescheduled it will change
tgt->req_vq.

Perhaps the confusion comes from the atomic_inc_return, and that was
what my "why is this atomic" wanted to clear up.  **tgt->reqs is only
atomic to avoid taking a spinlock in the ISR**.  If you read the code
with the lock, but with tgt->reqs as a regular non-atomic int, it should
be much easier to reason about the code.  I can split the patch if needed.
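
To make the invariant concrete, here is a minimal userspace model of
the steering logic, written with C11 atomics.  It is only a sketch
under assumptions: struct target, target_init, target_get_queue and
target_put_queue are hypothetical names, an atomic_flag spin loop
stands in for tgt->tgt_lock, and the queue is returned as an index
instead of being read later from tgt->req_vq.

```c
#include <stdatomic.h>

struct target {
    atomic_flag lock;   /* stands in for tgt->tgt_lock */
    atomic_int reqs;    /* in-flight requests, like tgt->reqs */
    int req_vq;         /* index of the queue currently in use */
};

void target_init(struct target *tgt)
{
    atomic_flag_clear(&tgt->lock);
    atomic_init(&tgt->reqs, 0);
    tgt->req_vq = -1;
}

/* Submission path: the queue may change only on the 0 -> 1 transition,
 * and that transition is observed while holding the lock. */
int target_get_queue(struct target *tgt, int cpu, int num_queues)
{
    int vq;

    while (atomic_flag_test_and_set_explicit(&tgt->lock,
                                             memory_order_acquire))
        ;   /* spin until the lock is free */
    if (atomic_fetch_add(&tgt->reqs, 1) + 1 == 1)
        tgt->req_vq = cpu % num_queues;
    vq = tgt->req_vq;
    atomic_flag_clear_explicit(&tgt->lock, memory_order_release);
    return vq;
}

/* Completion path: only the counter is touched, so no lock is taken. */
void target_put_queue(struct target *tgt)
{
    atomic_fetch_sub(&tgt->reqs, 1);
}
```

In this model a second submitter on another CPU gets the same queue as
the first for as long as any request is outstanding; the queue can move
only after every outstanding request has called target_put_queue.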

Paolo


Re: [RFC 5/5] KVM: ARM: Access all registers via KVM_GET_ONE_REG/KVM_SET_ONE_REG.

2012-09-04 Thread Christoffer Dall
On Tue, Sep 4, 2012 at 9:09 AM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 1 September 2012 20:40, Christoffer Dall
 c.d...@virtualopensystems.com wrote:
 On Sep 1, 2012, at 6:25 AM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 1 September 2012 10:16, Avi Kivity a...@redhat.com wrote:
 On 08/29/2012 11:21 AM, Rusty Russell wrote:
 Peter Maydell wrote:
 ...but if we do go this path, you can't use coprocessor 0
 to mean core register -- cp0 could be a valid coprocessor
 (the ARM ARM reserves cp0..cp7 for vendor specific features).
 Use something outside 0..15.

 OK, changed that too (16).

 And tomorrow they will add 16.

 Not possible in the instruction encoding :-) We haven't used
 anywhere near all the coprocessors (even given we've let the
 vendors have 0..7, ARM itself uses only 10 and 11 for the FPU,
 14 for debug/perf and 15 for system control (and 14 and 15 still
 have lots of spare space).

 Yeah, but folding core registers under coprocessors feels just
 too fishy, so I think we should have a separate field.

 I never really thought of the top half of the index encoding
 as being particularly a coprocessor-number specific thing in
 the first place. It's just 16 bits of "what is this thing
 anyway?", where each coprocessor gets a bit of the space, and
 so will the GIC, and the VFP regs, and so on. We just happened
 to use 0..15 of the "what is this?" space for cp0..cp15.

 (Incidentally, the term "coprocessor" is now basically just a
 historical artefact. The bits of the CPU you get at via the
 coprocessor registers and instruction encoding space are
 not separate functional units, they're part of the core.)

that's fine, but then the #define's shouldn't be called something with
COPROC in their names.

-Christoffer


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
Il 04/09/2012 16:21, Michael S. Tsirkin ha scritto:
  One reason is that, even though in practice I expect roughly the same
  number of targets and VCPUs, hotplug means the number of targets is
  difficult to predict and is usually fixed to 256.
  
  The other reason is that per-target vq didn't give any performance
  advantage.  The bonus comes from cache locality and less process
  migrations, more than from the independent virtqueues.
 
 Okay, and why is per-target worse for cache locality?

Because per-target doesn't have IRQ affinity for a particular CPU.

Assuming that the thread that is sending requests to the device is
I/O-bound, it is likely to be sleeping at the time the ISR is executed,
and thus executing the ISR on the same processor that sent the requests
is cheap.

But if you have many such I/O-bound processes, the kernel will execute
the ISR on a random processor, rather than the one that is sending
requests to the device.

Paolo


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 04:30:35PM +0200, Paolo Bonzini wrote:
 Il 04/09/2012 16:21, Michael S. Tsirkin ha scritto:
   One reason is that, even though in practice I expect roughly the same
   number of targets and VCPUs, hotplug means the number of targets is
   difficult to predict and is usually fixed to 256.
   
   The other reason is that per-target vq didn't give any performance
   advantage.  The bonus comes from cache locality and less process
   migrations, more than from the independent virtqueues.
  
  Okay, and why is per-target worse for cache locality?
 
 Because per-target doesn't have IRQ affinity for a particular CPU.
 
 Assuming that the thread that is sending requests to the device is
 I/O-bound, it is likely to be sleeping at the time the ISR is executed,
 and thus executing the ISR on the same processor that sent the requests
 is cheap.
 
 But if you have many such I/O-bound processes, the kernel will execute
 the ISR on a random processor, rather than the one that is sending
 requests to the device.
 
 Paolo

I see, another case where our irq balancing makes bad decisions.
You could do it differently: pin the irq to the cpu of the last task that
executed, and tweak the irq affinity when that changes.
Still, if you want to support 256 targets, a vector per target
is not going to work.

It would be nice to add this motivation to the commit log, I think.

-- 
MST


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Aug 28, 2012 at 01:54:17PM +0200, Paolo Bonzini wrote:
 @@ -575,15 +630,19 @@ static struct scsi_host_template virtscsi_host_template = {
 &__val, sizeof(__val)); \
   })
  
 +

Pls don't add empty lines.

  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
 -  struct virtqueue *vq)
 +  struct virtqueue *vq, bool affinity)
  {
   spin_lock_init(&virtscsi_vq->vq_lock);
   virtscsi_vq->vq = vq;
 + if (affinity)
 + virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
 +VIRTIO_SCSI_VQ_BASE);
  }
  

This means in practice if you have less virtqueues than CPUs,
things are not going to work well, will they?

Any idea what to do?

-- 
MST


Re: [libvirt-users] vm pxe fail

2012-09-04 Thread Alex Jia
- Original Message -
From: Avi Kivity a...@redhat.com
To: Alex Jia a...@redhat.com
Cc: Andrew Holway a.hol...@syseleven.de, kvm@vger.kernel.org
Sent: Tuesday, September 4, 2012 7:44:36 PM
Subject: Re: [libvirt-users] vm pxe fail

On 09/04/2012 02:31 PM, Alex Jia wrote:
 - Original Message -
 From: Avi Kivity a...@redhat.com
 To: Alex Jia a...@redhat.com
 Cc: Andrew Holway a.hol...@syseleven.de, kvm@vger.kernel.org
 Sent: Monday, September 3, 2012 9:27:08 PM
 Subject: Re: [libvirt-users] vm pxe fail
 
 On 08/31/2012 05:37 PM, Alex Jia wrote:
 Hi Andrew,
 Great. BTW, you can in fact PXE boot via a VF of the Intel 82576. However,
 Intel 82576 SR-IOV network adapters don't provide a ROM BIOS for the card's
 virtual functions (VFs), but an image of such a ROM is available, and with
 this ROM visible to the guest, it can PXE boot.
 
 In libvirt, you need to configure the guest XML like this:
 
   <hostdev mode='subsystem' type='pci' managed='yes'>
     <source>
       <address bus='XX' slot='XX' function='XX'/>
     </source>
     <boot order='1'/>
     <rom bar='on' file='//ipxe-808610ca.rom'/>
   </hostdev>
 
 You need to build ipxe-808610ca.rom yourself; if you're interested in
 this, please refer to http://ipxe.org/.
 
 Is there a way to automate this?  Perhaps a database matching PCI IDs
 and ipxe .roms, which qemu could consult?
 
Hi Avi,
Good question. I haven't tried this via qemu yet. From the libvirt point of
view, we can filter and parse the output of 'lspci' or
'virsh nodedev-list --tree' to get the bus, slot and function numbers, then
add them to the guest XML above. As for the 'ipxe-808610ca.rom' file, we can
'git clone git://git.ipxe.org/ipxe.git', compile it, and generate a .rom file
such as 82576.rom, or use the vendor+product ID as the ROM name if you like.

We could have qemu autoload /usr/share/qemu/roms/vendor-device.rom, and
symlink /usr/share/qemu/roms to /usr/share/ipxe/roms or something.
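
A lookup along those lines could be sketched as below.  This is purely
illustrative of the proposal and not existing qemu behaviour:
rom_path_for_device is a hypothetical helper, and the file name format
(four hex digits of vendor ID, a dash, four hex digits of device ID) is
an assumption about what "vendor-device.rom" would mean.

```c
#include <stdint.h>
#include <stdio.h>

/* Build "<rom_dir>/<vendor>-<device>.rom" from PCI IDs.
 * Returns 0 on success, -1 if the result does not fit in buf. */
int rom_path_for_device(char *buf, size_t len, const char *rom_dir,
                        uint16_t vendor, uint16_t device)
{
    int n = snprintf(buf, len, "%s/%04x-%04x.rom", rom_dir,
                     (unsigned int)vendor, (unsigned int)device);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For the 82576 VF mentioned above (vendor 0x8086, device 0x10ca) and the
directory /usr/share/qemu/roms, this yields
/usr/share/qemu/roms/8086-10ca.rom; the caller would still have to
check whether that file exists.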

 Avi, good to know these, thanks :)
 

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Paolo Bonzini
Il 04/09/2012 16:47, Michael S. Tsirkin ha scritto:
   static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
  -   struct virtqueue *vq)
  +   struct virtqueue *vq, bool affinity)
   {
 spin_lock_init(&virtscsi_vq->vq_lock);
 virtscsi_vq->vq = vq;
  +  if (affinity)
  +  virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
  + VIRTIO_SCSI_VQ_BASE);
   }
   
 This means in practice if you have less virtqueues than CPUs,
 things are not going to work well, will they?

Not particularly.  It could be better or worse than single queue
depending on the workload.

 Any idea what to do?

Two possibilities:

1) Add a stride argument to virtqueue_set_affinity, and make it equal to
the number of queues.

2) Make multiqueue the default in QEMU, and make the default number of
queues equal to the number of VCPUs.

I was going for (2).
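
For reference, the mapping between CPUs and queues under discussion can
be sketched as follows.  queue_for_cpu mirrors the wrap-around done in
virtscsi_queuecommand_multi; cpu_for_queue is one possible reading of
the stride idea in (1), where queue q would serve CPUs q, q + stride,
q + 2*stride and so on.  Both function names are made up for this
example and nothing like cpu_for_queue exists in the patch.

```c
/* Queue used by a submitting CPU: wrap the CPU number into the range
 * of available request queues. */
unsigned int queue_for_cpu(unsigned int cpu, unsigned int num_queues)
{
    return cpu % num_queues;
}

/* Hypothetical stride mapping: the k-th CPU served by a given queue,
 * or -1 if that CPU number does not exist. */
int cpu_for_queue(unsigned int queue, unsigned int k,
                  unsigned int stride, unsigned int num_cpus)
{
    unsigned int cpu = queue + k * stride;

    return cpu < num_cpus ? (int)cpu : -1;
}
```

With 8 CPUs and 4 queues, CPU 5 submits to queue 1, and with a stride
of 4 queue 1 would be associated with CPUs 1 and 5.  With option (2)
the number of queues equals the number of VCPUs and the mapping is the
identity.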

Paolo


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 04:55:56PM +0200, Paolo Bonzini wrote:
 Il 04/09/2012 16:47, Michael S. Tsirkin ha scritto:
static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
   - struct virtqueue *vq)
   + struct virtqueue *vq, bool affinity)
{
spin_lock_init(&virtscsi_vq->vq_lock);
virtscsi_vq->vq = vq;
   +if (affinity)
   +virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
   +   VIRTIO_SCSI_VQ_BASE);
}

  This means in practice if you have less virtqueues than CPUs,
  things are not going to work well, will they?
 
 Not particularly.  It could be better or worse than single queue
 depending on the workload.

Well, interrupts will go to a CPU different from the one
that sends commands, so ...

  Any idea what to do?
 
 Two possibilities:
 
 1) Add a stride argument to virtqueue_set_affinity, and make it equal to
 the number of queues.
 
 2) Make multiqueue the default in QEMU, and make the default number of
 queues equal to the number of VCPUs.
 
 I was going for (2).
 
 Paolo

3. use per target queue if less targets than cpus?

-- 
MST



[PATCH] KVM: tsc deadline timer works only when hrtimer high resolution configured

2012-09-04 Thread Liu, Jinsong
From 728a17e2de591b557c3c8ba31076b4bf2ca5ab42 Mon Sep 17 00:00:00 2001
From: Liu, Jinsong jinsong@intel.com
Date: Wed, 5 Sep 2012 03:18:15 +0800
Subject: [PATCH] KVM: tsc deadline timer works only when hrtimer high 
resolution configured

This is for 2 reasons:
1. it's pointless to expose the tsc deadline timer to the guest when the kernel
hrtimer is not configured as high resolution, since the timer would then be
imprecise, being based on the timer wheel;
2. the tsc deadline timer is based on hrtimer: it sets a leftmost node in the
rb tree and then reprograms the hrtimer. If hrtimer is not configured as high
resolution, hrtimer_enqueue_reprogram does nothing, which would make the tsc
deadline timer fail.

Signed-off-by: Liu, Jinsong jinsong@intel.com
---
 arch/x86/kvm/x86.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 148ed66..0e64997 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2210,7 +2210,11 @@ int kvm_dev_ioctl_check_extension(long ext)
r = kvm_has_tsc_control;
break;
case KVM_CAP_TSC_DEADLINE_TIMER:
+#ifdef CONFIG_HIGH_RES_TIMERS
r = boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER);
+#else
+   r = 0;
+#endif
break;
default:
r = 0;
-- 
1.7.1




[PATCH v2 4/7] s390: Move css limits from drivers/s390/cio/ to include/asm/.

2012-09-04 Thread Cornelia Huck
There's no need to keep __MAX_SUBCHANNEL and __MAX_SSID private to the
common I/O layer when __MAX_CSSID is usable by everybody.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/include/asm/cio.h | 2 ++
 drivers/s390/cio/css.h  | 3 ---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/cio.h b/arch/s390/include/asm/cio.h
index 77043aa..9b6cc82 100644
--- a/arch/s390/include/asm/cio.h
+++ b/arch/s390/include/asm/cio.h
@@ -9,6 +9,8 @@
 
 #define LPM_ANYPATH 0xff
 #define __MAX_CSSID 0
+#define __MAX_SUBCHANNEL 65535
+#define __MAX_SSID 3
 
 #include <asm/scsw.h>
 
diff --git a/drivers/s390/cio/css.h b/drivers/s390/cio/css.h
index 33bb4d8..4af3dfe 100644
--- a/drivers/s390/cio/css.h
+++ b/drivers/s390/cio/css.h
@@ -112,9 +112,6 @@ extern int for_each_subchannel(int(*fn)(struct subchannel_id, void *), void *);
 extern void css_reiterate_subchannels(void);
 void css_update_ssd_info(struct subchannel *sch);
 
-#define __MAX_SUBCHANNEL 65535
-#define __MAX_SSID 3
-
 struct channel_subsystem {
u8 cssid;
int valid;
-- 
1.7.11.5



[PATCH v2 6/7] s390/kvm: Base infrastructure for enabling capabilities.

2012-09-04 Thread Cornelia Huck
Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 Documentation/virtual/kvm/api.txt |  2 +-
 arch/s390/kvm/kvm-s390.c  | 26 ++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index b91bfd4..9c71aaa 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -913,7 +913,7 @@ documentation when it pops into existence).
 4.37 KVM_ENABLE_CAP
 
 Capability: KVM_CAP_ENABLE_CAP
-Architectures: ppc
+Architectures: ppc, s390
 Type: vcpu ioctl
 Parameters: struct kvm_enable_cap (in)
 Returns: 0 on success; -1 on error
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index e83df7f..4b0681c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -140,6 +140,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 #endif
case KVM_CAP_SYNC_REGS:
case KVM_CAP_ONE_REG:
+   case KVM_CAP_ENABLE_CAP:
r = 1;
break;
case KVM_CAP_NR_VCPUS:
@@ -807,6 +808,22 @@ int kvm_s390_vcpu_store_status(struct kvm_vcpu *vcpu, unsigned long addr)
return 0;
 }
 
+static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
+struct kvm_enable_cap *cap)
+{
+   int r;
+
+   if (cap->flags)
+   return -EINVAL;
+
+   switch (cap->cap) {
+   default:
+   r = -EINVAL;
+   break;
+   }
+   return r;
+}
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 unsigned int ioctl, unsigned long arg)
 {
@@ -893,6 +910,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_ENABLE_CAP:
+   {
+   struct kvm_enable_cap cap;
+   r = -EFAULT;
+   if (copy_from_user(&cap, argp, sizeof(cap)))
+   break;
+   r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
+   break;
+   }
default:
r = -ENOTTY;
}
-- 
1.7.11.5



[PATCH v2 1/7] s390/kvm: Support for I/O interrupts.

2012-09-04 Thread Cornelia Huck
Add support for handling I/O interrupts (standard, subchannel-related
ones and rudimentary adapter interrupts).

The subchannel-identifying parameters are encoded into the interrupt
type.

I/O interrupts are floating, so they can't be injected on a specific
vcpu.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/include/asm/kvm_host.h |   2 +
 arch/s390/kvm/interrupt.c| 115 +--
 include/linux/kvm.h  |   6 ++
 3 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index b784154..e47f697 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -76,6 +76,7 @@ struct kvm_s390_sie_block {
__u64   epoch;  /* 0x0038 */
__u8reserved40[4];  /* 0x0040 */
 #define LCTL_CR0   0x8000
+#define LCTL_CR6   0x0200
__u16   lctl;   /* 0x0044 */
__s16   icpua;  /* 0x0046 */
__u32   ictl;   /* 0x0048 */
@@ -127,6 +128,7 @@ struct kvm_vcpu_stat {
u32 deliver_prefix_signal;
u32 deliver_restart_signal;
u32 deliver_program_int;
+   u32 deliver_io_int;
u32 exit_wait_state;
u32 instruction_stidp;
u32 instruction_spx;
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 7556231..1dccfe7 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -21,11 +21,26 @@
 #include "gaccess.h"
 #include "trace-s390.h"
 
+#define IOINT_SCHID_MASK 0x
+#define IOINT_SSID_MASK 0x0003
+#define IOINT_CSSID_MASK 0x03fc
+#define IOINT_AI_MASK 0x0400
+
+static int is_ioint(u64 type)
+{
+   return ((type  0xfffeu) != 0xfffeu);
+}
+
 static int psw_extint_disabled(struct kvm_vcpu *vcpu)
 {
return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_EXT);
 }
 
+static int psw_ioint_disabled(struct kvm_vcpu *vcpu)
+{
+   return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_IO);
+}
+
 static int psw_interrupts_disabled(struct kvm_vcpu *vcpu)
 {
if ((vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PER) ||
@@ -68,7 +83,18 @@ static int __interrupt_is_deliverable(struct kvm_vcpu *vcpu,
case KVM_S390_RESTART:
return 1;
default:
-   BUG();
+   if (is_ioint(inti->type)) {
+   if (psw_ioint_disabled(vcpu))
+   return 0;
+   if (vcpu->arch.sie_block->gcr[6] &
+   inti->io.io_int_word)
+   return 1;
+   return 0;
+   } else {
+   printk(KERN_WARNING "illegal interrupt type %llx\n",
+  inti->type);
+   BUG();
+   }
}
return 0;
 }
@@ -117,6 +143,13 @@ static void __set_intercept_indicator(struct kvm_vcpu *vcpu,
__set_cpuflag(vcpu, CPUSTAT_STOP_INT);
break;
default:
+   if (is_ioint(inti->type)) {
+   if (psw_ioint_disabled(vcpu))
+   __set_cpuflag(vcpu, CPUSTAT_IO_INT);
+   else
+   vcpu->arch.sie_block->lctl |= LCTL_CR6;
+   break;
+   }
BUG();
}
 }
@@ -298,7 +331,49 @@ static void __do_deliver_interrupt(struct kvm_vcpu *vcpu,
break;
 
default:
-   BUG();
+   if (is_ioint(inti->type)) {
+   __u32 param0 = ((__u32)inti->io.subchannel_id << 16) |
+   inti->io.subchannel_nr;
+   __u64 param1 = ((__u64)inti->io.io_int_parm << 32) |
+   inti->io.io_int_word;
+   VCPU_EVENT(vcpu, 4,
+  "interrupt: I/O %llx", inti->type);
+   vcpu->stat.deliver_io_int++;
+   trace_kvm_s390_deliver_interrupt(vcpu->vcpu_id, inti->type,
+param0, param1);
+   rc = put_guest_u16(vcpu, __LC_SUBCHANNEL_ID,
+  inti->io.subchannel_id);
+   if (rc == -EFAULT)
+   exception = 1;
+
+   rc = put_guest_u16(vcpu, __LC_SUBCHANNEL_NR,
+  inti->io.subchannel_nr);
+   if (rc == -EFAULT)
+   exception = 1;
+
+   rc = put_guest_u32(vcpu, __LC_IO_INT_PARM,
+  inti->io.io_int_parm);
+   if (rc == -EFAULT)
+   exception = 1;
+
+   rc = put_guest_u32(vcpu, __LC_IO_INT_WORD,
+ 

[PATCH v2 3/7] s390/kvm: In-kernel handling of I/O instructions.

2012-09-04 Thread Cornelia Huck
Explicitly catch all channel I/O related instruction intercepts
in the kernel and set condition code 3 for them.

This paves the way for properly handling these instructions later
on.

Note: This is not architecture compliant (the previous code wasn't
either) since setting cc 3 is not the correct thing to do for some
of these instructions. For Linux guests, however, it still has the
intended effect of stopping css probing.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/intercept.c | 19 +---
 arch/s390/kvm/kvm-s390.h  |  1 +
 arch/s390/kvm/priv.c  | 56 +--
 3 files changed, 56 insertions(+), 20 deletions(-)

diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index ec1177f..754dc9e 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -33,8 +33,6 @@ static int handle_lctlg(struct kvm_vcpu *vcpu)
int reg, rc;
 
vcpu->stat.instruction_lctlg++;
-   if ((vcpu->arch.sie_block->ipb & 0xff) != 0x2f)
-   return -EOPNOTSUPP;
 
useraddr = disp2;
if (base2)
@@ -104,6 +102,21 @@ static int handle_lctl(struct kvm_vcpu *vcpu)
return 0;
 }
 
+static intercept_handler_t eb_handlers[256] = {
+   [0x2f] = handle_lctlg,
+   [0x8a] = kvm_s390_handle_priv_eb,
+};
+
+static int handle_eb(struct kvm_vcpu *vcpu)
+{
+   intercept_handler_t handler;
+
+   handler = eb_handlers[vcpu->arch.sie_block->ipb & 0xff];
+   if (handler)
+   return handler(vcpu);
+   return -EOPNOTSUPP;
+}
+
 static intercept_handler_t instruction_handlers[256] = {
[0x01] = kvm_s390_handle_01,
[0x82] = kvm_s390_handle_lpsw,
@@ -113,7 +126,7 @@ static intercept_handler_t instruction_handlers[256] = {
[0xb7] = handle_lctl,
[0xb9] = kvm_s390_handle_b9,
[0xe5] = kvm_s390_handle_e5,
-   [0xeb] = handle_lctlg,
+   [0xeb] = handle_eb,
 };
 
 static int handle_noop(struct kvm_vcpu *vcpu)
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index b1e1cb6..7f50229 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -83,6 +83,7 @@ int kvm_s390_handle_e5(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_01(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_b9(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_lpsw(struct kvm_vcpu *vcpu);
+int kvm_s390_handle_priv_eb(struct kvm_vcpu *vcpu);
 
 /* implemented in sigp.c */
 int kvm_s390_handle_sigp(struct kvm_vcpu *vcpu);
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 7e7263c..8b79a94 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -135,20 +135,9 @@ static int handle_skey(struct kvm_vcpu *vcpu)
return 0;
 }
 
-static int handle_stsch(struct kvm_vcpu *vcpu)
+static int handle_io_inst(struct kvm_vcpu *vcpu)
 {
-   vcpu->stat.instruction_stsch++;
-   VCPU_EVENT(vcpu, 4, "%s", "store subchannel - CC3");
-   /* condition code 3 */
-   vcpu->arch.sie_block->gpsw.mask &= ~(3ul << 44);
-   vcpu->arch.sie_block->gpsw.mask |= (3 & 3ul) << 44;
-   return 0;
-}
-
-static int handle_chsc(struct kvm_vcpu *vcpu)
-{
-   vcpu->stat.instruction_chsc++;
-   VCPU_EVENT(vcpu, 4, "%s", "channel subsystem call - CC3");
+   VCPU_EVENT(vcpu, 4, "%s", "I/O instruction");
/* condition code 3 */
vcpu->arch.sie_block->gpsw.mask &= ~(3ul << 44);
vcpu->arch.sie_block->gpsw.mask |= (3 & 3ul) << 44;
@@ -392,7 +381,7 @@ out_fail:
return 0;
 }
 
-static intercept_handler_t priv_handlers[256] = {
+static intercept_handler_t b2_handlers[256] = {
[0x02] = handle_stidp,
[0x10] = handle_set_prefix,
[0x11] = handle_store_prefix,
@@ -400,8 +389,22 @@ static intercept_handler_t priv_handlers[256] = {
[0x29] = handle_skey,
[0x2a] = handle_skey,
[0x2b] = handle_skey,
-   [0x34] = handle_stsch,
-   [0x5f] = handle_chsc,
+   [0x30] = handle_io_inst,
+   [0x31] = handle_io_inst,
+   [0x32] = handle_io_inst,
+   [0x33] = handle_io_inst,
+   [0x34] = handle_io_inst,
+   [0x35] = handle_io_inst,
+   [0x36] = handle_io_inst,
+   [0x37] = handle_io_inst,
+   [0x38] = handle_io_inst,
+   [0x39] = handle_io_inst,
+   [0x3a] = handle_io_inst,
+   [0x3b] = handle_io_inst,
+   [0x3c] = handle_io_inst,
+   [0x5f] = handle_io_inst,
+   [0x74] = handle_io_inst,
+   [0x76] = handle_io_inst,
[0x7d] = handle_stsi,
[0xb1] = handle_stfl,
[0xb2] = handle_lpswe,
@@ -418,7 +421,7 @@ int kvm_s390_handle_b2(struct kvm_vcpu *vcpu)
 * state bit and (a) handle the instruction or (b) send a code 2
 * program check.
 * Anything else goes to userspace.*/
-   handler = priv_handlers[vcpu->arch.sie_block->ipa & 0x00ff];
+   handler = b2_handlers[vcpu->arch.sie_block->ipa & 0x00ff];
if (handler) {
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)

[PATCH v2 2/4] s390: Add a mechanism to get the subchannel id.

2012-09-04 Thread Cornelia Huck
This will be needed by the new virtio-ccw transport.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---

Changes v1->v2:
- make it EXPORT_SYMBOL_GPL to get in line with other interfaces

---
 arch/s390/include/asm/ccwdev.h |  5 +
 drivers/s390/cio/device_ops.c  | 12 
 2 files changed, 17 insertions(+)

diff --git a/arch/s390/include/asm/ccwdev.h b/arch/s390/include/asm/ccwdev.h
index 1cb4bb3..9ad79f7 100644
--- a/arch/s390/include/asm/ccwdev.h
+++ b/arch/s390/include/asm/ccwdev.h
@@ -18,6 +18,9 @@ struct irb;
 struct ccw1;
 struct ccw_dev_id;
 
+/* from asm/schid.h */
+struct subchannel_id;
+
 /* simplified initializers for struct ccw_device:
  * CCW_DEVICE and CCW_DEVICE_DEVTYPE initialize one
  * entry in your MODULE_DEVICE_TABLE and set the match_flag correctly */
@@ -226,5 +229,7 @@ int ccw_device_siosl(struct ccw_device *);
 // FIXME: these have to go
 extern int _ccw_device_get_subchannel_number(struct ccw_device *);
 
+extern void ccw_device_get_schid(struct ccw_device *, struct subchannel_id *);
+
 extern void *ccw_device_get_chp_desc(struct ccw_device *, int);
 #endif /* _S390_CCWDEV_H_ */
diff --git a/drivers/s390/cio/device_ops.c b/drivers/s390/cio/device_ops.c
index ec7fb6d..2ad832f 100644
--- a/drivers/s390/cio/device_ops.c
+++ b/drivers/s390/cio/device_ops.c
@@ -763,6 +763,18 @@ _ccw_device_get_subchannel_number(struct ccw_device *cdev)
return cdev->private->schid.sch_no;
 }
 
+/**
+ * ccw_device_get_schid - obtain a subchannel id
+ * @cdev: device to obtain the id for
+ * @schid: where to fill in the values
+ */
+void ccw_device_get_schid(struct ccw_device *cdev, struct subchannel_id *schid)
+{
+   struct subchannel *sch = to_subchannel(cdev->dev.parent);
+
+   *schid = sch->schid;
+}
+EXPORT_SYMBOL_GPL(ccw_device_get_schid);
 
 MODULE_LICENSE("GPL");
 EXPORT_SYMBOL(ccw_device_set_options_mask);
-- 
1.7.11.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 0/4] s390: virtio-ccw guest kernel support.

2012-09-04 Thread Cornelia Huck
Hi,

here's the second revision of the guest support for virtio-ccw.

The first patch has gotten several changes and now handles checking
for s390-virtio support much more nicely.

The third patch has been adapted to the changed virtio-ccw interface.

Cornelia Huck (4):
  s390/kvm: Handle hosts not supporting s390-virtio.
  s390: Add a mechanism to get the subchannel id.
  s390/kvm: Add a channel I/O based virtio transport driver.
  s390/kvm: Split out early console code.

 arch/s390/include/asm/ccwdev.h  |   5 +
 arch/s390/include/asm/irq.h |   1 +
 arch/s390/kernel/irq.c  |   1 +
 drivers/s390/cio/device_ops.c   |  12 +
 drivers/s390/kvm/Makefile   |   2 +-
 drivers/s390/kvm/early_printk.c |  42 +++
 drivers/s390/kvm/kvm_virtio.c   |  64 ++--
 drivers/s390/kvm/virtio_ccw.c   | 789 
 8 files changed, 882 insertions(+), 34 deletions(-)
 create mode 100644 drivers/s390/kvm/early_printk.c
 create mode 100644 drivers/s390/kvm/virtio_ccw.c

-- 
1.7.11.5



[PATCH v2 4/4] s390/kvm: Split out early console code.

2012-09-04 Thread Cornelia Huck
This code is transport agnostic and can be used by both the legacy
virtio code and virtio_ccw.
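
For anyone reviewing the split without the tree at hand, the truncation
behaviour of early_put_chars() can be sketched in userspace. This is only
an illustration under stated assumptions: fake_notify() and last_output are
hypothetical stand-ins for the KVM_S390_VIRTIO_NOTIFY hypercall, and
SCRATCH_SIZE mirrors the scratch[17] buffer in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_SIZE 17 /* same size as scratch[] in early_put_chars() */

/* Hypothetical stand-in for kvm_hypercall1(KVM_S390_VIRTIO_NOTIFY, ...):
 * it only remembers the last string it was handed. */
static char last_output[SCRATCH_SIZE];

static void fake_notify(const char *s)
{
	strncpy(last_output, s, SCRATCH_SIZE - 1);
	last_output[SCRATCH_SIZE - 1] = '\0';
}

/* Copy at most SCRATCH_SIZE - 1 bytes, NUL-terminate the copy, and
 * return how many bytes were consumed so the caller can resubmit
 * whatever did not fit. */
static int sketch_put_chars(const char *buf, int count)
{
	char scratch[SCRATCH_SIZE];
	unsigned int len;

	assert(buf != NULL && count >= 0);
	len = (unsigned int)count;
	if (len > sizeof(scratch) - 1)
		len = sizeof(scratch) - 1;
	scratch[len] = '\0';
	memcpy(scratch, buf, len);
	fake_notify(scratch);
	return (int)len;
}
```

A caller that passes more than 16 bytes gets 16 back and is expected to
call again with the remainder.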

Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---
 drivers/s390/kvm/Makefile   |  2 +-
 drivers/s390/kvm/early_printk.c | 42 +
 drivers/s390/kvm/kvm_virtio.c   | 29 ++--
 drivers/s390/kvm/virtio_ccw.c   |  1 -
 4 files changed, 45 insertions(+), 29 deletions(-)
 create mode 100644 drivers/s390/kvm/early_printk.c

diff --git a/drivers/s390/kvm/Makefile b/drivers/s390/kvm/Makefile
index 241891a..a3c8fc4 100644
--- a/drivers/s390/kvm/Makefile
+++ b/drivers/s390/kvm/Makefile
@@ -6,4 +6,4 @@
 # it under the terms of the GNU General Public License (version 2 only)
 # as published by the Free Software Foundation.
 
-obj-$(CONFIG_S390_GUEST) += kvm_virtio.o virtio_ccw.o
+obj-$(CONFIG_S390_GUEST) += kvm_virtio.o early_printk.o virtio_ccw.o
diff --git a/drivers/s390/kvm/early_printk.c b/drivers/s390/kvm/early_printk.c
new file mode 100644
index 000..7831530
--- /dev/null
+++ b/drivers/s390/kvm/early_printk.c
@@ -0,0 +1,42 @@
+/*
+ * early_printk.c - code for early console output with virtio_console
+ * split off from kvm_virtio.c
+ *
+ * Copyright IBM Corp. 2008
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ *Author(s): Christian Borntraeger borntrae...@de.ibm.com
+ */
+
+#include <linux/kernel_stat.h>
+#include <linux/init.h>
+#include <linux/err.h>
+#include <linux/virtio_console.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_virtio.h>
+#include <asm/setup.h>
+#include <asm/sclp.h>
+
+static __init int early_put_chars(u32 vtermno, const char *buf, int count)
+{
+   char scratch[17];
+   unsigned int len = count;
+
+	if (len > sizeof(scratch) - 1)
+   len = sizeof(scratch) - 1;
+   scratch[len] = '\0';
+   memcpy(scratch, buf, len);
+   kvm_hypercall1(KVM_S390_VIRTIO_NOTIFY, __pa(scratch));
+   return len;
+}
+
+static int __init s390_virtio_console_init(void)
+{
+   if (sclp_has_vt220() || sclp_has_linemode())
+   return -ENODEV;
+   return virtio_cons_early_init(early_put_chars);
+}
+console_initcall(s390_virtio_console_init);
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 76b95f3..6cdc66a 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -17,7 +17,6 @@
 #include <linux/virtio.h>
 #include <linux/virtio_config.h>
 #include <linux/slab.h>
-#include <linux/virtio_console.h>
 #include <linux/interrupt.h>
 #include <linux/virtio_ring.h>
 #include <linux/export.h>
@@ -25,9 +24,9 @@
 #include <asm/io.h>
 #include <asm/kvm_para.h>
 #include <asm/kvm_virtio.h>
-#include <asm/sclp.h>
 #include <asm/setup.h>
 #include <asm/irq.h>
+#include <asm/sclp.h>
 
 #define VIRTIO_SUBCODE_64 0x0D00
 
@@ -450,8 +449,7 @@ static int __init kvm_devices_init(void)
return -ENODEV;
 
 	if (test_devices_support(real_memory_size) < 0)
-   /* No error. */
-   return 0;
+   return -ENODEV;
 
rc = vmem_add_mapping(real_memory_size, PAGE_SIZE);
if (rc)
@@ -476,29 +474,6 @@ static int __init kvm_devices_init(void)
return 0;
 }
 
-/* code for early console output with virtio_console */
-static __init int early_put_chars(u32 vtermno, const char *buf, int count)
-{
-   char scratch[17];
-   unsigned int len = count;
-
-	if (len > sizeof(scratch) - 1)
-   len = sizeof(scratch) - 1;
-   scratch[len] = '\0';
-   memcpy(scratch, buf, len);
-   kvm_hypercall1(KVM_S390_VIRTIO_NOTIFY, __pa(scratch));
-   return len;
-}
-
-static int __init s390_virtio_console_init(void)
-{
-   if (sclp_has_vt220() || sclp_has_linemode())
-   return -ENODEV;
-   return virtio_cons_early_init(early_put_chars);
-}
-console_initcall(s390_virtio_console_init);
-
-
 /*
  * We do this after core stuff, but before the drivers.
  */
diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
index 1c9af22..14ae293 100644
--- a/drivers/s390/kvm/virtio_ccw.c
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -17,7 +17,6 @@
 #include <linux/virtio.h>
 #include <linux/virtio_config.h>
 #include <linux/slab.h>
-#include <linux/virtio_console.h>
 #include <linux/interrupt.h>
 #include <linux/virtio_ring.h>
 #include <linux/pfn.h>
-- 
1.7.11.5



[RFC PATCH v2 0/7] s390: virtual css host support.

2012-09-04 Thread Cornelia Huck
Hi,

here's the second revision of the virtual channel subsystem support for
s390.

I changed the representation of the channel subsystem, introducing channel
subsystem images, which brings it closer to the actual implementation. A
new ioctl for adding a new channel subsystem image has also been introduced.

Cornelia Huck (7):
  s390/kvm: Support for I/O interrupts.
  s390/kvm: Add support for machine checks.
  s390/kvm: In-kernel handling of I/O instructions.
  s390: Move css limits from drivers/s390/cio/ to include/asm/.
  s390: Make some css-related structures usable by non-cio code.
  s390/kvm: Base infrastructure for enabling capabilities.
  s390/kvm: In-kernel channel subsystem support.

 Documentation/virtual/kvm/api.txt | 155 +-
 arch/s390/include/asm/cio.h   |   2 +
 arch/s390/include/asm/kvm_host.h  |  63 +++
 arch/s390/include/asm/orb.h   |  69 +++
 arch/s390/include/asm/schib.h |  52 ++
 arch/s390/kvm/Makefile|   2 +-
 arch/s390/kvm/css.c   | 989 ++
 arch/s390/kvm/intercept.c |  22 +-
 arch/s390/kvm/interrupt.c | 337 +++--
 arch/s390/kvm/ioinst.c| 797 ++
 arch/s390/kvm/kvm-s390.c  |  70 +++
 arch/s390/kvm/kvm-s390.h  |  43 ++
 arch/s390/kvm/priv.c  | 194 +++-
 arch/s390/kvm/trace-s390.h|  73 ++-
 arch/s390/kvm/trace.h |  22 +
 drivers/s390/cio/cio.h|  46 +-
 drivers/s390/cio/css.h|   3 -
 drivers/s390/cio/io_sch.h |   2 +-
 drivers/s390/cio/ioasm.h  |   2 +-
 drivers/s390/cio/orb.h|  67 ---
 include/linux/kvm.h   |  67 +++
 include/trace/events/kvm.h|   2 +-
 virt/kvm/kvm_main.c   |   3 +-
 23 files changed, 2908 insertions(+), 174 deletions(-)
 create mode 100644 arch/s390/include/asm/orb.h
 create mode 100644 arch/s390/include/asm/schib.h
 create mode 100644 arch/s390/kvm/css.c
 create mode 100644 arch/s390/kvm/ioinst.c
 delete mode 100644 drivers/s390/cio/orb.h

-- 
1.7.11.5



[PATCH v2 2/7] s390/kvm: Add support for machine checks.

2012-09-04 Thread Cornelia Huck
Add support for injecting machine checks (only repressible
conditions for now).

This is a bit more involved than I/O interrupts, for these reasons:

- Machine checks come in both floating and cpu varieties.
- We don't have a bit for machine checks enabling, but have to use
  a roundabout approach with trapping PSW changing instructions and
  watching for opened machine checks.

Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |   8 +++
 arch/s390/kvm/intercept.c|   2 +
 arch/s390/kvm/interrupt.c| 111 
 arch/s390/kvm/kvm-s390.h |   3 +
 arch/s390/kvm/priv.c | 133 +++
 arch/s390/kvm/trace-s390.h   |   6 +-
 include/linux/kvm.h  |   1 +
 7 files changed, 261 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index e47f697..556774d 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -77,8 +77,10 @@ struct kvm_s390_sie_block {
 	__u8	reserved40[4];		/* 0x0040 */
 #define LCTL_CR0	0x8000
 #define LCTL_CR6	0x0200
+#define LCTL_CR14	0x0002
 	__u16	lctl;			/* 0x0044 */
 	__s16	icpua;			/* 0x0046 */
+#define ICTL_LPSW 0x0200
 	__u32	ictl;			/* 0x0048 */
 	__u32	eca;			/* 0x004c */
 	__u8	icptcode;		/* 0x0050 */
@@ -189,6 +191,11 @@ struct kvm_s390_emerg_info {
__u16 code;
 };
 
+struct kvm_s390_mchk_info {
+   __u64 cr14;
+   __u64 mcic;
+};
+
 struct kvm_s390_interrupt_info {
struct list_head list;
u64 type;
@@ -199,6 +206,7 @@ struct kvm_s390_interrupt_info {
struct kvm_s390_emerg_info emerg;
struct kvm_s390_extcall_info extcall;
struct kvm_s390_prefix_info prefix;
+   struct kvm_s390_mchk_info mchk;
};
 };
 
diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index 22798ec..ec1177f 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -106,10 +106,12 @@ static int handle_lctl(struct kvm_vcpu *vcpu)
 
 static intercept_handler_t instruction_handlers[256] = {
[0x01] = kvm_s390_handle_01,
+   [0x82] = kvm_s390_handle_lpsw,
[0x83] = kvm_s390_handle_diag,
[0xae] = kvm_s390_handle_sigp,
[0xb2] = kvm_s390_handle_b2,
[0xb7] = handle_lctl,
+   [0xb9] = kvm_s390_handle_b9,
[0xe5] = kvm_s390_handle_e5,
[0xeb] = handle_lctlg,
 };
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 1dccfe7..edc065f 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -41,6 +41,11 @@ static int psw_ioint_disabled(struct kvm_vcpu *vcpu)
 	return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_IO);
 }
 
+static int psw_mchk_disabled(struct kvm_vcpu *vcpu)
+{
+	return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_MCHECK);
+}
+
 static int psw_interrupts_disabled(struct kvm_vcpu *vcpu)
 {
 	if ((vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PER) ||
@@ -82,6 +87,12 @@ static int __interrupt_is_deliverable(struct kvm_vcpu *vcpu,
 	case KVM_S390_SIGP_SET_PREFIX:
 	case KVM_S390_RESTART:
 		return 1;
+	case KVM_S390_MCHK:
+		if (psw_mchk_disabled(vcpu))
+			return 0;
+		if (vcpu->arch.sie_block->gcr[14] & inti->mchk.cr14)
+			return 1;
+		return 0;
 	default:
 		if (is_ioint(inti->type)) {
 			if (psw_ioint_disabled(vcpu))
@@ -119,6 +130,7 @@ static void __reset_intercept_indicators(struct kvm_vcpu *vcpu)
 		CPUSTAT_IO_INT | CPUSTAT_EXT_INT | CPUSTAT_STOP_INT,
 		&vcpu->arch.sie_block->cpuflags);
 	vcpu->arch.sie_block->lctl = 0x;
+	vcpu->arch.sie_block->ictl &= ~ICTL_LPSW;
 }
 
 static void __set_cpuflag(struct kvm_vcpu *vcpu, u32 flag)
@@ -142,6 +154,12 @@ static void __set_intercept_indicator(struct kvm_vcpu *vcpu,
 	case KVM_S390_SIGP_STOP:
 		__set_cpuflag(vcpu, CPUSTAT_STOP_INT);
 		break;
+	case KVM_S390_MCHK:
+		if (psw_mchk_disabled(vcpu))
+			vcpu->arch.sie_block->ictl |= ICTL_LPSW;
+		else
+			vcpu->arch.sie_block->lctl |= LCTL_CR14;
+		break;
 	default:
 		if (is_ioint(inti->type)) {
if (psw_ioint_disabled(vcpu))
@@ -330,6 +348,32 @@ static void __do_deliver_interrupt(struct kvm_vcpu *vcpu,
exception = 1;
break;
 
+   case KVM_S390_MCHK:
+		VCPU_EVENT(vcpu, 4, "interrupt: machine check mcic=%llx",
+			   inti->mchk.mcic);
+		trace_kvm_s390_deliver_interrupt(vcpu->vcpu_id, 

[RFC v2] s390: virtual channel subsystem and new virtio transport.

2012-09-04 Thread Cornelia Huck
Hi,

I have incorporated the feedback I received to my first RFC for
virtio-ccw (http://marc.info/?l=kvm&m=134435141402140&w=2) and will
post the updates shortly.

Patches will again be sorted into kernel host and guest, qemu, and
virtio spec.

Feedback is still welcome.

Cornelia



[RFC PATCH v2] Update virtio spec for virtio-ccw.

2012-09-04 Thread Cornelia Huck
Hi,

this is the second revision of the virtio-ccw spec.

The interface has been improved to support more than 32 feature bits
as well as allocating less than the requested queue size and specifying
alignment.

(Note: I folded the changes with git into my initial spec; I hope LyX can
handle this :)

Cornelia Huck (1):
  virtio-spec: Add virtio-ccw spec.

 virtio-spec.lyx |  534 +++
 1 files changed, 534 insertions(+), 0 deletions(-)

-- 
1.7.5.4



[PATCH v2 3/4] s390/kvm: Add a channel I/O based virtio transport driver.

2012-09-04 Thread Cornelia Huck
Add a driver for kvm guests that matches virtual ccw devices provided
by the host as virtio bridge devices.

These virtio-ccw devices use a special set of channel commands in order
to perform virtio functions.

Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---

Changes v1->v2:
- adapt to improved virtio-ccw channel commands
- fix unregistration of online devices
- add a missing spinlock initialization

---
 arch/s390/include/asm/irq.h   |   1 +
 arch/s390/kernel/irq.c|   1 +
 drivers/s390/kvm/Makefile |   2 +-
 drivers/s390/kvm/virtio_ccw.c | 790 ++
 4 files changed, 793 insertions(+), 1 deletion(-)
 create mode 100644 drivers/s390/kvm/virtio_ccw.c

diff --git a/arch/s390/include/asm/irq.h b/arch/s390/include/asm/irq.h
index 2b9d418..b4bea53 100644
--- a/arch/s390/include/asm/irq.h
+++ b/arch/s390/include/asm/irq.h
@@ -31,6 +31,7 @@ enum interruption_class {
IOINT_CTC,
IOINT_APB,
IOINT_CSC,
+   IOINT_VIR,
NMI_NMI,
NR_IRQS,
 };
diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index dd7630d..2cc7eed 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -56,6 +56,7 @@ static const struct irq_class intrclass_names[] = {
 	{.name = "CTC", .desc = "[I/O] CTC" },
 	{.name = "APB", .desc = "[I/O] AP Bus" },
 	{.name = "CSC", .desc = "[I/O] CHSC Subchannel" },
+	{.name = "VIR", .desc = "[I/O] Virtual I/O Devices" },
 	{.name = "NMI", .desc = "[NMI] Machine Check" },
 };
 
diff --git a/drivers/s390/kvm/Makefile b/drivers/s390/kvm/Makefile
index 0815690..241891a 100644
--- a/drivers/s390/kvm/Makefile
+++ b/drivers/s390/kvm/Makefile
@@ -6,4 +6,4 @@
 # it under the terms of the GNU General Public License (version 2 only)
 # as published by the Free Software Foundation.
 
-obj-$(CONFIG_S390_GUEST) += kvm_virtio.o
+obj-$(CONFIG_S390_GUEST) += kvm_virtio.o virtio_ccw.o
diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
new file mode 100644
index 000..1c9af22
--- /dev/null
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -0,0 +1,790 @@
+/*
+ * ccw based virtio transport
+ *
+ * Copyright IBM Corp. 2012
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ *Author(s): Cornelia Huck cornelia.h...@de.ibm.com
+ */
+
+#include <linux/kernel_stat.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <linux/err.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/slab.h>
+#include <linux/virtio_console.h>
+#include <linux/interrupt.h>
+#include <linux/virtio_ring.h>
+#include <linux/pfn.h>
+#include <linux/async.h>
+#include <linux/wait.h>
+#include <linux/list.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <asm/io.h>
+#include <asm/kvm_para.h>
+#include <asm/setup.h>
+#include <asm/irq.h>
+#include <asm/cio.h>
+#include <asm/ccwdev.h>
+
+/*
+ * virtio related functions
+ */
+
+struct vq_config_block {
+   __u16 index;
+   __u16 num;
+} __attribute__ ((packed));
+
+#define VIRTIO_CCW_CONFIG_SIZE 0x100
+/* same as PCI config space size, should be enough for all drivers */
+
+struct virtio_ccw_device {
+   struct virtio_device vdev;
+   __u8 status;
+   __u8 config[VIRTIO_CCW_CONFIG_SIZE];
+   struct ccw_device *cdev;
+   struct ccw1 ccw;
+   __u32 area;
+   __u32 curr_io;
+   int err;
+   wait_queue_head_t wait_q;
+   spinlock_t lock;
+   struct list_head virtqueues;
+   unsigned long indicators; /* XXX - works because we're under 64 bit */
+   struct vq_config_block *config_block;
+};
+
+struct vq_info_block {
+   __u64 queue;
+   __u32 align;
+   __u16 index;
+   __u16 num;
+} __attribute__ ((packed));
+
+struct virtio_feature_desc {
+   __u32 features;
+   __u8 index;
+} __attribute__ ((packed));
+
+struct virtio_ccw_vq_info {
+   struct virtqueue *vq;
+   int num;
+   int queue_index;
+   void *queue;
+   struct vq_info_block *info_block;
+   struct list_head node;
+};
+
+#define KVM_VIRTIO_CCW_RING_ALIGN 4096
+
+#define CCW_CMD_SET_VQ 0x13
+#define CCW_CMD_VDEV_RESET 0x33
+#define CCW_CMD_SET_IND 0x43
+#define CCW_CMD_READ_FEAT 0x12
+#define CCW_CMD_WRITE_FEAT 0x11
+#define CCW_CMD_READ_CONF 0x22
+#define CCW_CMD_WRITE_CONF 0x21
+#define CCW_CMD_WRITE_STATUS 0x31
+#define CCW_CMD_READ_VQ_CONF 0x32
+
+#define VIRTIO_CCW_DOING_SET_VQ 0x0001
+#define VIRTIO_CCW_DOING_RESET 0x0004
+#define VIRTIO_CCW_DOING_READ_FEAT 0x0008
+#define VIRTIO_CCW_DOING_WRITE_FEAT 0x0010
+#define VIRTIO_CCW_DOING_READ_CONFIG 0x0020
+#define VIRTIO_CCW_DOING_WRITE_CONFIG 0x0040
+#define VIRTIO_CCW_DOING_WRITE_STATUS 0x0080
+#define VIRTIO_CCW_DOING_SET_IND 0x0100
+#define VIRTIO_CCW_DOING_READ_VQ_CONF 0x0200
+#define VIRTIO_CCW_INTPARM_MASK 0x
+
+static struct virtio_ccw_device 

[PATCH v2 5/5] [HACK] Handle multiple virtio aliases.

2012-09-04 Thread Cornelia Huck
This patch enables using both virtio-xxx-s390 and virtio-xxx-ccw
by making the alias lookup code verify that a driver is actually
registered.

(Only included in order to allow testing of virtio-ccw; should be
replaced by cleaning up the virtio bus model.)
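
The idea behind the hack can be sketched outside of QEMU as follows. Two
table entries share one alias, and the lookup returns the first entry whose
type is actually usable. Everything named sketch_* is hypothetical; the
usable() callback merely stands in for the object_class_by_name() and
qdev_verify_bus() checks in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sketch_alias {
	const char *typename; /* concrete device type */
	const char *alias;    /* short name the user may pass */
};

/* Two transports provide the same alias, as in the patched table. */
static const struct sketch_alias sketch_alias_table[] = {
	{ "virtio-blk-s390", "virtio-blk" },
	{ "virtio-blk-ccw",  "virtio-blk" },
	{ NULL, NULL },
};

/* Stand-in for "class is registered and its bus exists". */
typedef bool (*sketch_usable_fn)(const char *typename);

/* Return the first type name for 'alias' that passes 'usable'.
 * A NULL 'usable' skips the check, like check_bus == false. */
static const char *sketch_find_typename(const char *alias,
					sketch_usable_fn usable)
{
	size_t i;

	assert(alias != NULL);
	for (i = 0; sketch_alias_table[i].alias; i++) {
		if (strcmp(sketch_alias_table[i].alias, alias) != 0)
			continue;
		if (!usable || usable(sketch_alias_table[i].typename))
			return sketch_alias_table[i].typename;
	}
	return NULL;
}

/* Example predicate: pretend only the ccw transport has a bus. */
static bool sketch_only_ccw(const char *typename)
{
	return strcmp(typename, "virtio-blk-ccw") == 0;
}
```

Without the predicate the first table entry always wins, which is why the
unpatched lookup cannot serve both transports from a single alias.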

Not-signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---
 blockdev.c|  6 +---
 hw/qdev-monitor.c | 85 +--
 vl.c  |  6 +---
 3 files changed, 53 insertions(+), 44 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 7c83baa..a7c39b6 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -560,11 +560,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
     case IF_VIRTIO:
         /* add virtio block device */
         opts = qemu_opts_create(qemu_find_opts("device"), NULL, 0, NULL);
-        if (arch_type == QEMU_ARCH_S390X) {
-            qemu_opt_set(opts, "driver", "virtio-blk-s390");
-        } else {
-            qemu_opt_set(opts, "driver", "virtio-blk-pci");
-        }
+        qemu_opt_set(opts, "driver", "virtio-blk");
         qemu_opt_set(opts, "drive", dinfo->id);
         if (devaddr)
             qemu_opt_set(opts, "addr", devaddr);
diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c
index 92b7c59..9245a1e 100644
--- a/hw/qdev-monitor.c
+++ b/hw/qdev-monitor.c
@@ -118,9 +118,53 @@ static int set_property(const char *name, const char *value, void *opaque)
 return 0;
 }
 
-static const char *find_typename_by_alias(const char *alias)
+static BusState *qbus_find_recursive(BusState *bus, const char *name,
+                                     const char *bus_typename)
+{
+    BusChild *kid;
+    BusState *child, *ret;
+    int match = 1;
+
+    if (name && (strcmp(bus->name, name) != 0)) {
+        match = 0;
+    }
+    if (bus_typename &&
+        (strcmp(object_get_typename(OBJECT(bus)), bus_typename) != 0)) {
+        match = 0;
+    }
+    if (match) {
+        return bus;
+    }
+
+    QTAILQ_FOREACH(kid, &bus->children, sibling) {
+        DeviceState *dev = kid->child;
+        QLIST_FOREACH(child, &dev->child_bus, sibling) {
+            ret = qbus_find_recursive(child, name, bus_typename);
+            if (ret) {
+                return ret;
+            }
+        }
+    }
+    return NULL;
+}
+
+static bool qdev_verify_bus(DeviceClass *dc)
+{
+    BusState *bus;
+
+    if (dc) {
+        bus = qbus_find_recursive(sysbus_get_default(), NULL, dc->bus_type);
+        if (bus) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static const char *find_typename_by_alias(const char *alias, bool check_bus)
 {
     int i;
+    ObjectClass *oc;
 
     for (i = 0; qdev_alias_table[i].alias; i++) {
         if (qdev_alias_table[i].arch_mask &&
@@ -129,7 +173,10 @@ static const char *find_typename_by_alias(const char *alias)
         }
 
         if (strcmp(qdev_alias_table[i].alias, alias) == 0) {
-            return qdev_alias_table[i].typename;
+            oc = object_class_by_name(qdev_alias_table[i].typename);
+            if (oc && (!check_bus || qdev_verify_bus(DEVICE_CLASS(oc)))) {
+                return qdev_alias_table[i].typename;
+            }
         }
     }
 
@@ -155,7 +202,7 @@ int qdev_device_help(QemuOpts *opts)
 
 klass = object_class_by_name(driver);
 if (!klass) {
-const char *typename = find_typename_by_alias(driver);
+const char *typename = find_typename_by_alias(driver, false);
 
 if (typename) {
 driver = typename;
@@ -283,36 +330,6 @@ static DeviceState *qbus_find_dev(BusState *bus, char *elem)
 return NULL;
 }
 
-static BusState *qbus_find_recursive(BusState *bus, const char *name,
-                                     const char *bus_typename)
-{
-    BusChild *kid;
-    BusState *child, *ret;
-    int match = 1;
-
-    if (name && (strcmp(bus->name, name) != 0)) {
-        match = 0;
-    }
-    if (bus_typename &&
-        (strcmp(object_get_typename(OBJECT(bus)), bus_typename) != 0)) {
-        match = 0;
-    }
-    if (match) {
-        return bus;
-    }
-
-    QTAILQ_FOREACH(kid, &bus->children, sibling) {
-        DeviceState *dev = kid->child;
-        QLIST_FOREACH(child, &dev->child_bus, sibling) {
-            ret = qbus_find_recursive(child, name, bus_typename);
-            if (ret) {
-                return ret;
-            }
-        }
-    }
-    return NULL;
-}
-
 static BusState *qbus_find(const char *path)
 {
 DeviceState *dev;
@@ -417,7 +434,7 @@ DeviceState *qdev_device_add(QemuOpts *opts)
 /* find driver */
 obj = object_class_by_name(driver);
 if (!obj) {
-const char *typename = find_typename_by_alias(driver);
+const char *typename = find_typename_by_alias(driver, true);
 
 if (typename) {
 driver = typename;
diff --git a/vl.c b/vl.c
index 2b8cae6..788a536 100644
--- a/vl.c
+++ b/vl.c
@@ -2113,11 +2113,7 @@ static int virtcon_parse(const char *devname)
 }
 
 bus_opts = qemu_opts_create(device, NULL, 0, NULL);
-if 

[PATCH v2] virtio-spec: Add virtio-ccw spec.

2012-09-04 Thread Cornelia Huck
Add specifications for the new s390 specific virtio-ccw transport.

Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---

Changes v1->v2:
- support more than 32 feature bits
- allow to allocate less than requested queue size
- allow to transfer alignment

---
 virtio-spec.lyx |  534 +++
 1 files changed, 534 insertions(+), 0 deletions(-)

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index 7a073f4..8247d2e 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -57,6 +57,7 @@
 \html_css_as_file 0
 \html_be_strict false
 \author -608949062 Rusty Russell,,, 
+\author -385801441 Cornelia Huck cornelia.h...@de.ibm.com
 \author 1531152142 Paolo Bonzini,,, 
 \end_header
 
@@ -9350,8 +9351,541 @@ tatus register description is asserted.
  After the interrupt is handled, the driver must acknowledge it by writing
  a bit mask corresponding to the serviced interrupt to the InterruptACK
  register.
+\change_inserted -385801441 1343732742
+
 \end_layout
 
 \end_deeper
+\begin_layout Chapter*
+
+\change_inserted -385801441 1343732726
+Appendix Y: virtio-ccw
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted -385801441 1343732726
+S/390 based virtual machines support neither PCI nor MMIO, so a different
+ transport is needed there.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted -385801441 1343732726
+The old s390-virtio mechanism used a special page mapped above the guest's
+ memory and several diagnose calls (hypercalls); it does have some drawbacks,
+ however, like a rather limited number of devices and very restricted hotplug
+ support.
+ Moreover, device discovery and operation differ from other environments
+ on the S/390 platform.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted -385801441 1343732726
+virtio-ccw uses the standard channel I/O based mechanism used for the majority
+ of devices on S/390.
+ A virtual channel device with a special control unit type acts as proxy
+ to the virtio device (similar to the way virtio-pci uses a PCI device)
+ and configuration and operation of the virtio device is accomplished (mostly)
+ via channel commands.
+ This means virtio devices are discoverable via standard operating system
+ algorithms, and adding virtio support is mainly a question of supporting
+ a new control unit type.
+\end_layout
+
+\begin_layout Subsection*
+
+\change_inserted -385801441 1343732726
+Basic Concepts
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted -385801441 1343732817
+As a proxy device, virtio-ccw uses a channel-attached I/O control unit with
+ a special control unit type (0x3832) and a control unit model corresponding
+ to the attached virtio device's subsystem device ID, accessed via a virtual
+ I/O subchannel and a virtual channel path of type 0x32.
+ This proxy device is discoverable via normal channel subsystem device discovery
+ (usually a STORE SUBCHANNEL loop) and answers to the basic channel commands,
+ most importantly SENSE ID.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted -385801441 1343732726
+In addition to the basic channel commands, virtio-ccw defines a set of channel
+ commands related to configuration and operation of virtio:
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+\begin_inset listings
+inline false
+status open
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_SET_VQ 0x13
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_VDEV_RESET 0x33
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_SET_IND 0x43
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_READ_FEAT 0x12
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_WRITE_FEAT 0x11
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_READ_CONF 0x22 
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_WRITE_CONF 0x21
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_WRITE_STATUS 0x31
+\end_layout
+
+\begin_layout LyX-Code
+
+\change_inserted -385801441 1343732726
+
+#define CCW_CMD_READ_VQ_CONF 0x32
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Subsection*
+
+\change_inserted -385801441 1343732726
+Device Initialization
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted -385801441 1343732726
+virtio-ccw uses several channel commands to set up a device.
+\end_layout
+
+\begin_layout Subsubsection*
+
+\change_inserted -385801441 1343732726
+Configuring a Virtqueue
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted -385801441 1343732726
+CCW_CMD_READ_VQ_CONF is issued by the guest to obtain information about
+ a queue.
+ It uses the following structure for communicating:

[PATCH v2 3/5] s390: Add new channel I/O based virtio transport.

2012-09-04 Thread Cornelia Huck
Add a new virtio transport that uses channel commands to perform
virtio operations.

Add a new machine type s390-ccw that uses this virtio-ccw transport
and make it the default machine for s390.

Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---

Changes v1->v2:
- update to virtio-ccw interface changes

---
 hw/qdev-monitor.c  |   5 +
 hw/s390-virtio.c   | 277 
 hw/s390x/Makefile.objs |   1 +
 hw/s390x/css.c |  45 +++
 hw/s390x/css.h |   3 +
 hw/s390x/virtio-ccw.c  | 875 +
 hw/s390x/virtio-ccw.h  |  79 +
 vl.c   |   1 +
 8 files changed, 1215 insertions(+), 71 deletions(-)
 create mode 100644 hw/s390x/virtio-ccw.c
 create mode 100644 hw/s390x/virtio-ccw.h

diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c
index 33b7f79..92b7c59 100644
--- a/hw/qdev-monitor.c
+++ b/hw/qdev-monitor.c
@@ -42,6 +42,11 @@ static const QDevAlias qdev_alias_table[] = {
     { "virtio-blk-s390", "virtio-blk", QEMU_ARCH_S390X },
     { "virtio-net-s390", "virtio-net", QEMU_ARCH_S390X },
     { "virtio-serial-s390", "virtio-serial", QEMU_ARCH_S390X },
+    { "virtio-blk-ccw", "virtio-blk", QEMU_ARCH_S390X },
+    { "virtio-net-ccw", "virtio-net", QEMU_ARCH_S390X },
+    { "virtio-serial-ccw", "virtio-serial", QEMU_ARCH_S390X },
+    { "virtio-balloon-ccw", "virtio-balloon", QEMU_ARCH_S390X },
+    { "virtio-scsi-ccw", "virtio-scsi", QEMU_ARCH_S390X },
     { "lsi53c895a", "lsi" },
     { "ich9-ahci", "ahci" },
 { }
diff --git a/hw/s390-virtio.c b/hw/s390-virtio.c
index 47eed35..2509291 100644
--- a/hw/s390-virtio.c
+++ b/hw/s390-virtio.c
@@ -30,8 +30,11 @@
 #include "hw/sysbus.h"
 #include "kvm.h"
 #include "exec-memory.h"
+#include "qemu-thread.h"
 
 #include "hw/s390-virtio-bus.h"
+#include "hw/s390x/css.h"
+#include "hw/s390x/virtio-ccw.h"
 
 //#define DEBUG_S390
 
@@ -46,6 +49,7 @@
 #define KVM_S390_VIRTIO_NOTIFY  0
 #define KVM_S390_VIRTIO_RESET   1
 #define KVM_S390_VIRTIO_SET_STATUS  2
+#define KVM_S390_VIRTIO_CCW_NOTIFY  3
 
 #define KERN_IMAGE_START0x01UL
 #define KERN_PARM_AREA  0x010480UL
@@ -62,6 +66,7 @@
 
 static VirtIOS390Bus *s390_bus;
 static S390CPU **ipi_states;
+VirtioCcwBus *ccw_bus;
 
 S390CPU *s390_cpu_addr2state(uint16_t cpu_addr)
 {
@@ -75,15 +80,21 @@ S390CPU *s390_cpu_addr2state(uint16_t cpu_addr)
 int s390_virtio_hypercall(CPUS390XState *env, uint64_t mem, uint64_t hypercall)
 {
 int r = 0, i;
+int cssid, ssid, schid, m;
+SubchDev *sch;
 
     dprintf("KVM hypercall: %ld\n", hypercall);
     switch (hypercall) {
     case KVM_S390_VIRTIO_NOTIFY:
         if (mem > ram_size) {
-            VirtIOS390Device *dev = s390_virtio_bus_find_vring(s390_bus,
-                                                               mem, &i);
-            if (dev) {
-                virtio_queue_notify(dev->vdev, i);
+            if (s390_bus) {
+                VirtIOS390Device *dev = s390_virtio_bus_find_vring(s390_bus,
+                                                                   mem, &i);
+                if (dev) {
+                    virtio_queue_notify(dev->vdev, i);
+                } else {
+                    r = -EINVAL;
+                }
             } else {
                 r = -EINVAL;
             }
@@ -92,28 +103,49 @@ int s390_virtio_hypercall(CPUS390XState *env, uint64_t mem, uint64_t hypercall)
 }
 break;
 case KVM_S390_VIRTIO_RESET:
-{
-VirtIOS390Device *dev;
-
-dev = s390_virtio_bus_find_mem(s390_bus, mem);
-virtio_reset(dev->vdev);
-stb_phys(dev->dev_offs + VIRTIO_DEV_OFFS_STATUS, 0);
-s390_virtio_device_sync(dev);
-s390_virtio_reset_idx(dev);
+if (s390_bus) {
+VirtIOS390Device *dev;
+
+dev = s390_virtio_bus_find_mem(s390_bus, mem);
+virtio_reset(dev->vdev);
+stb_phys(dev->dev_offs + VIRTIO_DEV_OFFS_STATUS, 0);
+s390_virtio_device_sync(dev);
+s390_virtio_reset_idx(dev);
+} else {
+r = -EINVAL;
+}
 break;
-}
 case KVM_S390_VIRTIO_SET_STATUS:
-{
-VirtIOS390Device *dev;
+if (s390_bus) {
+VirtIOS390Device *dev;
 
-dev = s390_virtio_bus_find_mem(s390_bus, mem);
-if (dev) {
-s390_virtio_device_update_status(dev);
+dev = s390_virtio_bus_find_mem(s390_bus, mem);
+if (dev) {
+s390_virtio_device_update_status(dev);
+} else {
+r = -EINVAL;
+}
 } else {
 r = -EINVAL;
 }
 break;
-}
+case KVM_S390_VIRTIO_CCW_NOTIFY:
+if (ccw_bus) {
+if (ioinst_disassemble_sch_ident(env->regs[2], &m, &cssid, &ssid,
+ &schid)) {
+r = -EINVAL;
+} else {
+sch = css_find_subch(m, cssid, ssid, schid);
+if (sch) {
+  

[PATCH v2 2/5] s390: Virtual channel subsystem support.

2012-09-04 Thread Cornelia Huck
Provide a mechanism for qemu to provide fully virtual subchannels to
the guest. In the KVM case, this relies on the kernel's css support.
The !KVM case is not yet supported.
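
The nesting introduced in v2 (channel subsystem, then css image, then
subchannel set, then subchannel) can be sketched as a plain lookup. The sk_*
names and the tiny limits are hypothetical, chosen only to keep the example
small; the real MAX_* values and css_find_subch() also deal with the
m bit, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Deliberately small, hypothetical limits; the real MAX_* values differ. */
#define SK_MAX_CSSID 3
#define SK_MAX_SSID  3
#define SK_MAX_SCHID 15

struct sk_subch {
	uint16_t devno; /* device number, for illustration only */
};

struct sk_subch_set {
	struct sk_subch *sch[SK_MAX_SCHID + 1];
};

struct sk_css_image {
	struct sk_subch_set *sch_set[SK_MAX_SSID + 1];
};

struct sk_channel_subsys {
	struct sk_css_image *css[SK_MAX_CSSID + 1];
	uint8_t default_cssid;
};

/* Walk image -> subchannel set -> subchannel; any missing level or
 * out-of-range id yields NULL. */
static struct sk_subch *sk_find_subch(const struct sk_channel_subsys *cs,
				      unsigned int cssid, unsigned int ssid,
				      unsigned int schid)
{
	const struct sk_css_image *img;
	const struct sk_subch_set *set;

	assert(cs != NULL);
	if (cssid > SK_MAX_CSSID || ssid > SK_MAX_SSID || schid > SK_MAX_SCHID)
		return NULL;
	img = cs->css[cssid];
	if (!img)
		return NULL;
	set = img->sch_set[ssid];
	if (!set)
		return NULL;
	return set->sch[schid];
}
```

Allocating images and subchannel sets lazily, as the pointer arrays
suggest, keeps an otherwise sparse id space cheap to represent.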

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---

Changes v1 -> v2:
- coding style
- re-organization of hardware structures (channel subsystem vs. channel
  subsystem image)
- use new KVM_S390_ADD_CSS ioctl

---
 hw/s390x/Makefile.objs |   1 +
 hw/s390x/css.c | 490 +
 hw/s390x/css.h |  60 ++
 target-s390x/Makefile.objs |   2 +-
 target-s390x/cpu.h | 126 
 target-s390x/ioinst.c  |  38 
 target-s390x/ioinst.h  | 173 
 target-s390x/kvm.c | 118 +++
 8 files changed, 1007 insertions(+), 1 deletion(-)
 create mode 100644 hw/s390x/css.c
 create mode 100644 hw/s390x/css.h
 create mode 100644 target-s390x/ioinst.c
 create mode 100644 target-s390x/ioinst.h

diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs
index dcdcac8..93b41fb 100644
--- a/hw/s390x/Makefile.objs
+++ b/hw/s390x/Makefile.objs
@@ -1,3 +1,4 @@
 obj-y = s390-virtio-bus.o s390-virtio.o
 
 obj-y := $(addprefix ../,$(obj-y))
+obj-y += css.o
diff --git a/hw/s390x/css.c b/hw/s390x/css.c
new file mode 100644
index 0000000..b9b6e48
--- /dev/null
+++ b/hw/s390x/css.c
@@ -0,0 +1,490 @@
+/*
+ * Channel subsystem base support.
+ *
+ * Copyright 2012 IBM Corp.
+ * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include "qemu-thread.h"
+#include "qemu-queue.h"
+#include "hw/qdev.h"
+#include "bitops.h"
+#include "kvm.h"
+#include "cpu.h"
+#include "ioinst.h"
+#include "css.h"
+
+typedef struct ChpInfo {
+uint8_t in_use;
+uint8_t type;
+} ChpInfo;
+
+typedef struct SubchSet {
+SubchDev *sch[MAX_SCHID + 1];
+unsigned long schids_used[BITS_TO_LONGS(MAX_SCHID + 1)];
+unsigned long devnos_used[BITS_TO_LONGS(MAX_SCHID + 1)];
+} SubchSet;
+
+typedef struct CssImage {
+SubchSet *sch_set[MAX_SSID + 1];
+ChpInfo chpids[MAX_CHPID + 1];
+} CssImage;
+
+typedef struct ChannelSubSys {
+CssImage *css[MAX_CSSID + 1];
+uint8_t default_cssid;
+} ChannelSubSys;
+
+static ChannelSubSys *channel_subsys;
+
+int css_create_css_image(uint8_t cssid, bool default_image)
+{
+if (cssid > MAX_CSSID) {
+return -EINVAL;
+}
+if (channel_subsys->css[cssid]) {
+return -EBUSY;
+}
+channel_subsys->css[cssid] = g_try_malloc0(sizeof(CssImage));
+if (!channel_subsys->css[cssid]) {
+return -ENOMEM;
+}
+if (default_image) {
+channel_subsys->default_cssid = cssid;
+}
+s390_new_css_image(cssid, default_image);
+return 0;
+}
+
+static void css_inject_io_interrupt(SubchDev *sch, uint8_t func)
+{
+s390_io_interrupt(sch->cssid, sch->ssid, sch->schid, sch->curr_status.scsw,
+  sch->curr_status.pmcw, sch->sense_data, 0,
+  sch->curr_status.pmcw.isc, sch->curr_status.pmcw.intparm,
+  func);
+}
+
+void css_conditional_io_interrupt(SubchDev *sch)
+{
+s390_io_interrupt(sch->cssid, sch->ssid, sch->schid, sch->curr_status.scsw,
+  sch->curr_status.pmcw, sch->sense_data, 1,
+  sch->curr_status.pmcw.isc, sch->curr_status.pmcw.intparm, 0);
+}
+
+static void sch_handle_clear_func(SubchDev *sch)
+{
+PMCW *p = &sch->curr_status.pmcw;
+SCSW *s = &sch->curr_status.scsw;
+int path;
+
+/* Path management: In our simple css, we always choose the only path. */
+path = 0x80;
+
+/* Reset values prior to 'issuing the clear signal'. */
+p->lpum = 0;
+p->pom = 0xff;
+s->pno = 0;
+
+/* We always 'attempt to issue the clear signal', and we always succeed. */
+sch->orb = NULL;
+sch->channel_prog = NULL;
+sch->last_cmd = NULL;
+s->actl &= ~SCSW_ACTL_CLEAR_PEND;
+s->stctl |= SCSW_STCTL_STATUS_PEND;
+
+s->dstat = 0;
+s->cstat = 0;
+p->lpum = path;
+
+}
+
+static void sch_handle_halt_func(SubchDev *sch)
+{
+
+PMCW *p = &sch->curr_status.pmcw;
+SCSW *s = &sch->curr_status.scsw;
+int path;
+
+/* Path management: In our simple css, we always choose the only path. */
+path = 0x80;
+
+/* We always 'attempt to issue the halt signal', and we always succeed. */
+sch->orb = NULL;
+sch->channel_prog = NULL;
+sch->last_cmd = NULL;
+s->actl &= ~SCSW_ACTL_HALT_PEND;
+s->stctl |= SCSW_STCTL_STATUS_PEND;
+
+if ((s->actl & (SCSW_ACTL_SUBCH_ACTIVE | SCSW_ACTL_DEVICE_ACTIVE)) ||
+!((s->actl & SCSW_ACTL_START_PEND) ||
+  (s->actl & SCSW_ACTL_SUSP))) {
+s->dstat = SCSW_DSTAT_DEVICE_END;
+}
+s->cstat = 0;
+p->lpum = path;
+
+}
+
+static int css_interpret_ccw(SubchDev *sch, CCW1 *ccw)
+{
+int ret;
+bool check_len;
+int len;
+int i;
+
+if 

[PATCH v2 1/5] Update headers for upcoming s390 changes.

2012-09-04 Thread Cornelia Huck
Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---

Changes v1->v2:
- update to latest kvm interface changes

---
 linux-headers/asm-generic/kvm_para.h |  5 +++
 linux-headers/asm-x86/kvm.h  |  1 +
 linux-headers/linux/kvm.h| 80 ++--
 3 files changed, 83 insertions(+), 3 deletions(-)
 create mode 100644 linux-headers/asm-generic/kvm_para.h

diff --git a/linux-headers/asm-generic/kvm_para.h b/linux-headers/asm-generic/kvm_para.h
new file mode 100644
index 0000000..63df88b
--- /dev/null
+++ b/linux-headers/asm-generic/kvm_para.h
@@ -0,0 +1,5 @@
+#ifndef _ASM_GENERIC_KVM_PARA_H
+#define _ASM_GENERIC_KVM_PARA_H
+
+
+#endif
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 246617e..521bf25 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -25,6 +25,7 @@
 #define __KVM_HAVE_DEBUGREGS
 #define __KVM_HAVE_XSAVE
 #define __KVM_HAVE_XCRS
+#define __KVM_HAVE_READONLY_MEM
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 4b9e575..1e87d71 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -101,9 +101,13 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
-/* for kvm_memory_region::flags */
-#define KVM_MEM_LOG_DIRTY_PAGES  1UL
-#define KVM_MEMSLOT_INVALID  (1UL << 1)
+/*
+ * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
+ * other bits are reserved for kvm internal use which are defined in
+ * include/linux/kvm_host.h.
+ */
+#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
+#define KVM_MEM_READONLY        (1UL << 1)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -163,6 +167,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_OSI  18
 #define KVM_EXIT_PAPR_HCALL  19
 #define KVM_EXIT_S390_UCONTROL   20
+#define KVM_EXIT_S390_SCH_IO  21
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -276,6 +281,20 @@ struct kvm_run {
__u64 ret;
__u64 args[9];
} papr_hcall;
+   /* KVM_EXIT_S390_SCH_IO */
+   struct {
+   __u32 sch_id;
+#define SCH_DO_CSCH 0
+#define SCH_DO_HSCH 1
+#define SCH_DO_SSCH 2
+#define SCH_DO_RSCH 3
+#define SCH_DO_XSCH 4
+   __u8 func;
+   __u8 pad;
+   __u64 orb;
+   __u32 scsw[3];
+   __u32 pmcw[7];
+   } s390_sch_io;
/* Fix the size of the union. */
char padding[256];
};
@@ -388,10 +407,17 @@ struct kvm_s390_psw {
 #define KVM_S390_PROGRAM_INT   0xfffe0001u
 #define KVM_S390_SIGP_SET_PREFIX   0xfffe0002u
 #define KVM_S390_RESTART   0xfffe0003u
+#define KVM_S390_MCHK  0xfffe1000u
 #define KVM_S390_INT_VIRTIO        0xffff2603u
 #define KVM_S390_INT_SERVICE       0xffff2401u
 #define KVM_S390_INT_EMERGENCY     0xffff1201u
 #define KVM_S390_INT_EXTERNAL_CALL 0xffff1202u
+#define KVM_S390_INT_IO(ai,cssid,ssid,schid)   \
+   (((schid)) |   \
+((ssid) << 16) |  \
+((cssid) << 18) | \
+((ai) << 26))
+
 
 struct kvm_s390_interrupt {
__u32 type;
@@ -473,6 +499,45 @@ struct kvm_ppc_smmu_info {
struct kvm_ppc_one_seg_page_size sps[KVM_PPC_PAGE_SIZES_MAX_SZ];
 };
 
+/* for KVM_S390_CSS_NOTIFY */
+struct kvm_css_notify {
+   __u8 cssid;
+   __u8 ssid;
+   __u16 schid;
+   __u32 scsw[3];
+   __u32 pmcw[7];
+   __u8 sense_data[32];
+   __u8 unsolicited;
+   __u8 func;
+};
+
+/* for KVM_S390_CCW_HOTPLUG */
+struct kvm_s390_sch_info {
+   __u8 cssid;
+   __u8 ssid;
+   __u16 schid;
+   __u16 devno;
+   __u32 schib[12];
+   int hotplugged;
+   int add;
+   int virtual;
+};
+
+/* for KVM_S390_CHP_HOTPLUG */
+struct kvm_s390_chp_info {
+   __u8 cssid;
+   __u8 chpid;
+   __u8 type;
+   int add;
+   int virtual;
+};
+
+/* for KVM_S390_ADD_CSS */
+struct kvm_s390_css_info {
+   __u8 cssid;
+   __u8 default_image;
+};
+
 #define KVMIO 0xAE
 
 /* machine type bits, to be used as argument to KVM_CREATE_VM */
@@ -618,6 +683,10 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#ifdef __KVM_HAVE_READONLY_MEM
+#define KVM_CAP_READONLY_MEM 81
+#endif
+#define KVM_CAP_S390_CSS_SUPPORT 82
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -831,6 +900,11 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_SMMU_INFO  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB  _IOWR(KVMIO, 0xa7, __u32)
+/* Available with 

[RFC PATCH v2 0/5] qemu: s390: virtual css and virtio-ccw.

2012-09-04 Thread Cornelia Huck
Hi,

here's the second version of virtual channel I/O and the new virtio-ccw
transport.

Changes to the first version include coding style fixes, changes in the
organization of objects (not quite finished yet), adaptions to changes
in the kernel interface and implementation of the improved virtio-ccw
primitives.

Cornelia Huck (5):
  Update headers for upcoming s390 changes.
  s390: Virtual channel subsystem support.
  s390: Add new channel I/O based virtio transport.
  s390: Virtual channel subsystem support for !KVM.
  [HACK] Handle multiple virtio aliases.

 blockdev.c   |6 +-
 hw/qdev-monitor.c|   90 ++-
 hw/s390-virtio.c |  277 ++--
 hw/s390x/Makefile.objs   |2 +
 hw/s390x/css.c   | 1280 ++
 hw/s390x/css.h   |   89 +++
 hw/s390x/virtio-ccw.c|  875 +++
 hw/s390x/virtio-ccw.h|   79 +++
 linux-headers/asm-generic/kvm_para.h |5 +
 linux-headers/asm-x86/kvm.h  |1 +
 linux-headers/linux/kvm.h|   80 ++-
 target-s390x/Makefile.objs   |2 +-
 target-s390x/cpu.h   |  277 
 target-s390x/helper.c|  140 
 target-s390x/ioinst.c|  734 +++
 target-s390x/ioinst.h|  206 ++
 target-s390x/kvm.c   |  282 +++-
 target-s390x/op_helper.c |   22 +-
 vl.c |7 +-
 19 files changed, 4313 insertions(+), 141 deletions(-)
 create mode 100644 hw/s390x/css.c
 create mode 100644 hw/s390x/css.h
 create mode 100644 hw/s390x/virtio-ccw.c
 create mode 100644 hw/s390x/virtio-ccw.h
 create mode 100644 linux-headers/asm-generic/kvm_para.h
 create mode 100644 target-s390x/ioinst.c
 create mode 100644 target-s390x/ioinst.h

-- 
1.7.11.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/4] s390/kvm: Handle hosts not supporting s390-virtio.

2012-09-04 Thread Cornelia Huck
Running under a kvm host does not necessarily imply the presence of
a page mapped above the main memory with the virtio information;
however, the code includes a hard coded access to that page.

Instead, check for the presence of the page and exit gracefully
before we hit an addressing exception if it does not exist.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---

Changes v1->v2:

- check for the presence of the page with lura
- reorder init sequence
- comments

---
 drivers/s390/kvm/kvm_virtio.c | 39 +++
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 47cccd5..76b95f3 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -419,6 +419,26 @@ static void kvm_extint_handler(struct ext_code ext_code,
 }
 
 /*
+ * For s390-virtio, we expect a page above main storage containing
+ * the virtio configuration. Try to actually load from this area
+ * in order to figure out if the host provides this page.
+ */
+static int __init test_devices_support(unsigned long addr)
+{
+   int ret = -EIO;
+
+   asm volatile(
+   "0: lura    0,%1\n"
+   "1: xgr %0,%0\n"
+   "2:\n"
+   EX_TABLE(0b,2b)
+   EX_TABLE(1b,2b)
+   : "+d" (ret)
+   : "a" (addr)
+   : "0", "cc");
+   return ret;
+}
+/*
  * Init function for virtio
  * devices are in a single page above top of normal mem
  */
@@ -429,21 +449,24 @@ static int __init kvm_devices_init(void)
if (!MACHINE_IS_KVM)
return -ENODEV;
 
+   if (test_devices_support(real_memory_size) < 0)
+   /* No error. */
+   return 0;
+
+   rc = vmem_add_mapping(real_memory_size, PAGE_SIZE);
+   if (rc)
+   return rc;
+
+   kvm_devices = (void *) real_memory_size;
+
kvm_root = root_device_register("kvm_s390");
if (IS_ERR(kvm_root)) {
rc = PTR_ERR(kvm_root);
printk(KERN_ERR "Could not register kvm_s390 root device");
+   vmem_remove_mapping(real_memory_size, PAGE_SIZE);
return rc;
}
 
-   rc = vmem_add_mapping(real_memory_size, PAGE_SIZE);
-   if (rc) {
-   root_device_unregister(kvm_root);
-   return rc;
-   }
-
-   kvm_devices = (void *) real_memory_size;
-
INIT_WORK(&hotplug_work, hotplug_devices);
 
service_subclass_irq_register();
-- 
1.7.11.5



[PATCH v2 5/7] s390: Make some css-related structures usable by non-cio code.

2012-09-04 Thread Cornelia Huck
kvm will need to use some css-related structures (pmcw, schib, orb),
so let's move the definitions from drivers/s390/cio/ to include/asm/.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/include/asm/orb.h   | 69 +++
 arch/s390/include/asm/schib.h | 52 
 drivers/s390/cio/cio.h| 46 +
 drivers/s390/cio/io_sch.h |  2 +-
 drivers/s390/cio/ioasm.h  |  2 +-
 drivers/s390/cio/orb.h| 67 -
 6 files changed, 124 insertions(+), 114 deletions(-)
 create mode 100644 arch/s390/include/asm/orb.h
 create mode 100644 arch/s390/include/asm/schib.h
 delete mode 100644 drivers/s390/cio/orb.h

diff --git a/arch/s390/include/asm/orb.h b/arch/s390/include/asm/orb.h
new file mode 100644
index 0000000..ca5d255
--- /dev/null
+++ b/arch/s390/include/asm/orb.h
@@ -0,0 +1,69 @@
+/*
+ * Orb related data structures.
+ *
+ * Copyright IBM Corp. 2007, 2011
+ *
+ * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
+ *   Peter Oberparleiter peter.oberparlei...@de.ibm.com
+ *   Sebastian Ott seb...@linux.vnet.ibm.com
+ */
+
+#ifndef S390_ORB_H
+#define S390_ORB_H
+
+#include <linux/types.h>
+
+/*
+ * Command-mode operation request block
+ */
+struct cmd_orb {
+   u32 intparm;/* interruption parameter */
+   u32 key:4;  /* flags, like key, suspend control, etc. */
+   u32 spnd:1; /* suspend control */
+   u32 res1:1; /* reserved */
+   u32 mod:1;  /* modification control */
+   u32 sync:1; /* synchronize control */
+   u32 fmt:1;  /* format control */
+   u32 pfch:1; /* prefetch control */
+   u32 isic:1; /* initial-status-interruption control */
+   u32 alcc:1; /* address-limit-checking control */
+   u32 ssic:1; /* suppress-suspended-interr. control */
+   u32 res2:1; /* reserved */
+   u32 c64:1;  /* IDAW/QDIO 64 bit control  */
+   u32 i2k:1;  /* IDAW 2/4kB block size control */
+   u32 lpm:8;  /* logical path mask */
+   u32 ils:1;  /* incorrect length */
+   u32 zero:6; /* reserved zeros */
+   u32 orbx:1; /* ORB extension control */
+   u32 cpa;/* channel program address */
+}  __packed __aligned(4);
+
+/*
+ * Transport-mode operation request block
+ */
+struct tm_orb {
+   u32 intparm;
+   u32 key:4;
+   u32:9;
+   u32 b:1;
+   u32:2;
+   u32 lpm:8;
+   u32:7;
+   u32 x:1;
+   u32 tcw;
+   u32 prio:8;
+   u32:8;
+   u32 rsvpgm:8;
+   u32:8;
+   u32:32;
+   u32:32;
+   u32:32;
+   u32:32;
+}  __packed __aligned(4);
+
+union orb {
+   struct cmd_orb cmd;
+   struct tm_orb tm;
+}  __packed __aligned(4);
+
+#endif /* S390_ORB_H */
diff --git a/arch/s390/include/asm/schib.h b/arch/s390/include/asm/schib.h
new file mode 100644
index 0000000..87d7403
--- /dev/null
+++ b/arch/s390/include/asm/schib.h
@@ -0,0 +1,52 @@
+#ifndef _ASM_S390_SCHIB_H_
+#define _ASM_S390_SCHIB_H_
+
+#include <asm/types.h>
+
+#include <asm/scsw.h>
+/*
+ * path management control word
+ */
+struct pmcw {
+   u32 intparm;/* interruption parameter */
+   u32 qf:1;   /* qdio facility */
+   u32 w:1;
+   u32 isc:3;  /* interruption subclass */
+   u32 res5:3; /* reserved zeros */
+   u32 ena:1;  /* enabled */
+   u32 lm:2;   /* limit mode */
+   u32 mme:2;  /* measurement-mode enable */
+   u32 mp:1;   /* multipath mode */
+   u32 tf:1;   /* timing facility */
+   u32 dnv:1;  /* device number valid */
+   u32 dev:16; /* device number */
+   u8  lpm;/* logical path mask */
+   u8  pnom;   /* path not operational mask */
+   u8  lpum;   /* last path used mask */
+   u8  pim;/* path installed mask */
+   u16 mbi;/* measurement-block index */
+   u8  pom;/* path operational mask */
+   u8  pam;/* path available mask */
+   u8  chpid[8];   /* CHPID 0-7 (if available) */
+   u32 unused1:8;  /* reserved zeros */
+   u32 st:3;   /* subchannel type */
+   u32 unused2:18; /* reserved zeros */
+   u32 mbfc:1; /* measurement block format control */
+   u32 xmwme:1;/* extended measurement word mode enable */
+   u32 csense:1;   /* concurrent sense; can be enabled ...*/
+   /*  ... per MSCH, however, if facility */
+   /*  ... is not installed, this results */
+   /*  ... in an operand exception.   */
+} __packed;
+
+/*
+ * subchannel information block
+ */
+struct schib {
+   

Error: Not supported image type twoGbMaxExtentFlat. - problems using virt-convert

2012-09-04 Thread Lentes, Bernd
Hi,

I want to convert a SLES 11 SP2 64-bit system (running on VMware Server 1.09) to
libvirt format. The host OS is SLES 11 SP2 64-bit.
I tried virt-convert --os-variant=sles11 sles_11_vmx/ sles_11_kvm/ .

This is what I got:
Generating output in 'virt-image' format to sles_11_kvm//
Converting disk 'tomcat_6.vmdk' to type raw...
ERROR    Couldn't convert disks: Disk conversion failed with exit status 1:
VMDK: Not supported image type twoGbMaxExtentFlat.
qemu-img: Could not open '/var/lib/kvm/images/sles_11_vmx/tomcat_6.vmdk':
Operation not supported
qemu-img: Could not open '/var/lib/kvm/images/sles_11_vmx/tomcat_6.vmdk'

It seems that virt-convert does not like the 2GB files from VMware Server.

How can I convert my system from VMware Server 1.09 to libvirt format?


Thanks for any hints.


Bernd

--
Bernd Lentes

System administration
Institut für Entwicklungsgenetik
Building 35.34 - Room 208
HelmholtzZentrum münchen
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
fax:   +49 89 3187 3826
http://www.helmholtz-muenchen.de/idg

We should not fear death, but rather
the bad life

Helmholtz Zentrum München
Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
Ingolstädter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Chair of the Supervisory Board: MinDir'in Bärbel Brumme-Bothe
Managing Directors: Prof. Dr. Günther Wess and Dr. Nikolaus Blum
Register court: Amtsgericht München HRB 6466
VAT ID No.: DE 129521671


Re: [PATCH v2] add acpi pmtimer support

2012-09-04 Thread Avi Kivity
On 09/02/2012 11:42 PM, Kevin O'Connor wrote:
> On Tue, Aug 14, 2012 at 07:29:19AM +0200, Gerd Hoffmann wrote:
>> This patch makes seabios use the acpi pmtimer instead of tsc for
>> timekeeping.  The pmtimer has a fixed frequency and doesn't need
>> calibration, thus it doesn't suffer from calibration errors due to a
>> loaded host machine.
>
> The patch looks okay to me, but is it still needed?  (I recall seeing
> something on the kvm list about a bug fix to the main timer.)

Timing will always be fragile in a vm, so I think this can make things
more robust.
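
To illustrate why a fixed-frequency timer needs no calibration, here is a minimal sketch of turning two raw PM timer reads into elapsed microseconds. It is not the SeaBIOS code: pmtimer_delta() and pmtimer_ticks_to_us() are hypothetical names. The only hardware facts assumed are that the ACPI PM timer runs at 3.579545 MHz and that its counter is 24 or 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

#define PMTIMER_HZ 3579545u   /* fixed ACPI PM timer frequency */

/* Ticks elapsed between two raw counter reads, allowing for one wrap of a
 * counter that is 'bits' wide (24 or 32). */
uint32_t pmtimer_delta(uint32_t start, uint32_t end, unsigned bits)
{
    assert(bits == 24 || bits == 32);
    uint32_t mask = (bits == 32) ? 0xffffffffu : ((1u << bits) - 1);
    return (end - start) & mask;
}

/* Convert ticks to microseconds; the divisor is a constant, so no
 * measurement against another clock is involved. */
uint64_t pmtimer_ticks_to_us(uint32_t ticks)
{
    return (uint64_t)ticks * 1000000u / PMTIMER_HZ;
}
```

A TSC-based delay would instead need a measured ticks-per-microsecond factor, and that measurement is the step that can go wrong on a loaded host.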


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH v3 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-04 Thread Avi Kivity
On 08/31/2012 12:56 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 31, 2012 at 11:36:07AM +0200, Sasha Levin wrote:
>> On 08/30/2012 03:38 PM, Michael S. Tsirkin wrote:
>>>> +static unsigned int indirect_alloc_thresh = 16;
>>> Why 16?  Please make it MAX_SG + 1; this makes some sense.
>>
>> Wouldn't MAX_SG mean we always allocate from the cache? Isn't the memory
>> waste too big in this case?
>
> Sorry. I really meant MAX_SKB_FRAGS + 1. MAX_SKB_FRAGS is 17 so gets us
> threshold of 18. It is less than the size of an skb+shinfo itself so -
> does it look too big to you? Also why do you think 16 is not too big but
> 18 is?  If there's a reason then I am fine with 16 too but then please
> put it in code comment near where the value is set.
>
> Yes this means virtio net always allocates from cache
> but this is a good thing, isn't it? Gets us more consistent
> performance.

kmalloc() also goes to a cache.  Is there a measurable difference?

Ugh, there's an ugly loop in __find_general_cachep(), which really wants
to be replaced with fls().

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH v3 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-04 Thread Avi Kivity
On 09/04/2012 07:34 PM, Avi Kivity wrote:
> On 08/31/2012 12:56 PM, Michael S. Tsirkin wrote:
>> On Fri, Aug 31, 2012 at 11:36:07AM +0200, Sasha Levin wrote:
>>> On 08/30/2012 03:38 PM, Michael S. Tsirkin wrote:
>>>>> +static unsigned int indirect_alloc_thresh = 16;
>>>> Why 16?  Please make it MAX_SG + 1; this makes some sense.
>>>
>>> Wouldn't MAX_SG mean we always allocate from the cache? Isn't the memory
>>> waste too big in this case?
>>
>> Sorry. I really meant MAX_SKB_FRAGS + 1. MAX_SKB_FRAGS is 17 so gets us
>> threshold of 18. It is less than the size of an skb+shinfo itself so -
>> does it look too big to you? Also why do you think 16 is not too big but
>> 18 is?  If there's a reason then I am fine with 16 too but then please
>> put it in code comment near where the value is set.
>>
>> Yes this means virtio net always allocates from cache
>> but this is a good thing, isn't it? Gets us more consistent
>> performance.
>
> kmalloc() also goes to a cache.  Is there a measurable difference?
>
> Ugh, there's an ugly loop in __find_general_cachep(), which really wants
> to be replaced with fls().
>

Actually, not, as the loop will be very short for small sizes.
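
The trade-off between walking a table of size classes and computing the class from the highest set bit can be sketched as follows. This is not the slab allocator code: the power-of-two classes from 32 to 4096 bytes, the names class_by_loop() and class_by_fls(), and the use of the GCC/Clang builtin __builtin_clzl() as a stand-in for the kernel's fls() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_SHIFT   5   /* smallest class: 32 bytes (hypothetical) */
#define NUM_CLASSES 8   /* 32, 64, ..., 4096 */

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0.
 * Relies on the GCC/Clang builtin __builtin_clzl(). */
static int fls_sketch(unsigned long x)
{
    return x ? (int)(sizeof(x) * 8) - __builtin_clzl(x) : 0;
}

/* Linear walk over the classes, in the spirit of the loop mentioned above. */
int class_by_loop(size_t size)
{
    int i = 0;
    size_t class_size = (size_t)1 << MIN_SHIFT;

    while (i < NUM_CLASSES - 1 && class_size < size) {
        class_size <<= 1;
        i++;
    }
    return i;   /* oversized requests are clamped to the last class */
}

/* Same answer computed directly from the highest set bit. */
int class_by_fls(size_t size)
{
    int idx;

    if (size <= ((size_t)1 << MIN_SHIFT))
        return 0;
    idx = fls_sketch((unsigned long)(size - 1)) - MIN_SHIFT;
    assert(idx > 0);
    return idx < NUM_CLASSES ? idx : NUM_CLASSES - 1;
}
```

With these made-up classes, a 288-byte request (18 descriptors of 16 bytes each) is found after four steps of the loop, which is consistent with the remark that the loop stays short for small sizes.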

-- 
error compiling committee.c: too many arguments to function


Re: Error: Not supported image type twoGbMaxExtentFlat. - problems using virt-convert

2012-09-04 Thread Brian Jackson
On Tuesday, September 04, 2012 11:26:49 AM Lentes, Bernd wrote:
> Hi,
>
> I want to convert a SLES 11 SP2 64-bit system (running on VMware Server
> 1.09) to libvirt format. The host OS is SLES 11 SP2 64-bit. I tried
> virt-convert --os-variant=sles11 sles_11_vmx/ sles_11_kvm/ .
>
> This is what I got:
> Generating output in 'virt-image' format to sles_11_kvm//
> Converting disk 'tomcat_6.vmdk' to type raw...
> ERROR    Couldn't convert disks: Disk conversion failed with exit status 1:
> VMDK: Not supported image type twoGbMaxExtentFlat. qemu-img: Could not
> open '/var/lib/kvm/images/sles_11_vmx/tomcat_6.vmdk': Operation not
> supported qemu-img: Could not open
> '/var/lib/kvm/images/sles_11_vmx/tomcat_6.vmdk'
>
> It seems that virt-convert does not like the 2GB files from VMware Server.
>
> How can I convert my system from VMware Server 1.09 to libvirt format?

I don't know what libvirt format is, but to get a raw file if it's Linux:
vmware-vdiskmanager -r source.vmdk -t 2 dest.raw

At least I think that should work... That might still be a vmdk file, but it 
should work with qemu-img. You might try -t 0 if you are short on space.


> Thanks for any hints.
>
> Bernd


Re: [PATCH v3 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-04 Thread Michael S. Tsirkin
On Tue, Sep 04, 2012 at 07:34:19PM +0300, Avi Kivity wrote:
> On 08/31/2012 12:56 PM, Michael S. Tsirkin wrote:
>> On Fri, Aug 31, 2012 at 11:36:07AM +0200, Sasha Levin wrote:
>>> On 08/30/2012 03:38 PM, Michael S. Tsirkin wrote:
>>>>> +static unsigned int indirect_alloc_thresh = 16;
>>>> Why 16?  Please make it MAX_SG + 1; this makes some sense.
>>>
>>> Wouldn't MAX_SG mean we always allocate from the cache? Isn't the memory
>>> waste too big in this case?
>>
>> Sorry. I really meant MAX_SKB_FRAGS + 1. MAX_SKB_FRAGS is 17 so gets us
>> threshold of 18. It is less than the size of an skb+shinfo itself so -
>> does it look too big to you? Also why do you think 16 is not too big but
>> 18 is?  If there's a reason then I am fine with 16 too but then please
>> put it in code comment near where the value is set.
>>
>> Yes this means virtio net always allocates from cache
>> but this is a good thing, isn't it? Gets us more consistent
>> performance.
>
> kmalloc() also goes to a cache.  Is there a measurable difference?

Yes see 0/2 and followup discussion.

> Ugh, there's an ugly loop in __find_general_cachep(), which really wants
> to be replaced with fls().
>
> --
> error compiling committee.c: too many arguments to function


Re: expanding virtual disk based on lvm

2012-09-04 Thread Ross Boylan
On Tue, 2012-09-04 at 15:53 +0300, Avi Kivity wrote:
> On 08/28/2012 11:26 PM, Ross Boylan wrote:
>> My vm launches with -hda /dev/turtle/VD0 -hdb /dev/turtle/VD1, where VD0
>> and VD1 are lvm logical volumes.  I used lvextend to expand them, but
>> the VM, started after the expansion, does not seem to see the extra
>> space.
>>
>> What do I need to do so that the space will be recognized?
>
> IDE (-hda) does not support rechecking the size.  Try booting with
> virtio-blk.  Additionally, you may need to request the guest to rescan
> the drive (no idea how to do that).  Nor am I sure whether qemu will
> emulate the request correctly.
>
Thank you for the suggestion.

I think the physical recognition of the new virtual disk size was
accomplished when I restarted the VM, without any other steps.  I've had
plenty of other problems, but I think at the VM level things are good.

I needed to manually resize the last partition with fdisk.  None of the
other tools (cfdisk, parted, gparted) would manipulate the partition
table, for reasons that became apparent.

The resized partitions were in an mdadm RAID1 array.  When I expanded
them it meant the raid superblock was no longer found (theory), and the
RAID could not be reassembled (fact).  I've attempted to fix that by
recreating the array, but mdadm is refusing to use the UUID I specify,
instead modifying it with the localhost name.  The virtual disks are for
a Debian lenny VM, but the only other spare VM around was squeeze, and
mdadm in squeeze does the localhost rewriting.

By the way, it's really great to have VMs as a testing area in which
I can discover these problems without trashing my real system.  Thanks
to everyone who made it possible.

Ross Boylan



Re: [Qemu-devel] [PATCH 4/4] kvm: i386: Add classic PCI device assignment

2012-09-04 Thread Blue Swirl
On Tue, Sep 4, 2012 at 8:32 AM, Avi Kivity <a...@redhat.com> wrote:
> On 09/03/2012 10:32 PM, Blue Swirl wrote:
>> On Mon, Sep 3, 2012 at 4:14 PM, Avi Kivity <a...@redhat.com> wrote:
>>> On 08/29/2012 11:27 AM, Markus Armbruster wrote:
>>>>
>>>> I don't see a point in making contributors avoid non-problems that might
>>>> conceivably become trivial problems some day.  Especially when there's
>>>> no automated help with the avoiding.
>>>
>>> -Wpointer-arith
>>
>> +1
>
> FWIW, I'm not in favour of enabling it, just pointing out that it
> exists.  In general I prefer avoiding unnecessary use of extensions, but
> in this case the extension is trivial and improves readability.

Void pointers are not so type safe as uint8_t pointers. There's also
little difference in readability between those in my opinion.
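
For readers following the -Wpointer-arith discussion, a minimal sketch of the two styles is shown below. Arithmetic directly on a void pointer is a GNU C extension that treats the pointee size as 1, and -Wpointer-arith warns about it; going through a uint8_t pointer is standard C. The helper names advance_bytes() and read_u8_at() are hypothetical and not taken from the patch under review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard C: step through a byte pointer, then hand back a void pointer. */
void *advance_bytes(void *base, size_t offset)
{
    assert(base != NULL);
    /* GNU extension form, flagged by -Wpointer-arith:
     *     return base + offset;
     */
    return (uint8_t *)base + offset;
}

/* Read one byte at a given offset from an untyped buffer. */
uint8_t read_u8_at(const void *base, size_t offset)
{
    assert(base != NULL);
    return ((const uint8_t *)base)[offset];
}
```

Both forms compute the same address under GCC, so the disagreement in the thread is about portability and type checking rather than behaviour.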



> --
> error compiling committee.c: too many arguments to function


Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support

2012-09-04 Thread Nicholas A. Bellinger
On Tue, 2012-09-04 at 08:46 +0200, Paolo Bonzini wrote:
> On 04/09/2012 04:21, Nicholas A. Bellinger wrote:
>>> @@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
>>>    struct virtio_scsi_cmd *cmd = buf;
>>>    struct scsi_cmnd *sc = cmd->sc;
>>>    struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
>>> +  struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
>>> +
>>> +  atomic_dec(&tgt->reqs);
>>>
>>
>> As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
>> atomic_inc_return(&tgt->reqs) check, it seems like using atomic_dec() w/o
>> smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
>> accessors properly, no..?
>
> No, only a single thing is being accessed, and there is no need to
> order the decrement with respect to preceding or subsequent accesses to
> other locations.
>
> In other words, tgt->reqs is already synchronized with itself, and that
> is enough.
>
> (Besides, on x86 smp_mb__after_atomic_dec is a nop).
>

So the implementation detail wrt to requests to the same target being
processed in FIFO ordering + only being able to change the queue when no
requests are pending helps understand this code more.  Thanks for the
explanation on that bit..

However, it's still my understanding that the use of atomic_dec() in the
completion path means that smp_mb__after_atomic_dec() is a requirement for
proper portable atomic.h code, no..?  Otherwise tgt->reqs should be
using something other than an atomic_t, right..?
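
The pattern under discussion can be sketched in userspace with C11 atomics: a per-target counter of requests in flight, with the queue chosen only when the counter goes from zero to one. This is an illustration, not the driver code. struct target, target_submit() and target_complete() are hypothetical names, the caller is assumed to serialise submissions (the patch does that with tgt_lock), and the sketch does not settle whether the kernel code needs an extra barrier.

```c
#include <assert.h>
#include <stdatomic.h>

struct target {
    atomic_int reqs;    /* requests currently in flight */
    unsigned   queue;   /* queue used while reqs > 0 */
};

/* Submission path. The increment uses the default sequentially consistent
 * ordering, loosely mirroring the fully ordered atomic_inc_return(). */
unsigned target_submit(struct target *t, unsigned cpu, unsigned num_queues)
{
    assert(num_queues > 0);
    /* fetch_add returns the old value, so 0 means "first request" */
    if (atomic_fetch_add(&t->reqs, 1) == 0)
        t->queue = cpu % num_queues;  /* same result as the subtract loop */
    return t->queue;
}

/* Completion path. A relaxed decrement mirrors atomic_dec(), which implies
 * no ordering against accesses to other memory locations. */
void target_complete(struct target *t)
{
    atomic_fetch_sub_explicit(&t->reqs, 1, memory_order_relaxed);
}
```

The relaxed decrement is still atomic with respect to the counter itself. What it gives up is any ordering against other memory locations, which is the property the two sides of the thread are weighing.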

>>> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
>>> +                                       struct scsi_cmnd *sc)
>>> +{
>>> +  struct virtio_scsi *vscsi = shost_priv(sh);
>>> +  struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
>>> +  unsigned long flags;
>>> +  u32 queue_num;
>>> +
>>> +  /* Using an atomic_t for tgt->reqs lets the virtqueue handler
>>> +   * decrement it without taking the spinlock.
>>> +   */
>>> +  spin_lock_irqsave(&tgt->tgt_lock, flags);
>>> +  if (atomic_inc_return(&tgt->reqs) == 1) {
>>> +          queue_num = smp_processor_id();
>>> +          while (unlikely(queue_num >= vscsi->num_queues))
>>> +                  queue_num -= vscsi->num_queues;
>>> +          tgt->req_vq = &vscsi->req_vqs[queue_num];
>>> +  }
>>> +  spin_unlock_irqrestore(&tgt->tgt_lock, flags);
>>> +  return virtscsi_queuecommand(vscsi, tgt, sc);
>>> +}
>>> +
>>
>> The extra memory barriers to get this right for the current approach are
>> just going to slow things down even more for virtio-scsi-mq..
>
> virtio-scsi multiqueue has a performance benefit up to 20% (for a single
> LUN) or 40% (on overall bandwidth across multiple LUNs).  I doubt that a
> single memory barrier can have that much impact. :)
>

I've no doubt that this series increases the large block high bandwidth
for virtio-scsi, but historically that has always been the easier
workload to scale.  ;)

 The way to go to improve performance even more is to add new virtio APIs
 for finer control of the usage of the ring.  These should let us avoid
 copying the sg list and almost get rid of the tgt_lock; even though the
 locking is quite efficient in virtio-scsi (see how tgt_lock and vq_lock
 are pipelined so as to overlap the preparation of two requests), it
 should give a nice improvement and especially avoid a kmalloc with small
 requests.  I may have some time for it next month.
 
  Jen's approach is what we will ultimately need to re-architect in SCSI
  core if we're ever going to move beyond the issues of legacy host_lock,
  so I'm wondering if maybe this is the direction that virtio-scsi-mq
  needs to go in as well..?
 
 We can see after the block layer multiqueue work goes in...  I also need
 to look more closely at Jens's changes.
 

Yes, I think Jens's new approach is providing some pretty significant
gains for raw block drivers with extremely high packet (small block
random I/O) workloads, esp with hw block drivers that support genuine mq
with hw num_queues > 1.

He also has virtio-blk converted to run in num_queues=1 mode.

 Have you measured the host_lock to be a bottleneck in high-iops
 benchmarks, even for a modern driver that does not hold it in
 queuecommand?  (Certainly it will become more important as the
 virtio-scsi queuecommand becomes thinner and thinner).

This is exactly why it would make such a good vehicle to re-architect
SCSI core.  I'm thinking it can be the first sw LLD we attempt to get
running on a (currently) future scsi-mq prototype.

   If so, we can
 start looking at limiting host_lock usage in the fast path.
 

That would be a good incremental step for SCSI core, but I'm not sure
that we'll be able to scale compared to blk-mq without a
new approach for sw/hw LLDs along the lines of what Jens is doing.

 BTW, supporting this in tcm-vhost should be quite trivial, as all the
 request queues are the same and all serialization is done in the
 virtio-scsi driver.
 

Looking forward to that too..  ;)


Re: [Qemu-devel] [PATCH 4/4] kvm: i386: Add classic PCI device assignment

2012-09-04 Thread Anthony Liguori
Andreas Färber afaer...@suse.de writes:

 Am 28.08.2012 14:57, schrieb Anthony Liguori:
 Andreas Färber afaer...@suse.de writes:
 
 Hi,

 Am 27.08.2012 08:28, schrieb Jan Kiszka:
 From: Jan Kiszka jan.kis...@siemens.com

 This adds PCI device assignment for i386 targets using the classic KVM
 interfaces. This version is 100% identical to what is being maintained
 in qemu-kvm for several years and is supported by libvirt as well. It is
 expected to remain relevant for another couple of years until kernels
 without full-features and performance-wise equivalent VFIO support are
 obsolete.

 A refactoring to-do that should be done in-tree is to model MSI and
 MSI-X support via the generic PCI layer, similar to what VFIO is already
 doing for MSI-X. This should improve the correctness and clean up the
 code from duplicate logic.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  hw/kvm/Makefile.objs |2 +-
  hw/kvm/pci-assign.c  | 1929 
 ++
  2 files changed, 1930 insertions(+), 1 deletions(-)
  create mode 100644 hw/kvm/pci-assign.c
 [...]
 diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
 new file mode 100644
 index 000..9cce02c
 --- /dev/null
 +++ b/hw/kvm/pci-assign.c
 @@ -0,0 +1,1929 @@
 +/*
 + * Copyright (c) 2007, Neocleus Corporation.
 + *
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms and conditions of the GNU General Public License,
 + * version 2, as published by the Free Software Foundation.

 The downside of accepting this into qemu.git is that it gets us a huge
 blob of GPLv2-only code without history of contributors for GPLv2+
 relicensing...
 
 That is 100% okay.

 Why? The way this is being submitted I don't see why we should treat
 Jan's patch any different from a patch by IBM or Samsung where we've
 asked folks to fix the license to comply with what I thought was our new
 policy (it does not even contain a from-x-on-GPLv2+ notice).

Asking is one thing.  Requiring is another.

I would prefer that people submitted GPLv2+, but I don't think it should
be a hard requirement.  It means, among other things, that we cannot
accept most code that originates from the Linux kernel.

Regards,

Anthony Liguori
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-04 Thread Hao, Xudong
 -Original Message-
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Monday, September 03, 2012 5:23 PM
 To: Hao, Xudong
 Cc: Roedel, Joerg; kvm@vger.kernel.org; Zhang, Xiantao
 Subject: Re: [PATCH] kvm/fpu: Enable fully eager restore kvm FPU
 
 On 08/23/2012 11:51 AM, Hao, Xudong wrote:
  -Original Message-
  From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
  Behalf Of Avi Kivity
  Sent: Monday, August 20, 2012 6:43 PM
  To: Roedel, Joerg
  Cc: Hao, Xudong; kvm@vger.kernel.org; Zhang, Xiantao
  Subject: Re: [PATCH] kvm/fpu: Enable fully eager restore kvm FPU
 
  On 08/20/2012 01:14 PM, Roedel, Joerg wrote:
   On Mon, Aug 20, 2012 at 01:08:14PM +0300, Avi Kivity wrote:
   On 08/20/2012 12:24 PM, Roedel, Joerg wrote:
  
   So it was broken all along?  Yikes.
  
   There is no LWP support in the kernel and thus KVM can't expose it to
   guests. So for now nothing should be broken, no?
 
  Oh, we mask out xcr0 bits not supported by the host.
 
  So it's broken in another way: it isn't exposed.  Pity, it's such a nice
  feature.
 
 
  Hi, Avi/Joerg
 
  What's the decision for it? I don't understand LWP, so how about this patch?
 
 It's fine (Joerg can send the LWP change), but there was a truncation
 issue that needs fixing, no?
 

Yes, I think you mean expanding KVM_XSTATE_LAZY to 64 bits; I'll send another
version of the patch.
 
Thanks,
-Xudong


[PATCH v2] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-04 Thread Xudong Hao
Enable KVM FPU fully eager restore, if there is other FPU state which isn't
tracked by CR0.TS bit.

Changes from v1:
Expand KVM_XSTATE_LAZY to 64 bits before negating it.

Signed-off-by: Xudong Hao xudong@intel.com
---
 arch/x86/include/asm/kvm.h |4 
 arch/x86/kvm/x86.c |   13 -
 2 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
index 521bf25..4c27056 100644
--- a/arch/x86/include/asm/kvm.h
+++ b/arch/x86/include/asm/kvm.h
@@ -8,6 +8,8 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <asm/user.h>
+#include <asm/xsave.h>
 
 /* Select x86 specific features in linux/kvm.h */
 #define __KVM_HAVE_PIT
@@ -30,6 +32,8 @@
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
 
+#define KVM_XSTATE_LAZY	(XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
+
 struct kvm_memory_alias {
__u32 slot;  /* this has a different namespace than memory slots */
__u32 flags;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 20f2266..a632042 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5969,7 +5969,18 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 	vcpu->guest_fpu_loaded = 0;
 	fpu_save_init(&vcpu->arch.guest_fpu);
 	++vcpu->stat.fpu_reload;
-   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
+   /*
+* Currently KVM trigger FPU restore by #NM (via CR0.TS),
+* till now only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked
+* by TS bit, there might be other FPU state is not tracked
+* by TS bit. Here it only make FPU deactivate request and do 
+* FPU lazy restore for these cases: 1)xsave isn't enabled 
+* in guest, 2)all guest FPU states can be tracked by TS bit.
+* For others, doing fully FPU eager restore.
+*/
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
+	    !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY)))
+		kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
trace_kvm_fpu(0);
 }
 
-- 
1.5.5
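
The truncation concern behind the v1 to v2 change can be reproduced in plain C. The sketch below is an illustration under an explicit assumption: the lazy mask is held in a 32-bit unsigned type (`LAZY_MASK_32` is a hypothetical name, not the kernel macro). In that case `~` is evaluated at 32 bits and the result is zero-extended, so every XCR0 bit above bit 31 is silently dropped from the test. With plain signed int constants the negated value would instead be sign-extended, so the exact behaviour depends on how the constants are typed; the explicit 64-bit cast removes that dependency.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit mask standing in for FP | SSE | YMM (bits 0..2). */
#define LAZY_MASK_32 ((uint32_t)0x7)

/* Negation happens at 32 bits; widening to 64 bits zero-extends, so the
 * mask becomes 0x00000000fffffff8 and high state bits are lost. */
static uint64_t extra_state_truncated(uint64_t xcr0)
{
    return xcr0 & ~LAZY_MASK_32;
}

/* Widen first, then negate: the mask is 0xfffffffffffffff8. */
static uint64_t extra_state_widened(uint64_t xcr0)
{
    return xcr0 & ~(uint64_t)LAZY_MASK_32;
}
```

For an XCR0 value with a high bit set (bit 62 is used in the checks purely as an example), only the widened form reports that non-lazy state is present.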



Re: [RFC 0/5] Making KVM_GET_ONE_REG/KVM_SET_ONE_REG generic.

2012-09-04 Thread Alexander Graf


On 04.09.2012, at 07:48, Avi Kivity a...@redhat.com wrote:

 On 09/03/2012 03:33 PM, Rusty Russell wrote:
 Avi Kivity a...@redhat.com writes:
 On 09/01/2012 03:35 PM, Rusty Russell wrote:
 Passing an address in a struct is pretty bad, since it involves
 compatibility wrappers.  
 
 Right, some s390 thing.
 
 Err, no, i386 on x86-64, or ppc32 on ppc64, or arm on arm64
 
 Any time you put a pointer in a structure which is exposed to userspace,
 you have to deal with this.
 
 Not if you pack the pointer in a __u64, which is what we do to preserve
 padding.  Then it is only s390 which needs extra love.

I doubt that anyone wants to run 31-bit user space on an s390x system. In fact, 
I wouldn't be surprised if exactly that case is broken already.

 
 I don't think that is what makes the API hard
 to use.
 
 What is it then?  I forgot what the original complaints/complainers were.
 
 I have no idea, since I didn't hear the complaints.  But any non-fixed
 size array has issues in C; there's not much we can do about it.
 
 x86 manages this fine for msrs, and I didn't have a problem using it for
 my test programs.  That's the limit of my experience, however.
 
 Another option is to use the size parameter from the ioctl.  It just
 sits there doing nothing.

It would require quite a bunch of changes throughout the stack. Even in user 
space, like strace...

Alex



Re: [RFC 0/5] Making KVM_GET_ONE_REG/KVM_SET_ONE_REG generic.

2012-09-04 Thread Alexander Graf

On 04.09.2012, at 09:31, Peter Maydell wrote:

 On 1 September 2012 13:28, Rusty Russell ru...@rustcorp.com.au wrote:
 Rusty Russell (8):
  KVM: ARM: Fix walk_msrs()
  KVM: Move KVM_SET_ONE_REG/KVM_GET_ONE_REG to generic code.
  KVM: Add KVM_REG_SIZE() helper.
  KVM: ARM: use KVM_SET_ONE_REG/KVM_GET_ONE_REG.
  KVM: Add KVM_VCPU_GET_REG_LIST.
  KVM: ARM: Use KVM_VCPU_GET_REG_LIST.
  KVM: ARM: Access all registers via KVM_GET_ONE_REG/KVM_SET_ONE_REG.
  KVM ARM: Update api.txt
 
 So I was thinking about this, and I remembered that the SET_ONE_REG/
 GET_ONE_REG API has userspace pass a pointer to the variable the
 kernel should read/write (unlike the _MSR x86 ioctls, where the
 actual data value is sent back and forth in the struct). Further,
 the kernel only writes a data value of the size of the register
 (rather than always reading/writing a uint64_t).
 
 This is a problem because it means userspace needs to know the
 size of each register, and the kernel doesn't provide any way
 to determine the size.

It does, as it's encoded in the register ID.

 This defeats the idea that userspace should
 be able to migrate kernel register state without having to know
 the semantics of all the registers involved.
 
 Possible solutions:
 * switch GET/SET_ONE_REG to just passing data, same as the MSR ioctls
 * switch GET/SET_ONE_REG to always writing 64 bits regardless of
   actual guest register width
 * make GET_REG_LIST return register width as well as index
 
 Personally I would really prefer the MSR-style pass the data.

Well, the reason I put dynamic sizes in there is that we already have very big 
register sizes on x86 (256 bits iirc), and so far chances are that it'll rather 
get bigger than smaller over time. So I would really like to keep the size 
encoding in the register id so that we can support big multimedia registers 
later on.
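
As an illustration of what "the size is encoded in the register ID" can look like, here is a small sketch. The field position and width are assumptions for the example (a 4-bit log2-of-bytes field at bit 52), and `REG_SIZE_SHIFT`, `REG_SIZE_MASK`, `reg_size_bytes()` and `make_reg_id()` are hypothetical names; the real definitions should be taken from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 52..55 of the ID hold log2(register size in bytes). */
#define REG_SIZE_SHIFT 52
#define REG_SIZE_MASK  0x00f0000000000000ULL

/* Decode the register size in bytes from a register ID. */
static unsigned int reg_size_bytes(uint64_t id)
{
    return 1u << ((id & REG_SIZE_MASK) >> REG_SIZE_SHIFT);
}

/* Build an ID from a log2 size and a register index (index must not
 * overlap the size field). */
static uint64_t make_reg_id(unsigned int log2_bytes, uint64_t index)
{
    return ((uint64_t)log2_bytes << REG_SIZE_SHIFT) | index;
}
```

With such an encoding userspace can size its buffer from the ID alone, and wider registers (32 bytes and up) fit without changing the ioctl structure.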

 Otherwise I'm going to end up constructing something like
 uint64_t actual_values[]
 struct kvm_one_reg regs[]
 
 where regs[x].addr = &actual_values[x] for all x. Which seems
 like unnecessary indirection really :-)
 
 I could live with always read/write 64 bits. I definitely don't
 want to have to deal with matching up register widths to accesses
 in userspace, please.

If I understood Rusty correctly, he wanted to do exactly that. Just make all 
the ARM registers be 64-bit wide, so that you can just keep them all as 
uint64_t in QEMU's env and then put env's pointers into the ONE_REG ioctl.
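
The arrangement described here, with every register kept as a uint64_t and a request array pointing at the values, might look roughly like the sketch below. `struct one_reg` is a hypothetical mirror of the two-field request discussed in the thread (an ID plus a userspace address carried as a 64-bit integer), and `bind_regs()` is an invented helper; no ioctl is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the request: the pointer travels as a 64-bit
 * integer so the layout is identical for 32-bit and 64-bit userspace. */
struct one_reg {
    uint64_t id;
    uint64_t addr;
};

/* Point each request at the matching slot of a uint64_t value array. */
static void bind_regs(struct one_reg *regs, const uint64_t *ids,
                      uint64_t *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        regs[i].id = ids[i];
        regs[i].addr = (uint64_t)(uintptr_t)&values[i];
    }
}
```

Because both fields are fixed-width, the structure is 16 bytes on every ABI, which is what removes the need for a compat wrapper.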


Alex



Re: KVM: MMU: Tracking guest writes through EPT entries ?

2012-09-04 Thread Hugo
On Mon, Sep 3, 2012 at 1:11 AM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 On 09/03/2012 10:09 AM, Hugo wrote:
 On Sun, Sep 2, 2012 at 8:29 AM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
 On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
 On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/31/2012 02:59 AM, Hugo wrote:
 On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
 On 08/28/2012 11:30 AM, Felix wrote:
 Xiao Guangrong xiaoguangrong at linux.vnet.ibm.com writes:


 On 07/31/2012 01:18 AM, Sunil wrote:
 Hello List,

 I am a KVM newbie and studying KVM mmu code.

 On the existing guest, I am trying to track all guest writes by
 marking page table entry as read-only in EPT entry [ I am using Intel
 machine with vmx and ept support ]. Looks like EPT support re-uses
 shadow page table(SPT) code and hence some of SPT routines.

 I was thinking of the approach below: use pte_list_walk() to
 traverse the list of sptes and use mmu_spte_update() to flip the
 PT_WRITABLE_MASK flag. But the SPTEs are not all part of a single list;
 they are on separate lists (based on gfn, page level, memory_slot). So,
 would recording all the faulted guest GFNs and then using the above
 method work?


 There are two ways to write-protect all sptes:
 - use kvm_mmu_slot_remove_write_access() on all memslots
 - walk the shadow page cache to get the shadow pages in the highest level
   (level = 4 on EPT), then write-protect its entries.

 If you just want to do it for the specified gfn, you can use
 rmap_write_protect().

 Just inquisitive, what is your purpose? :)



 Hi, Guangrong,

 I have done similar things to what Sunil did, simply for study purposes.
 However, I found some very weird situations. Basically, in the guest VM, I
 allocate a chunk of memory (the size of a page) in a user-level program.
 Through a guest kernel-level module and my self-defined hypercall, I pass
 the gva of this memory to kvm. Then I try different methods in the
 hypercall handler to write-protect this page of memory. You can see that I
 want to write-protect it through EPT instead of write-protecting it in the
 guest page tables.

 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa. Based on
 the function kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I changed the
 code to read sptep (the pointer to the spte) instead of the spte, so I can
 modify the spte corresponding to this gpa. What I observe is that if I
 modify spte[0] (I think this is the lowest-level page table entry in the
 EPT table; I can successfully modify it, as the changes are reflected in
 the result of calling kvm_mmu_get_spte_hierarchy again), my user-level
 program in the VM can still write to this page.

 In this post of yours, you mentioned "the shadow pages in the highest level
 (level = 4 on EPT)"; I don't understand this part. Does this mean I have to
 modify spte[3] instead of spte[0]? I just tried modifying spte[1] and
 spte[3]; both can cause a vmexit. So I am totally confused about the
 meaning of "level" as used in the shadow page table code and its relation
 to the shadow page table. Can you help me to understand this?

 2. As suggested by this post, I also use rmap_write_protect() to
 write-protect this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa,
 spte[4]), I can still see that spte[0] gives me a result like xx005, which
 means that the function is called successfully. But I can still write to
 this page.

 I even tried the function kvm_age_hva() to remove this spte; this gives me
 0 for spte[0], but I can still write to this page. So I am further confused
 about the level used in the shadow page.


 kvm_mmu_get_spte_hierarchy gets sptes outside of mmu-lock; you can hold
 spin_lock(&vcpu->kvm->mmu_lock) and use for_each_shadow_entry instead. And,
 after the change, did you flush all tlbs?

 I do apply the lock in my code and I do flush the tlb.


 If it can not work, please post your code.


 Here is my code. The modifications are made in x86/x86.c in

 KVM_HC_HL_EPTPER is my hypercall number.

 Method 1:

 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){


 case KVM_HC_HL_EPTPER :
  // This method is not working

 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
 if(localGpa == UNMAPPED_GVA){
 printk("read is not correct\n");
 return -KVM_ENOSYS;
 }

 hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
 hl_sptes);

 printk(after changes return result is %d , gpa: %llx
 sptes: %llx , %llx , %llx , %llx \n, hl_result, 

Re: [PATCHv2] virtio-spec: virtio network device multiqueue support

2012-09-04 Thread Jason Wang

On 09/03/2012 07:55 PM, Michael S. Tsirkin wrote:

At Jason's request, I am trying to help finalize the spec for
the new multiqueue feature.

Changes from Jason's rfc:
- reserved vq 3: this makes all rx vqs even and tx vqs odd, which
   looks nicer to me.
- documented packet steering, added a generalized steering programming
   command. Current modes are single queue and host driven multiqueue,
   but I envision support for guest driven multiqueue in the future.


For host driven, more thought in the long term. Maybe we could add more 
policy to choose the rxq such as hashing, round-robin and cpuid.

- make default vqs unused when in mq mode - this wastes some memory
   but makes it more efficient to switch between modes as
   we can avoid this causing packet reordering.


Not sure whether or not this can really help. Depending on the host 
scheduler, we may always see reordering when we do the switching.
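
To make the proposed numbering concrete, the following sketch maps a queue-pair number to virtqueue indices, following the layout given in the spec diff (2N+2 for receiveqN, 2N+3 for transmitqN, with vq 2 as the control queue and vq 3 reserved). Treating pair 0 as the default receiveq/transmitq at indices 0 and 1 is an assumption of this sketch, and `rx_vq()` and `tx_vq()` are hypothetical helper names.

```c
#include <assert.h>

/* Receive virtqueue index for a given pair; pair 0 is assumed to be the
 * default receiveq at index 0, pairs N >= 1 follow 2N + 2. */
static unsigned int rx_vq(unsigned int pair)
{
    return pair == 0 ? 0 : 2 * pair + 2;
}

/* Transmit virtqueue index for a given pair; pair 0 is assumed to be the
 * default transmitq at index 1, pairs N >= 1 follow 2N + 3. */
static unsigned int tx_vq(unsigned int pair)
{
    return pair == 0 ? 1 : 2 * pair + 3;
}
```

Reserving vq 3 is what keeps every receive queue on an even index and every transmit queue on an odd one.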

Rusty, could you please take a look and comment?
If this looks OK to everyone, we can proceed with finalizing the
implementation.  This patch is against
eb9fc84d0d3c46438aaab190e2401a9e5409a052 in virtio-spec git tree.

--

virtio-spec: virtio network device multiqueue support

Add multiqueue support to virtio network device.
Add a new feature flag VIRTIO_NET_F_MULTIQUEUE for this feature, a new
configuration field max_virtqueue_pairs to detect supported number of
virtqueues as well as a new command VIRTIO_NET_CTRL_STEERING to program
packet steering.

Signed-off-by: Michael S. Tsirkinm...@redhat.com

--

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index 7a073f4..583debc 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -58,6 +58,7 @@
  \html_be_strict false
  \author -608949062 Rusty Russell,,,
  \author 1531152142 Paolo Bonzini,,,
+\author 1986246365 Michael S. Tsirkin
  \end_header

  \begin_body
@@ -3896,6 +3897,37 @@ Only if VIRTIO_NET_F_CTRL_VQ set
  \end_inset


+\change_inserted 1986246365 1346663522
+ 3: reserved
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1986246365 1346663550
+4: receiveq1.
+ 5: transmitq1.
+ 6: receiveq2.
+ 7.
+ transmitq2.
+ ...
+ 2N+2:receivqN, 2N+3:transmitqN
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1346663558
+Only if VIRTIO_NET_F_CTRL_VQ set.
+ N is indicated by max_virtqueue_pairs field.
+\change_unchanged
+
+\end_layout
+
+\end_inset
+
+
+\change_unchanged
+
  \end_layout

  \begin_layout Description
@@ -4056,6 +4088,17 @@ VIRTIO_NET_F_CTRL_VLAN

  \begin_layout Description
  VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets.
+\change_inserted 1986246365 1346617842
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1986246365 1346618103
+VIRTIO_NET_F_MULTIQUEUE(22) Device has multiple receive and transmission
+ queues.
+\change_unchanged
+
  \end_layout

  \end_deeper
@@ -4068,11 +4111,45 @@ configuration
  \begin_inset space ~
  \end_inset

-layout Two configuration fields are currently defined.
+layout
+\change_deleted 1986246365 1346671560
+Two
+\change_inserted 1986246365 1346671647
+Six
+\change_unchanged
+ configuration fields are currently defined.
   The mac address field always exists (though is only valid if VIRTIO_NET_F_MAC
   is set), and the status field only exists if VIRTIO_NET_F_STATUS is set.
   Two read-only bits are currently defined for the status field: 
VIRTIO_NET_S_LIN
  K_UP and VIRTIO_NET_S_ANNOUNCE.
+
+\change_inserted 1986246365 1346672138
+ The following four read-only fields only exists if VIRTIO_NET_F_MULTIQUEUE
+ is set.
+ The max_virtqueue_pairs field specifies the maximum number of each of transmit
+ and receive virtqueues that can used for multiqueue operation.


s/can/can be/

+ The following read-only fields:
+\emph on
+current_steering_rule
+\emph default
+,
+\emph on
+reserved
+\emph default
+ and
+\emph on
+current_steering_param
+\emph default
+ store the last successful VIRTIO_NET_CTRL_STEERING
+\begin_inset CommandInset ref
+LatexCommand ref
+reference sub:Transmit-Packet-Steering
+
+\end_inset
+
+ command executed by driver, for debugging.
+
+\change_unchanged

  \begin_inset listings
  inline false
@@ -4105,6 +4182,40 @@ struct virtio_net_config {
  \begin_layout Plain Layout

  u16 status;
+\change_inserted 1986246365 1346671221
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1346671532
+
+u16 max_virtqueue_pairs;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1346671531
+
+u8 current_steering_rule;
+\change_unchanged
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1346671499
+
+u8 reserved;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1346671530
+
+u16 current_steering_param;
+\change_unchanged
+
  \end_layout

  \begin_layout Plain Layout
@@ -4151,6 +4262,18 @@ physical
  \begin_layout Enumerate
  If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify the control
   virtqueue.

Re: [PATCH v2] add acpi pmtimer support

2012-09-04 Thread Gerd Hoffmann
On 09/02/12 22:42, Kevin O'Connor wrote:
 On Tue, Aug 14, 2012 at 07:29:19AM +0200, Gerd Hoffmann wrote:
 This patch makes seabios use the acpi pmtimer instead of tsc for
 timekeeping.  The pmtimer has a fixed frequency and doesn't need
 calibration, thus it doesn't suffer from calibration errors due to a
 loaded host machine.
 
 The patch looks okay to me, but is it still needed?  (I recall seeing
 something on the kvm list about a bug fix to the main timer.)

It is still a good idea to make timing in a virtual machine more robust.

 +u32 pmtimer = inl(ioport);

 +return (u64)wraps << 24 | pmtimer;
 
 BTW, why is this << 24, and if it should be that way, shouldn't the
 pmtimer be inl(ioport) & 0xff ?

The pmtimer is defined to be 24 bits wide, so the shift is correct.
But, yes, the ioport read should better be masked to be on the safe
side.  v3 will go out in a minute.
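
The idea of extending a 24-bit counter by counting wraps can be tried outside the BIOS. This is a standalone sketch, not the SeaBIOS code: `struct pmtimer_state` and `pmtimer_extend()` are hypothetical names, the raw value is passed in rather than read from an I/O port, and it assumes the counter is sampled at least once per wrap period.

```c
#include <assert.h>
#include <stdint.h>

#define PMTIMER_BITS 24
#define PMTIMER_MASK ((1u << PMTIMER_BITS) - 1)

struct pmtimer_state {
    uint32_t wraps;   /* number of observed 24-bit overflows */
    uint32_t last;    /* previous masked reading */
};

/* Turn a raw port reading into a monotonically increasing 64-bit count. */
static uint64_t pmtimer_extend(struct pmtimer_state *st, uint32_t raw)
{
    uint32_t now = raw & PMTIMER_MASK;   /* ignore any bits above bit 23 */

    if (now < st->last)                  /* counter went backwards: it wrapped */
        st->wraps++;
    st->last = now;
    return ((uint64_t)st->wraps << PMTIMER_BITS) | now;
}
```

Masking before the comparison matters: stray upper bits in the raw reading would otherwise corrupt both the wrap detection and the combined value.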

cheers,
  Gerd


[PATCH v3] add acpi pmtimer support

2012-09-04 Thread Gerd Hoffmann
This patch makes seabios use the acpi pmtimer instead of tsc for
timekeeping.  The pmtimer has a fixed frequency and doesn't need
calibration, thus it doesn't suffer from calibration errors due to a
loaded host machine.

[ v3: mask port ioport read ]
[ v2: add CONFIG_PMTIMER ]

Signed-off-by: Gerd Hoffmann kra...@redhat.com
---
 src/Kconfig   |6 ++
 src/clock.c   |   31 +++
 src/pciinit.c |5 +
 src/util.h|1 +
 4 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/src/Kconfig b/src/Kconfig
index 6de3e71..b5dd63b 100644
--- a/src/Kconfig
+++ b/src/Kconfig
@@ -222,6 +222,12 @@ menu Hardware support
 default y
 help
 Initialize the Memory Type Range Registers (on emulators).
+config PMTIMER
+depends on !COREBOOT
+bool "Use ACPI timer"
+default y
+help
+Use the ACPI timer instead of the TSC for timekeeping (on qemu).
 endmenu
 
 menu BIOS interfaces
diff --git a/src/clock.c b/src/clock.c
index 69e9f17..b4abf37 100644
--- a/src/clock.c
+++ b/src/clock.c
@@ -129,11 +129,42 @@ emulate_tsc(void)
 return ret;
 }
 
+u16 pmtimer_ioport VAR16VISIBLE;
+u32 pmtimer_wraps VARLOW;
+u32 pmtimer_last VARLOW;
+
+void pmtimer_init(u16 ioport, u32 khz)
+{
+if (!CONFIG_PMTIMER)
+return;
+dprintf(1, "Using pmtimer, ioport 0x%x, freq %d kHz\n", ioport, khz);
+SET_GLOBAL(pmtimer_ioport, ioport);
+SET_GLOBAL(cpu_khz, khz);
+}
+
+static u64 pmtimer_get(void)
+{
+u16 ioport = GET_GLOBAL(pmtimer_ioport);
+u32 wraps = GET_LOW(pmtimer_wraps);
+u32 pmtimer = inl(ioport) & 0xffffff;
+
+if (pmtimer < GET_LOW(pmtimer_last)) {
+wraps++;
+SET_LOW(pmtimer_wraps, wraps);
+}
+SET_LOW(pmtimer_last, pmtimer);
+
+dprintf(9, "pmtimer: %u:%u\n", wraps, pmtimer);
+return (u64)wraps << 24 | pmtimer;
+}
+
 static u64
 get_tsc(void)
 {
 if (unlikely(GET_GLOBAL(no_tsc)))
 return emulate_tsc();
+if (CONFIG_PMTIMER && GET_GLOBAL(pmtimer_ioport))
+return pmtimer_get();
 return rdtscll();
 }
 
diff --git a/src/pciinit.c b/src/pciinit.c
index 68f302a..31115ee 100644
--- a/src/pciinit.c
+++ b/src/pciinit.c
@@ -180,6 +180,9 @@ static const struct pci_device_id pci_class_tbl[] = {
 PCI_DEVICE_END,
 };
 
+/* PM Timer ticks per second (HZ) */
+#define PM_TIMER_FREQUENCY  3579545
+
 /* PIIX4 Power Management device (for ACPI) */
 static void piix4_pm_init(struct pci_device *pci, void *arg)
 {
@@ -191,6 +194,8 @@ static void piix4_pm_init(struct pci_device *pci, void *arg)
 pci_config_writeb(bdf, 0x80, 0x01); /* enable PM io space */
 pci_config_writel(bdf, 0x90, PORT_SMB_BASE | 1);
 pci_config_writeb(bdf, 0xd2, 0x09); /* enable SMBus io space */
+
+pmtimer_init(PORT_ACPI_PM_BASE + 0x08, PM_TIMER_FREQUENCY / 1000);
 }
 
 static const struct pci_device_id pci_device_tbl[] = {
diff --git a/src/util.h b/src/util.h
index 062eea3..7723bb1 100644
--- a/src/util.h
+++ b/src/util.h
@@ -282,6 +282,7 @@ void lpt_setup(void);
 // clock.c
 #define PIT_TICK_RATE 1193180   // Underlying HZ of PIT
 #define PIT_TICK_INTERVAL 65536 // Default interval for 18.2Hz timer
+void pmtimer_init(u16 ioport, u32 khz);
 int check_tsc(u64 end);
 void timer_setup(void);
 void ndelay(u32 count);
-- 
1.7.1



Re: [PATCHv2] virtio-spec: virtio network device multiqueue support

2012-09-04 Thread Michael S. Tsirkin
On Wed, Sep 05, 2012 at 11:34:15AM +0800, Jason Wang wrote:
 On 09/03/2012 07:55 PM, Michael S. Tsirkin wrote:
 At Jason's request, I am trying to help finalize the spec for
 the new multiqueue feature.
 
 Changes from Jason's rfc:
 - reserved vq 3: this makes all rx vqs even and tx vqs odd, which
looks nicer to me.
 - documented packet steering, added a generalized steering programming
command. Current modes are single queue and host driven multiqueue,
but I envision support for guest driven multiqueue in the future.
 
 For host driven, more thought in the long term. Maybe we could add
 more policy to choose the rxq such as hashing, round-robin and
 cpuid.

As we discussed off-list, different guests may need wildly
different strategies. For example different queues for
different qos priorities might make a lot of sense.
So for now I'll remove the host-driven option and
add _GUEST (or maybe better name is _RX_FOLLOWS_TX)
rule which records the queue number on packet transmit and
uses that on receive.

 - make default vqs unused when in mq mode - this wastes some memory
but makes it more efficient to switch between modes as
we can avoid this causing packet reordering.
 
 Not sure whether or not this can really help. Depending on the host
 scheduler, we may always see reordering when we do the switching.

Since guest handles one queue at a time during switch,
won't this mean host reorders packets even with a single queue?

 Rusty, could you please take a look and comment?
 If this looks OK to everyone, we can proceed with finalizing the
 implementation.  This patch is against
 eb9fc84d0d3c46438aaab190e2401a9e5409a052 in virtio-spec git tree.
 
 --
 
 virtio-spec: virtio network device multiqueue support
 
 Add multiqueue support to virtio network device.
 Add a new feature flag VIRTIO_NET_F_MULTIQUEUE for this feature, a new
 configuration field max_virtqueue_pairs to detect supported number of
 virtqueues as well as a new command VIRTIO_NET_CTRL_STEERING to program
 packet steering.
 
 Signed-off-by: Michael S. Tsirkinm...@redhat.com
 
 --
 
 diff --git a/virtio-spec.lyx b/virtio-spec.lyx
 index 7a073f4..583debc 100644
 --- a/virtio-spec.lyx
 +++ b/virtio-spec.lyx
 @@ -58,6 +58,7 @@
   \html_be_strict false
   \author -608949062 Rusty Russell,,,
   \author 1531152142 Paolo Bonzini,,,
 +\author 1986246365 Michael S. Tsirkin
   \end_header
 
   \begin_body
 @@ -3896,6 +3897,37 @@ Only if VIRTIO_NET_F_CTRL_VQ set
   \end_inset
 
 
 +\change_inserted 1986246365 1346663522
 + 3: reserved
 +\end_layout
 +
 +\begin_layout Description
 +
 +\change_inserted 1986246365 1346663550
 +4: receiveq1.
 + 5: transmitq1.
 + 6: receiveq2.
 + 7.
 + transmitq2.
 + ...
 + 2N+2:receivqN, 2N+3:transmitqN
 +\begin_inset Foot
 +status open
 +
 +\begin_layout Plain Layout
 +
 +\change_inserted 1986246365 1346663558
 +Only if VIRTIO_NET_F_CTRL_VQ set.
 + N is indicated by max_virtqueue_pairs field.
 +\change_unchanged
 +
 +\end_layout
 +
 +\end_inset
 +
 +
 +\change_unchanged
 +
   \end_layout
 
   \begin_layout Description
 @@ -4056,6 +4088,17 @@ VIRTIO_NET_F_CTRL_VLAN
 
   \begin_layout Description
   VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets.
 +\change_inserted 1986246365 1346617842
 +
 +\end_layout
 +
 +\begin_layout Description
 +
 +\change_inserted 1986246365 1346618103
 +VIRTIO_NET_F_MULTIQUEUE(22) Device has multiple receive and transmission
 + queues.
 +\change_unchanged
 +
   \end_layout
 
   \end_deeper
 @@ -4068,11 +4111,45 @@ configuration
   \begin_inset space ~
   \end_inset
 
 -layout Two configuration fields are currently defined.
 +layout
 +\change_deleted 1986246365 1346671560
 +Two
 +\change_inserted 1986246365 1346671647
 +Six
 +\change_unchanged
 + configuration fields are currently defined.
The mac address field always exists (though is only valid if 
  VIRTIO_NET_F_MAC
is set), and the status field only exists if VIRTIO_NET_F_STATUS is set.
Two read-only bits are currently defined for the status field: 
  VIRTIO_NET_S_LIN
   K_UP and VIRTIO_NET_S_ANNOUNCE.
 +
 +\change_inserted 1986246365 1346672138
 + The following four read-only fields only exists if VIRTIO_NET_F_MULTIQUEUE
 + is set.
 + The max_virtqueue_pairs field specifies the maximum number of each of 
 transmit
 + and receive virtqueues that can used for multiqueue operation.
 
 s/can/can be/
 + The following read-only fields:
 +\emph on
 +current_steering_rule
 +\emph default
 +,
 +\emph on
 +reserved
 +\emph default
 + and
 +\emph on
 +current_steering_param
 +\emph default
 + store the last successful VIRTIO_NET_CTRL_STEERING
 +\begin_inset CommandInset ref
 +LatexCommand ref
 +reference sub:Transmit-Packet-Steering
 +
 +\end_inset
 +
 + command executed by driver, for debugging.
 +
 +\change_unchanged
 
   \begin_inset listings
   inline false
 @@ -4105,6 +4182,40 @@ struct virtio_net_config {
   \begin_layout Plain Layout
 
   u16 status;
 +\change_inserted 1986246365 1346671221
 +
 

Re: [PATCH 6/6] powerpc/booke64: restore VDSO information on critical exception

2012-09-04 Thread Benjamin Herrenschmidt
On Mon, 2012-08-06 at 16:27 +0300, Mihai Caraman wrote:
 Critical exception handler on 64-bit booke uses user-visible SPRG3 as scratch.
 Restore VDSO information in SPRG3 on exception prolog.

Breaks the build on !BOOKE because of :

 diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
 index b67db22..a0b0d08 100644
 --- a/arch/powerpc/kernel/vdso.c
 +++ b/arch/powerpc/kernel/vdso.c
 @@ -725,6 +725,8 @@ int __cpuinit vdso_getcpu_init(void)
   mtspr(SPRN_SPRG3, val);
  #ifdef CONFIG_KVM_BOOK3S_HANDLER
   get_paca()->kvm_hstate.sprg3 = val;
 +#elif CONFIG_PPC_BOOK3E


You can't #elif a CONFIG option.

 + get_paca()->sprg3 = val;
  #endif
  
   put_cpu();
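
A note on the #elif point, as a minimal sketch rather than the actual fix:
#elif takes an expression, just like #if. A symbol that is not defined at
all is evaluated as 0, and a build that combines -Wundef with -Werror will
reject that (this combination is my assumption about why !BOOKE fails).
Testing with defined() avoids the problem. The DEMO_CONFIG_* names below are
hypothetical stand-ins, not real Kconfig symbols.

```c
#include <assert.h>

/* Hypothetical stand-ins for Kconfig symbols: a CONFIG_* macro is either
 * defined (to 1) or not defined at all. */
#define DEMO_CONFIG_BOOK3E 1
/* DEMO_CONFIG_BOOK3S_HANDLER is deliberately left undefined. */

/* Reports which preprocessor branch was selected. */
static int sprg3_storage_kind(void)
{
#ifdef DEMO_CONFIG_BOOK3S_HANDLER
	return 1;	/* would store into kvm_hstate.sprg3 */
#elif defined(DEMO_CONFIG_BOOK3E)
	return 2;	/* would store into a paca-level sprg3 field */
#else
	return 0;	/* neither option set: nothing to store */
#endif
}
```

With only DEMO_CONFIG_BOOK3E defined, sprg3_storage_kind() selects the
second branch; with neither symbol defined it falls through to the #else
without any undefined-macro warning.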

Now, my suggestion is to actually move the bloody thing out of
kvm_hstate on server as well, just make it a common sprg3 field
across the board.

I'm dropping this one patch (the other ones seem fine so far and will
land in next soon unless I find another problem).

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5]KVM:Enable APIC-Register Virtualization and Virtual-interrupt delivery

2012-09-04 Thread Li, Jiongxi
The VMCS includes controls that enable the virtualization of interrupts and the 
Advanced Programmable Interrupt Controller (APIC).
When these controls are used, the processor will emulate many accesses to the 
APIC, track the state of the virtual APIC, and deliver virtual interrupts - all 
in VMX non-root operation without a VM exit.
You can refer to Chapter 29 of the latest SDM.

APICv support in KVM is split into 5 patches:
  0001-x86-apicv-add-APICv-register-virtualization-support.patch - enable APICv 
register virtualization
  0002-x86-apicv-adjust-for-virtual-interrupt-delivery.patch - add basic KVM 
framework for virtual interrupt delivery
  0003-x86-apicv-enable-virtual-interrupt-delivery-for-VMX.patch - enable APICv 
virtual interrupt delivery
  0004-x86-apicv-add-interface-for-poking-EOI-exit-bitmap.patch - EOI exit 
bitmap handling
  0005-x86-apicv-add-virtual-x2apic-support.patch - handle MSR style in virtual 
x2apic

Apply them in the order above.
APICv is disabled by default; use the command below to enable it:
modprobe kvm_intel enable_apicv_reg=1 enable_apicv_vid=1


[PATCH 1/5]KVM: x86, apicv: add APICv register virtualization support

2012-09-04 Thread Li, Jiongxi
- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like
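
To illustrate what "trap-like" means here, a minimal user-space sketch
(not the KVM code itself): by the time the exit is taken, the hardware has
already stored the written value in the virtual-APIC page, so the handler
only gets a page offset, reads the value back and replays it through the
normal write path. The demo_* names and struct are hypothetical; only the
0xff0 mask and the EOI offset (0xB0) are taken from the APIC register layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_APIC_EOI 0xB0u	/* EOI register offset within the APIC page */

/* Hypothetical model of a virtual APIC. */
struct demo_apic {
	uint8_t  vapic_page[4096];	/* hardware already stored the value here */
	uint32_t last_offset;		/* offset seen by the emulated write path */
	uint32_t last_value;		/* value seen by the emulated write path */
	int      writes;		/* number of emulated writes */
};

/* Read a 32-bit register back from the virtual-APIC page. */
static uint32_t demo_reg_read(const struct demo_apic *a, uint32_t offset)
{
	uint32_t val;

	memcpy(&val, a->vapic_page + offset, sizeof(val));
	return val;
}

/* Stand-in for the existing register-write emulation. */
static void demo_reg_write(struct demo_apic *a, uint32_t offset, uint32_t val)
{
	a->last_offset = offset;
	a->last_value = val;
	a->writes++;
}

/* Trap-like write: no instruction decode, only an offset is known. */
static void demo_write_nodecode(struct demo_apic *a, uint32_t offset)
{
	uint32_t val = 0;

	offset &= 0xff0;		/* registers are 16-byte aligned */
	if (offset != DEMO_APIC_EOI)	/* the EOI value is irrelevant */
		val = demo_reg_read(a, offset);
	demo_reg_write(a, offset, val);
}
```

The read-back followed by a second write is the redundancy that a later
optimization could remove by emulating only the side effect.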

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Jiongxi Li jiongxi...@intel.com
---
 arch/x86/include/asm/vmx.h |2 ++
 arch/x86/kvm/lapic.c   |   16 
 arch/x86/kvm/lapic.h   |2 ++
 arch/x86/kvm/vmx.c |   30 ++
 4 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 74fcb96..4a8193e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -59,6 +59,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID  0x0020
 #define SECONDARY_EXEC_WBINVD_EXITING  0x0040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST  0x0080
+#define SECONDARY_EXEC_APIC_REGISTER_VIRT   0x0100
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
 
@@ -282,6 +283,7 @@ enum vmcs_field {
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD 54
 #define EXIT_REASON_XSETBV 55
+#define EXIT_REASON_APIC_WRITE 56
 #define EXIT_REASON_INVPCID58
 
 /*
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ce87878..4a6d3a4 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1037,6 +1037,22 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+/* emulate APIC access in a trap manner */
+int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
+{
+   u32 val;
+
+   /* hw has done the conditional check and inst decode */
+   offset &= 0xff0;
+   if ((offset != APIC_EOI) &&
+       apic_reg_read(vcpu->arch.apic, offset, 4, &val))
+   return 1;
+
+   /* TODO: optimize to just emulate side effect w/o one more write */
+   return apic_reg_write(vcpu->arch.apic, offset, val);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
+
 void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 {
struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 4af5405..cd4875e 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -55,6 +55,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
+int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
 void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c00f03d..3d92277 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -83,6 +83,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_reg = 0;
+module_param(enable_apicv_reg, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -760,6 +763,12 @@ static inline bool 
cpu_has_vmx_virtualize_apic_accesses(void)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_apic_register_virt(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_APIC_REGISTER_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
return cpu_has_vmx_tpr_shadow() &&
@@ -2475,6 +2484,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_UNRESTRICTED_GUEST |
SECONDARY_EXEC_PAUSE_LOOP_EXITING |
SECONDARY_EXEC_RDTSCP |
+   SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_ENABLE_INVPCID;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
@@ -2486,6 +2496,11 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
_cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
 #endif
+
+   if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
+   _cpu_based_2nd_exec_control &= ~(
+   SECONDARY_EXEC_APIC_REGISTER_VIRT);
+
if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
   enabled */
@@ -2683,6 +2698,9 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_ple())
ple_gap = 0;
 
+   if (!cpu_has_vmx_apic_register_virt())
+   enable_apicv_reg = 0;
+
if (nested)
nested_vmx_setup_ctls_msrs();
 
@@ -3812,6 

[PATCH 2/5]KVM:x86, apicv: adjust for virtual interrupt delivery

2012-09-04 Thread Li, Jiongxi
Virtual interrupt delivery removes the need for KVM to inject vAPIC
interrupts manually; injection is fully taken care of by the hardware.
This requires some special awareness in the existing interrupt injection path:

  - For a pending interrupt, instead of injecting it directly, we may need
to update architecture-specific indicators before resuming the guest.

  - A pending interrupt that is masked by ISR should also be
considered in the above update, since the hardware will decide
when to inject it at the right time. The current has_interrupt and
get_interrupt only return a valid vector from the injection p.o.v.

Three new interfaces are introduced accordingly:
kvm_apic_get_highest_irr
kvm_cpu_has_interrupt_apic_vid
kvm_cpu_get_interrupt_apic_vid
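
For readers unfamiliar with what "highest IRR" means, a simplified sketch:
the IRR holds one bit per vector (256 vectors), and the highest pending
vector is simply the most significant set bit, regardless of whether ISR
currently masks it. The layout below (eight packed 32-bit words) and the
demo_ name are assumptions for illustration; the real APIC page spaces its
registers 0x10 bytes apart.

```c
#include <assert.h>
#include <stdint.h>

/* 256 vectors kept as eight 32-bit words: vector v lives in bit (v % 32)
 * of word (v / 32). Returns the highest pending vector, or -1 if none. */
static int demo_find_highest_vector(const uint32_t irr[8])
{
	for (int word = 7; word >= 0; word--) {
		if (irr[word] == 0)
			continue;
		for (int bit = 31; bit >= 0; bit--) {
			if (irr[word] & (UINT32_C(1) << bit))
				return word * 32 + bit;
		}
	}
	return -1;
}
```

An empty IRR yields -1, matching the "no interrupt" convention used by the
interfaces above.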

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Jiongxi Li jiongxi...@intel.com
---
 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/irq.c  |   44 +++
 arch/x86/kvm/lapic.c|   13 +++
 arch/x86/kvm/lapic.h|   10 
 arch/x86/kvm/svm.c  |6 +
 arch/x86/kvm/vmx.c  |6 +
 arch/x86/kvm/x86.c  |   22 +-
 7 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..ef74df5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -669,6 +669,8 @@ struct kvm_x86_ops {
void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
void (*enable_irq_window)(struct kvm_vcpu *vcpu);
void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+   int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
+   void (*update_irq)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 7e06ba1..abd3831 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -60,6 +60,29 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
 
 /*
+ * check if there is pending interrupt without
+ * intack. This _apic_vid version is used when hardware
+ * supports APIC virtualization with virtual interrupt
+ * delivery support. In such case, KVM is not required
+ * to poll pending APIC interrupt, and thus this
+ * interface is used to poll pending interrupts from
+ * non-APIC source.
+ */
+int kvm_cpu_has_interrupt_apic_vid(struct kvm_vcpu *v)
+{
+   struct kvm_pic *s;
+
+   if (!irqchip_in_kernel(v->kvm))
+   return v->arch.interrupt.pending;
+
+   if (kvm_apic_accept_pic_intr(v)) {
+   s = pic_irqchip(v->kvm);/* PIC */
+   return s->output;
+   } else
+   return 0;
+}
+
+/*
  * Read pending interrupt vector and intack.
  */
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
@@ -82,6 +105,27 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
 
+/*
+ * Read pending interrupt vector and intack.
+ * Similar to kvm_cpu_has_interrupt_apic_vid, to get
+ * interrupts from non-APIC sources.
+ */
+int kvm_cpu_get_interrupt_apic_vid(struct kvm_vcpu *v)
+{
+   struct kvm_pic *s;
+   int vector = -1;
+
+   if (!irqchip_in_kernel(v->kvm))
+   return v->arch.interrupt.nr;
+
+   if (kvm_apic_accept_pic_intr(v)) {
+   s = pic_irqchip(v->kvm);
+   s->output = 0;  /* PIC */
+   vector = kvm_pic_read_irq(v->kvm);
+   }
+   return vector;
+}
+
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
 {
kvm_inject_apic_timer_irqs(vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 4a6d3a4..c47f3d3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1310,6 +1310,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
kvm_lapic_reset(vcpu);
kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
 
+   if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
+   apic->vid_enabled = true;
return 0;
 nomem_free_apic:
kfree(apic);
@@ -1333,6 +1335,17 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
return highest_irr;
 }
 
+int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu)
+{
+struct kvm_lapic *apic = vcpu->arch.apic;
+
+if (!apic || !apic_enabled(apic))
+return -1;
+
+return apic_find_highest_irr(apic);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
+
 int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
 {
u32 lvt0 = apic_get_reg(vcpu->arch.apic, APIC_LVT0);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index cd4875e..4e3b435 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -13,6 +13,7 @@ struct kvm_lapic {
u32 divide_count;

[PATCH 3/5]KVM:x86, apicv: enable virtual interrupt delivery for VMX

2012-09-04 Thread Li, Jiongxi
- Before returning to the guest, RVI should be updated if there are any
  pending IRRs.
- The EOI exit bitmap controls whether an EOI write should cause a VM-Exit.
  If a vector's bit is set, a trap-like EOI-induced VM-Exit is triggered.
  Keep all the bitmaps cleared for now, which should be enough to allow
  MSI-based device passthrough.
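
As a sketch of the RVI update mentioned above: the 16-bit guest interrupt
status field keeps RVI (requesting virtual interrupt) in its low byte and
SVI (servicing virtual interrupt) in its high byte, so updating RVI means
replacing only the low byte. The helper name is hypothetical and the VMCS
access is left out; this shows just the bit manipulation.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the new guest interrupt status with RVI set to the highest
 * pending vector. A negative vector means nothing is pending, in which
 * case the status is returned unchanged. */
static uint16_t demo_update_rvi(uint16_t status, int vector)
{
	if (vector < 0)
		return status;
	/* keep SVI (high byte), replace RVI (low byte) */
	return (uint16_t)((status & 0xff00u) | (uint8_t)vector);
}
```

A caller would write the result back to the VMCS only when it differs from
the old status.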

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Jiongxi Li jiongxi...@intel.com
---
 arch/x86/include/asm/vmx.h |   11 
 arch/x86/kvm/lapic.c   |   22 +++-
 arch/x86/kvm/lapic.h   |1 +
 arch/x86/kvm/vmx.c |   62 ++-
 4 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 4a8193e..b1eca96 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -60,6 +60,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING  0x0040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST  0x0080
 #define SECONDARY_EXEC_APIC_REGISTER_VIRT   0x0100
+#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY0x0200
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
 
@@ -97,6 +98,7 @@ enum vmcs_field {
GUEST_GS_SELECTOR   = 0x080a,
GUEST_LDTR_SELECTOR = 0x080c,
GUEST_TR_SELECTOR   = 0x080e,
+   GUEST_INTR_STATUS   = 0x0810,
HOST_ES_SELECTOR= 0x0c00,
HOST_CS_SELECTOR= 0x0c02,
HOST_SS_SELECTOR= 0x0c04,
@@ -124,6 +126,14 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
+   EOI_EXIT_BITMAP0= 0x201c,
+   EOI_EXIT_BITMAP0_HIGH   = 0x201d,
+   EOI_EXIT_BITMAP1= 0x201e,
+   EOI_EXIT_BITMAP1_HIGH   = 0x201f,
+   EOI_EXIT_BITMAP2= 0x2020,
+   EOI_EXIT_BITMAP2_HIGH   = 0x2021,
+   EOI_EXIT_BITMAP3= 0x2022,
+   EOI_EXIT_BITMAP3_HIGH   = 0x2023,
GUEST_PHYSICAL_ADDRESS  = 0x2400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x2401,
VMCS_LINK_POINTER   = 0x2800,
@@ -279,6 +289,7 @@ enum vmcs_field {
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS 44
+#define EXIT_REASON_EOI_INDUCED 45
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD 54
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c47f3d3..d203501 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -604,7 +604,27 @@ static int apic_set_eoi(struct kvm_lapic *apic)
return vector;
 }
 
-static void apic_send_ipi(struct kvm_lapic *apic)
+/*
+ * this interface assumes a trap-like exit, which has already finished
+ * desired side effect including vISR and vPPR update.
+ */
+void kvm_apic_set_eoi(struct kvm_vcpu *vcpu, int vector)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+   int trigger_mode;
+
+   if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
+   trigger_mode = IOAPIC_LEVEL_TRIG;
+   else
+   trigger_mode = IOAPIC_EDGE_TRIG;
+
+   if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
+   kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
+   kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_set_eoi);
+
+ static void apic_send_ipi(struct kvm_lapic *apic)
 {
u32 icr_low = apic_get_reg(apic, APIC_ICR);
u32 icr_high = apic_get_reg(apic, APIC_ICR2);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 4e3b435..585337f 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -60,6 +60,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
 int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+void kvm_apic_set_eoi(struct kvm_vcpu *vcpu, int vector);
 
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4a26d04..424a09d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -86,6 +86,9 @@ module_param(fasteoi, bool, S_IRUGO);
 static bool __read_mostly enable_apicv_reg = 0;
 module_param(enable_apicv_reg, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_vid = 0;
+module_param(enable_apicv_vid, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ 

[PATCH 4/5]KVM:x86, apicv: add interface for poking EOI exit bitmap

2012-09-04 Thread Li, Jiongxi
With the APICv virtual interrupt delivery feature, an EOI write from
non-root mode doesn't cause a VM-Exit unless the vector's bit is set in the
EOI exit bitmap VMCS field. Basically there are two methods to manipulate
the EOI exit bitmap:

[Option 1]
Ideally only a level-triggered irq requires a hook in the vLAPIC EOI write,
so that the vIOAPIC EOI is triggered and emulated. So the simplest
approach is to manipulate the EOI exit bitmap when the vLAPIC acks a new
interrupt, based on the value of TMR. There are several corner cases
worth noting though:

  - KVM has specific notifier hooks on the vIOAPIC EOI path. So far two
sources use them: INT-based device passthrough and PIT pending
timers. The former is virtually wired to the vIOAPIC, and
thus TMR already covers it. PIT is special here, as it is an
edge-triggered source. But since other timer sources like the
vLAPIC timer don't require this notifier hook, possibly PIT
can be relaxed in the future too.

  - A posted interrupt will update TMR directly, w/o a chance for KVM
to update the EOI exit bitmap accordingly. This becomes a gap.

[Option 2]
Indicate the EOI exit bitmap requirement ('need_eoi') directly from
every interrupt source device, and then check this requirement
when the vLAPIC acks a new pending interrupt. This requires more
intrusive changes to the current vLAPIC/vIOAPIC logic, so that the
irq_source_id indicating the source of the interrupt is passed through
from the origination point to the vLAPIC ack point. For natural requirements
like vIOAPIC level-triggered entries, it can be implicitly deduced.
On the other hand, for non-natural requirements like the aforementioned
PIT or posted interrupt, this approach can handle them efficiently.

For simplicity, option 1 is used for now, which should be
enough to test MSI-based device passthrough.
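
A rough sketch of the bookkeeping option 1 implies, under my own
assumptions about the details: the 256-bit EOI exit bitmap is held as four
64-bit words, each vector maps to one bit, and a small "changed" mask
records which words must be rewritten to the VMCS before the next guest
entry. The struct and function names are hypothetical and only mirror the
field names visible in the diff below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu state: four 64-bit bitmap words plus a mask of
 * words that changed since they were last written to the VMCS. */
struct demo_vcpu {
	uint32_t eoi_exitmap_changed;
	uint64_t eoi_exit_bitmap[4];
};

/* Set or clear the EOI-exit bit for a vector; mark the word dirty only
 * if its value actually changed. */
static void demo_set_eoi_exitmap(struct demo_vcpu *v, int vector, int need_eoi)
{
	int word = vector / 64;
	uint64_t mask = UINT64_C(1) << (vector % 64);
	uint64_t old = v->eoi_exit_bitmap[word];

	if (need_eoi)
		v->eoi_exit_bitmap[word] |= mask;
	else
		v->eoi_exit_bitmap[word] &= ~mask;

	if (v->eoi_exit_bitmap[word] != old)
		v->eoi_exitmap_changed |= UINT32_C(1) << word;
}
```

Tracking dirtiness per word keeps the number of VMCS writes small when
most interrupts do not change the trigger mode of their vector.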

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Jiongxi Li jiongxi...@intel.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/lapic.c|7 ++-
 arch/x86/kvm/vmx.c  |   37 +
 3 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ef74df5..4e06a82 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -671,6 +671,7 @@ struct kvm_x86_ops {
void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
void (*update_irq)(struct kvm_vcpu *vcpu);
+   void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, int 
need_eoi);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d203501..4058384 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -499,8 +499,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int 
delivery_mode,
if (trig_mode) {
apic_debug("level trig mode for vector %d", vector);
apic_set_vector(vector, apic->regs + APIC_TMR);
-   } else
+   if (kvm_apic_vid_enabled(vcpu))
+   kvm_x86_ops->set_eoi_exitmap(vcpu, vector, 1);
+   } else {
apic_clear_vector(vector, apic->regs + APIC_TMR);
+   if (kvm_apic_vid_enabled(vcpu))
+   kvm_x86_ops->set_eoi_exitmap(vcpu, vector, 0);
+   }
 
result = !apic_test_and_set_irr(vector, apic);
trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 424a09d..73ff537 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -433,6 +433,7 @@ struct vcpu_vmx {
 
bool rdtscp_enabled;
 
+   u32 eoi_exitmap_changed;
u64 eoi_exit_bitmap[4];
 
/* Support for a guest hypervisor (nested VMX) */
@@ -6128,6 +6129,7 @@ static void vmx_update_irq(struct kvm_vcpu *vcpu)
u16 status;
u8 old;
int vector;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
 
vector = kvm_apic_get_highest_irr(vcpu);
if (vector == -1)
@@ -6140,6 +6142,40 @@ static void vmx_update_irq(struct kvm_vcpu *vcpu)
status |= (u8)vector;
vmcs_write16(GUEST_INTR_STATUS, status);
}
+
+   if (vmx->eoi_exitmap_changed) {
+#define UPDATE_EOI_EXITMAP(v, e) { \
+   if (test_and_clear_bit(e, (void *)&(v)->eoi_exitmap_changed))   \
+   vmcs_write64(EOI_EXIT_BITMAP##e, (v)->eoi_exit_bitmap[e]);}
+
+   UPDATE_EOI_EXITMAP(vmx, 0);
+   UPDATE_EOI_EXITMAP(vmx, 1);
+   UPDATE_EOI_EXITMAP(vmx, 2);
+   UPDATE_EOI_EXITMAP(vmx, 3);
+   }
+}
+
+static void vmx_set_eoi_exitmap(struct kvm_vcpu *vcpu,
+  

[PATCH 5/5]KVM:x86, apicv: add virtual x2apic support

2012-09-04 Thread Li, Jiongxi
Basically, to benefit from APICv, we need to clear the MSR bitmap bits for
the corresponding x2apic MSRs:
0x800 - 0x8ff: no read intercept for APICv register virtualization
TPR, EOI, SELF-IPI: no write intercept for virtual interrupt delivery
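
For orientation, a user-space sketch of the MSR bitmap layout this relies
on: one 4K page split into four 1K regions (read-low at 0x000, read-high at
0x400, write-low at 0x800, write-high at 0xc00), one bit per MSR, where a
cleared bit means "do not intercept". The demo_ helpers are hypothetical
and use a plain byte array instead of the kernel's bit operations.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MSR_TYPE_R 1
#define DEMO_MSR_TYPE_W 2

/* Clear one bit in a byte-addressed bitmap region. */
static void demo_clear_bit(uint8_t *base, uint32_t bit)
{
	base[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}

/* Stop intercepting reads and/or writes of one MSR. */
static void demo_disable_intercept(uint8_t bitmap[4096], uint32_t msr, int type)
{
	if (msr <= 0x1fff) {
		if (type & DEMO_MSR_TYPE_R)
			demo_clear_bit(bitmap + 0x000, msr);	/* read-low */
		if (type & DEMO_MSR_TYPE_W)
			demo_clear_bit(bitmap + 0x800, msr);	/* write-low */
	} else if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
		msr &= 0x1fff;
		if (type & DEMO_MSR_TYPE_R)
			demo_clear_bit(bitmap + 0x400, msr);	/* read-high */
		if (type & DEMO_MSR_TYPE_W)
			demo_clear_bit(bitmap + 0xc00, msr);	/* write-high */
	}
}
```

Clearing only the write bit for the EOI MSR (0x80b) leaves its read
intercept in place, which is why the patch splits the helper by access type.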

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Jiongxi Li jiongxi...@intel.com
---
 arch/x86/kvm/vmx.c |   46 +++---
 1 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 73ff537..2db1ddc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3724,7 +3724,9 @@ static void free_vpid(struct vcpu_vmx *vmx)
spin_unlock(&vmx_vpid_lock);
 }
 
-static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
+#define MSR_TYPE_R 1
+#define MSR_TYPE_W 2
+static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 
msr, int type)
 {
int f = sizeof(unsigned long);
 
@@ -3737,20 +3739,38 @@ static void __vmx_disable_intercept_for_msr(unsigned 
long *msr_bitmap, u32 msr)
 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
 */
if (msr <= 0x1fff) {
-   __clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
-   __clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
+   if (type & MSR_TYPE_R)
+   __clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
+   if (type & MSR_TYPE_W)
+   __clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low 
*/
} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
msr &= 0x1fff;
-   __clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
-   __clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
+   if (type & MSR_TYPE_R)
+   __clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high 
*/
+   if (type & MSR_TYPE_W)
+   __clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high 
*/
}
 }
 
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
if (!longmode_only)
-   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
-   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
+   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr, 
MSR_TYPE_R | MSR_TYPE_W);
+   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr, 
MSR_TYPE_R | MSR_TYPE_W);
+}
+
+static void vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only)
+{
+   if (!longmode_only)
+   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr, 
MSR_TYPE_R);
+   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr, 
MSR_TYPE_R);
+}
+
+static void vmx_disable_intercept_for_msr_write(u32 msr, bool longmode_only)
+{
+   if (!longmode_only)
+   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr, 
MSR_TYPE_W);
+   __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr, 
MSR_TYPE_W);
 }
 
 /*
@@ -7524,6 +7544,18 @@ static int __init vmx_init(void)
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
 
+   if (enable_apicv_reg) {
+   int msr;
+   for (msr = 0x800; msr <= 0x8ff; msr++)
+   vmx_disable_intercept_for_msr_read(msr, false);
+   }
+
+   if (enable_apicv_vid) {
+   vmx_disable_intercept_for_msr_write(0x808, false); // TPR
+   vmx_disable_intercept_for_msr_write(0x80b, false); // EOI
+   vmx_disable_intercept_for_msr_write(0x83f, false); // SELF-IPI
+   }
+
if (enable_ept) {
kvm_mmu_set_mask_ptes(0ull,
(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
-- 
1.7.1



[PATCH] powerpc-kvm: fixing page alignment for TCE

2012-09-04 Thread Alexey Kardashevskiy
From: Paul Mackerras pau...@samba.org

TODO: ask Paul to make a proper message.

This is the fix for a host kernel compiled with a page size
other than 4K (TCE page size). In the case of a 64K page size,
the host used to lose address bits in hpte_rpn().
The patch fixes it.
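
To make the lost bits concrete, a small sketch with made-up numbers. It
assumes 64K host pages (PAGE_SHIFT of 16), a 4K HPTE page size, and that
the old path derived the guest physical address from a gfn that had already
been shifted down by PAGE_SHIFT; that last point is my reading of the
removed lines. DEMO_RPN_MASK is a stand-in for HPTE_R_RPN, and the function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 16				/* 64K host pages (assumed) */
#define DEMO_RPN_MASK   UINT64_C(0x0ffffffffffff000)	/* stand-in for HPTE_R_RPN */

/* Old order: compute gfn first, then rebuild gpa from it. Address bits
 * between the HPTE page size and the host page size are dropped. */
static uint64_t demo_gpa_old(uint64_t r, uint64_t ea, uint64_t psize)
{
	uint64_t gfn = (r & DEMO_RPN_MASK & ~(psize - 1)) >> DEMO_PAGE_SHIFT;

	return (gfn << DEMO_PAGE_SHIFT) | (ea & (psize - 1));
}

/* New order: compute the full gpa first; gfn is derived from it later. */
static uint64_t demo_gpa_new(uint64_t r, uint64_t ea, uint64_t psize)
{
	return (r & DEMO_RPN_MASK & ~(psize - 1)) | (ea & (psize - 1));
}
```

With r = 0x12345000, ea = 0xdeadbabc and psize = 0x1000, the old order
gives 0x12340abc (bits 12-15 lost) while the new order gives 0x12345abc;
both still map to gfn 0x1234.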

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 80a5775..a41f11b 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -503,7 +503,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
struct kvm *kvm = vcpu->kvm;
unsigned long *hptep, hpte[3], r;
unsigned long mmu_seq, psize, pte_size;
-   unsigned long gfn, hva, pfn;
+   unsigned long gpa, gfn, hva, pfn;
struct kvm_memory_slot *memslot;
unsigned long *rmap;
struct revmap_entry *rev;
@@ -541,15 +541,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 
/* Translate the logical address and get the page */
psize = hpte_page_size(hpte[0], r);
-   gfn = hpte_rpn(r, psize);
+   gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1));
+   gfn = gpa >> PAGE_SHIFT;
memslot = gfn_to_memslot(kvm, gfn);
 
/* No memslot means it's an emulated MMIO region */
-   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
-   unsigned long gpa = (gfn << PAGE_SHIFT) | (ea & (psize - 1));
+   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea,
  dsisr & DSISR_ISSTORE);
-   }
 
if (!kvm->arch.using_mmu_notifiers)
return -EFAULT; /* should never get here */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

