[dpdk-dev] Vhost user: Increase robustness by kicking guest at ring full

2016-08-25 Thread Patrik Andersson R
Hi,

during troubleshooting sessions (OVS 2.4.1, DPDK 2.2.0) it was noticed
that some guests trigger the SET_VRING_CALL message rather frequently. This
can range from a few times per minute up to 10 times per second.

 From DPDK log:
...
2016-08-01T19:58:39.829222+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL, 1
2016-08-01T19:58:39.829232+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: vring call idx:0 file:251
2016-08-01T19:58:39.829246+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL, 1
2016-08-01T19:58:39.829250+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: vring call idx:0 file:215
2016-08-01T19:58:40.778491+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL, 1
2016-08-01T19:58:40.778501+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: vring call idx:0 file:251
2016-08-01T19:58:40.778517+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL, 1
2016-08-01T19:58:40.778521+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: vring call idx:0 file:215
2016-08-01T19:58:41.813467+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL, 1
2016-08-01T19:58:41.813479+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: vring call idx:0 file:251
2016-08-01T19:58:41.813499+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL, 1
2016-08-01T19:58:41.813505+09:00 compute-0-6 ovs-vswitchd[140481]: 
VHOST_CONFIG: vring call idx:0 file:215
...

Note that the ", 1" at the end of the log entries is the file handle index
added in a debug build of DPDK, not part of vanilla DPDK.


At high packet rates this can cause the kicking of the guest to fail
repeatedly while enqueueing packets, because vq->callfd is not valid
while it is being reconfigured.

Sporadically this leads to the virtio ring becoming full. Once the ring is
full, the enqueue code in DPDK stops kicking the guest. Since the guest is
interrupt driven and has missed some kicks, it never empties the virtio
ring. Possibly there is also some flaw in the guest virtio driver that
contributes to this.

To "solve" this problem, the kick operation in virtio_dev_merge_rx() was
moved out of the pkt_idx > 0 condition. A similar change was made in
virtio_dev_rx().


Original vhost_rxtx.c, virtio_dev_merge_rx():
...
merge_rx_exit:
        if (likely(pkt_idx)) {
                /* flush used->idx update before we read avail->flags. */
                rte_mb();

                /* Kick the guest if necessary. */
                if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
                        eventfd_write(vq->callfd, (eventfd_t)1);
        }

        return pkt_idx;
}
...
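
For reference, roughly what the modified exit path looks like with the kick
moved outside the pkt_idx check (based on the excerpt above; the exact
placement in our builds may differ slightly):
...
merge_rx_exit:
        if (likely(pkt_idx)) {
                /* flush used->idx update before we read avail->flags. */
                rte_mb();
        }

        /* Kick the guest even when no packets were enqueued (pkt_idx == 0),
         * e.g. because the ring was full, so that an interrupt-driven guest
         * that missed an earlier kick does not stay stuck on a full ring. */
        if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
                eventfd_write(vq->callfd, (eventfd_t)1);

        return pkt_idx;
}
...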


Questions

   - Is it a valid operation to change the call/kick file descriptors
     (frequently) during device operation?

   - For stability reasons it seems prudent to me to perform a kick even
     when the virtio ring is full. Given that the code explicitly checks
     whether any packets were put on the ring before kicking, could it be
     that there is a penalty for kicking at ring full?

   - Would there be other ways to protect against the call file descriptor
     changing frequently? Assuming that virtio device events in the guest
     will cause the occasional SET_VRING_CALL message as part of normal
     operation.


Any discussion on this topic will be appreciated.


Regards,

Patrik



[dpdk-dev] [RFC] vhost user: add error handling for fd > 1023

2016-04-11 Thread Patrik Andersson R
Yes, that is correct. Closing the socket on failure needs to be added.

Regards,

Patrik



On 04/11/2016 11:34 AM, Christian Ehrhardt wrote:
> I like the approach as well to go for the fix for robustness first.
>
> I was accidentally able to find another test case that hits the same root
> cause.
> Adding guests with 15 vhost_user based NICs each, while having rxq for
> openvswitch-dpdk set to 4 and multiqueue for the guest devices at 4,
> already breaks when adding the third such guest.
> That is way earlier than I would have expected the fds to be
> exhausted, but it is still the same root cause, so just another test for
> the same.
>
> In prep for the wider check of the patch, a minor review question from me:
> in the section of rte_vhost_driver_register that now detects if there
> were issues, we might want to close the socket fd as well when bailing out.
> Otherwise we would just have another source of fd leaks. Or would that
> fd be reused later on, even now that we have freed vserver->path and
> vserver itself?
>
>     ret = fdset_add(&g_vhost_server.fdset, vserver->listenfd,
>                     vserver_new_vq_conn, NULL, vserver);
>     if (ret < 0) {
>         pthread_mutex_unlock(&g_vhost_server.server_mutex);
>         RTE_LOG(ERR, VHOST_CONFIG,
>             "failed to add listen fd %d to vhost server fdset\n",
>             vserver->listenfd);
>         free(vserver->path);
> +       close(vserver->listenfd);
>         free(vserver);
>         return -1;
>     }
>
>
> Christian Ehrhardt
> Software Engineer, Ubuntu Server
> Canonical Ltd
>
> On Mon, Apr 11, 2016 at 8:06 AM, Patrik Andersson R 
> <patrik.r.andersson at ericsson.com> wrote:
>
> I fully agree with this course of action.
>
> Thank you,
>
> Patrik
>
>
>
> On 04/08/2016 08:47 AM, Xie, Huawei wrote:
>
>     On 4/7/2016 10:52 PM, Christian Ehrhardt wrote:
>
>         I totally agree that there is no deterministic rule for what
>         to expect. The only rule is that #fd certainly always is >
>         #vhost_user devices. In various setup variants I've crossed
>         fd 1024 anywhere between 475 and 970 vhost_user ports.
>
>         Once the discussion continues and we have an updated version
>         of the patch with some more agreement I hope I can help to
>         test it.
>
>     Thanks. Let us first temporarily fix this problem for robustness,
>     then we consider whether to upgrade to (e)poll.
>     Will check the patch in detail later. Basically it should work but
>     we need to check whether we need extra fixes elsewhere.
>
>
>
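
As background to the fd exhaustion seen above: a select()-based dispatcher
cannot handle descriptors at or above FD_SETSIZE (typically 1024), which
fits things breaking once fd 1024 is crossed. The fragment below only
illustrates that limit and the kind of early rejection being discussed; it
is not the actual DPDK patch, and fdset_add_checked() is a made-up name:

#include <sys/select.h>

/* Illustration only: FD_SET() on an fd >= FD_SETSIZE writes outside the
 * fd_set bitmap (undefined behaviour), so an add routine should reject
 * such descriptors and let the caller clean up, e.g. close the fd. */
static int
fdset_add_checked(fd_set *rfds, int fd)
{
        if (fd < 0 || fd >= FD_SETSIZE)
                return -1;      /* caller must handle the error */
        FD_SET(fd, rfds);
        return 0;
}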



[dpdk-dev] [RFC] vhost user: add error handling for fd > 1023

2016-04-11 Thread Patrik Andersson R
I fully agree with this course of action.

Thank you,

Patrik


On 04/08/2016 08:47 AM, Xie, Huawei wrote:
> On 4/7/2016 10:52 PM, Christian Ehrhardt wrote:
>> I totally agree that there is no deterministic rule for what to expect.
>> The only rule is that #fd certainly always is > #vhost_user devices.
>> In various setup variants I've crossed fd 1024 anywhere between 475
>> and 970 vhost_user ports.
>>
>> Once the discussion continues and we have an updated version of the
>> patch with some more agreement I hope I can help to test it.
> Thanks. Let us first temporarily fix this problem for robustness, then
> we consider whether to upgrade to (e)poll.
> Will check the patch in detail later. Basically it should work but we need
> to check whether we need extra fixes elsewhere.



[dpdk-dev] vhost: no protection against malformed queue descriptors in rte_vhost_dequeue_burst()

2016-03-17 Thread Patrik Andersson R
Hi Huawei,

thank you for the quick response and for the pointer to the 16.04-rc1
version. Nice!

I think it would also be great to have a sanity check on the gpa_to_vva()
return value. Although nothing recent has hit it, we had some problems in
that area in the past.
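
Something along these lines is what I have in mind (a sketch only; the
break on failure is just for illustration, not a proposal for the exact
handling):

vb_addr = gpa_to_vva(dev, desc->addr);
if (unlikely(vb_addr == 0)) {
        /* guest-physical address not covered by any mapped region:
         * do not copy from it */
        break;
}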

Regards,

Patrik

On 03/17/2016 02:35 AM, Xie, Huawei wrote:
> On 3/16/2016 8:53 PM, Patrik Andersson R wrote:
>> Hello,
>>
>> When taking a snapshot of a running VM instance, using OpenStack
>> "nova image-create", I noticed that one OVS pmd-thread eventually
>> failed in DPDK rte_vhost_dequeue_burst() with repeating log entries:
>>
>> compute-0-6 ovs-vswitchd[38172]: VHOST_DATA: Failed to allocate
>> memory for mbuf.
>>
>>
>> Debugging (data included further down) this issue led to the
>> observation that there is no protection against malformed vhost
>> queue descriptors; thus tenant separation might be violated, as a
>> single faulty VM might bring down the connectivity of all VMs
>> connected to the same virtual switch.
>>
>> To avoid this, validation would be needed at some points in the
>> rte_vhost_dequeue_burst() code:
>>
>>1) when the queue descriptor is picked up for processing,
>>desc->flags and desc->len might both be 0
>>
>> ...
>> desc = &vq->desc[head[entry_success]];
>> ...
>> /* Discard first buffer as it is the virtio header */
>> if (desc->flags & VRING_DESC_F_NEXT) {
>>         desc = &vq->desc[desc->next];
>>         vb_offset = 0;
>>         vb_avail = desc->len;
>> } else {
>>         vb_offset = vq->vhost_hlen;
>>         vb_avail = desc->len - vb_offset;
>> }
>>
>>    2) at the buffer address translation, gpa_to_vva(), which might fail,
>>    returning NULL as an indication
>>
>> vb_addr = gpa_to_vva(dev, desc->addr);
>> ...
>> while (cpy_len != 0) {
>>  rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
>>  (void *)((uintptr_t)(vb_addr + vb_offset)),
>>  cpy_len);
>> ...
>> }
>> ...
>>
>>
>> Wondering if there are any plans of adding any kind of validation in
>> DPDK, or if it would be useful to suggest specific implementation of
>> such validations in the DPDK code?
>>
>> Or is there some mechanism that gives us the confidence to trust
>> the vhost queue content absolutely?
>>
>>
>>
>> Debugging data:
>>
>> For my scenario the problem occurs in DPDK rte_vhost_dequeue_burst()
>> due to use of a vhost queue descriptor that has all fields 0:
>>
>>(gdb) print *desc
>> {addr = 0, len = 0, flags = 0, next = 0}
>>
>>
>> Subsequent use of desc->len to compute vb_avail = desc->len - vb_offset
>> leads to the problem observed. What happens is that the packet needs to
>> be segmented; on my system it fails at roughly segment 122000, when the
>> memory available for mbufs runs out.
>>
>> The relevant local variables for rte_vhost_dequeue_burst() when breaking
>> on the condition desc->len == 0:
>>
>> vb_avail = 4294967284  (0xfffffff4)
>> seg_avail = 2608
>> vb_offset = 12
>> cpy_len = 2608
>> seg_num = 1
>> desc = 0x2aadb6e5c000
>> vb_addr = 46928960159744
>> entry_success = 0
>>
>> Note also that there is no crash despite desc->addr being zero; it is a
>> valid address in the regions mapped to the device. The 3 mapped regions
>> do not seem to be correct either at this stage, though.
>>
>>
>> The versions that I'm running are OVS 2.4.0, with corrections from the
>> 2.4 branch, and DPDK 2.1.0. QEMU emulator version 2.2.0 and
>> libvirt version 1.2.12.
>>
>>
>> Regards,
>>
>> Patrik
> Thanks Patrik. You are right. We had planned to enhance the robustness
> of vhost so that neither a malicious nor a buggy guest virtio driver could
> corrupt vhost. Actually the 16.04 RC1 has already fixed some issues (the
> return of gpa_to_vva isn't checked).
>



[dpdk-dev] vhost: no protection against malformed queue descriptors in rte_vhost_dequeue_burst()

2016-03-16 Thread Patrik Andersson R
Hello,

When taking a snapshot of a running VM instance, using OpenStack
"nova image-create", I noticed that one OVS pmd-thread eventually
failed in DPDK rte_vhost_dequeue_burst() with repeating log entries:

compute-0-6 ovs-vswitchd[38172]: VHOST_DATA: Failed to allocate 
memory for mbuf.


Debugging (data included further down) this issue led to the
observation that there is no protection against malformed vhost
queue descriptors; thus tenant separation might be violated, as a
single faulty VM might bring down the connectivity of all VMs
connected to the same virtual switch.

To avoid this, validation would be needed at a couple of points in the
rte_vhost_dequeue_burst() code (an illustrative sketch of both checks
follows the excerpts below):

   1) when the queue descriptor is picked up for processing,
   desc->flags and desc->len might both be 0

...
desc = &vq->desc[head[entry_success]];
...
/* Discard first buffer as it is the virtio header */
if (desc->flags & VRING_DESC_F_NEXT) {
        desc = &vq->desc[desc->next];
        vb_offset = 0;
        vb_avail = desc->len;
} else {
        vb_offset = vq->vhost_hlen;
        vb_avail = desc->len - vb_offset;
}

   2) at the buffer address translation, gpa_to_vva(), which might fail,
   returning NULL as an indication

vb_addr = gpa_to_vva(dev, desc->addr);
...
while (cpy_len != 0) {
 rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
 (void *)((uintptr_t)(vb_addr + vb_offset)),
 cpy_len);
...
}
...
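
Purely as an illustration of the two checks meant in 1) and 2) above (the
break on failure is an assumption for the example, not a concrete patch
proposal), something along these lines:

...
desc = &vq->desc[head[entry_success]];

/* 1) reject descriptors whose length cannot even hold the virtio header,
 *    so that vb_avail = desc->len - vb_offset cannot underflow */
if (unlikely(!(desc->flags & VRING_DESC_F_NEXT) &&
             desc->len < vq->vhost_hlen))
        break;

/* 2) reject failed guest-physical to host-virtual translations */
vb_addr = gpa_to_vva(dev, desc->addr);
if (unlikely(vb_addr == 0))
        break;
...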


Wondering if there are any plans of adding any kind of validation in
DPDK, or if it would be useful to suggest specific implementation of
such validations in the DPDK code?

Or is there some mechanism that gives us the confidence to trust
the vhost queue content absolutely?



Debugging data:

For my scenario the problem occurs in DPDK rte_vhost_dequeue_burst()
due to use of a vhost queue descriptor that has all fields 0:

   (gdb) print *desc
{addr = 0, len = 0, flags = 0, next = 0}


Subsequent use of desc->len to compute vb_avail = desc->len - vb_offset
leads to the problem observed. What happens is that the packet needs to
be segmented; on my system it fails at roughly segment 122000, when the
memory available for mbufs runs out.

The relevant local variables for rte_vhost_dequeue_burst() when breaking
on the condition desc->len == 0:

vb_avail = 4294967284  (0xfffffff4)
seg_avail = 2608
vb_offset = 12
cpy_len = 2608
seg_num = 1
desc = 0x2aadb6e5c000
vb_addr = 46928960159744
entry_success = 0
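
The vb_avail value is simply the unsigned 32-bit wrap-around of
desc->len - vb_offset with the values above:

/* desc->len (0) minus vb_offset (12), computed in uint32_t */
uint32_t vb_avail = (uint32_t)0 - 12;   /* 4294967284 == 0xfffffff4 */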

Note also that there is no crash despite desc->addr being zero; it is a
valid address in the regions mapped to the device. The 3 mapped regions
do not seem to be correct either at this stage, though.


The versions that I'm running are OVS 2.4.0, with corrections from the
2.4 branch, and DPDK 2.1.0. QEMU emulator version 2.2.0 and
libvirt version 1.2.12.


Regards,

Patrik