[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-22 Thread Thomas Monjalon
2016-07-21 21:35, Yuanhan Liu:
> On Thu, Jul 21, 2016 at 04:19:35PM +0300, Ilya Maximets wrote:
> > If something abnormal happened to QEMU, 'connect()' can block calling
> > thread (e.g. main thread of OVS) forever or for a really long time.
> > This can break whole application or block the reconnection thread.
> > 
> > Example with OVS:
> > 
> > ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
> > (gdb) bt
> > #0  connect () from /lib64/libpthread.so.0
> > #1  vhost_user_create_client (vsocket=0xa816e0)
> > #2  rte_vhost_driver_register
> > #3  netdev_dpdk_vhost_user_construct
> > #4  netdev_open (name=0xa664b0 "vhost1")
> > [...]
> > #11 main
> > 
> > Fix that by setting non-blocking mode for client sockets for connection.
> > 
> > Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> > 
> > Signed-off-by: Ilya Maximets 
> 
> Acked-by: Yuanhan Liu 

Applied, thanks


[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu
On Thu, Jul 21, 2016 at 04:43:25PM +0300, Ilya Maximets wrote:
> 
> 
> On 21.07.2016 16:35, Yuanhan Liu wrote:
> > On Thu, Jul 21, 2016 at 04:19:35PM +0300, Ilya Maximets wrote:
> >> If something abnormal happened to QEMU, 'connect()' can block calling
> >> thread (e.g. main thread of OVS) forever or for a really long time.
> >> This can break whole application or block the reconnection thread.
> >>
> >> Example with OVS:
> >>
> >>ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
> >>(gdb) bt
> >>#0  connect () from /lib64/libpthread.so.0
> >>#1  vhost_user_create_client (vsocket=0xa816e0)
> >>#2  rte_vhost_driver_register
> >>#3  netdev_dpdk_vhost_user_construct
> >>#4  netdev_open (name=0xa664b0 "vhost1")
> >>[...]
> >>#11 main
> >>
> >> Fix that by setting non-blocking mode for client sockets for connection.
> >>
> >> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> >>
> >> Signed-off-by: Ilya Maximets 
> > 
> > Acked-by: Yuanhan Liu 
> > 
> > One help I'd like to ask is that I'd appriciate if you could do the test
> > to make sure that your 2 (latest) patches fix the two issues you reported.
> > 
> > You might have already done that; I just want to make sure.
> 
> I've performed the test with 'ofport_request' script before sending patches.
> And currently test still works. No leaks of descriptors, no hangs,
> no QEMU crashes observed.
> Sometimes network device breaks on QEMU side, but it's QEMU issue. In this
> case I'm receiving following message from DPDK's vhost:
> 
> VHOST_CONFIG: vhost-user client: socket created, fd: 28
> VHOST_CONFIG: failed to connect to /vhost1: Resource temporarily unavailable
> VHOST_CONFIG: /vhost1: reconnecting...
> 
> Before the 'hang' patch there was hang of main thread.
> 
> After QEMU restart all works normally. OVS restart not required.

Good to know and appreciate that!

--yliu



[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu
On Thu, Jul 21, 2016 at 04:19:35PM +0300, Ilya Maximets wrote:
> If something abnormal happened to QEMU, 'connect()' can block calling
> thread (e.g. main thread of OVS) forever or for a really long time.
> This can break whole application or block the reconnection thread.
> 
> Example with OVS:
> 
>   ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
>   (gdb) bt
>   #0  connect () from /lib64/libpthread.so.0
>   #1  vhost_user_create_client (vsocket=0xa816e0)
>   #2  rte_vhost_driver_register
>   #3  netdev_dpdk_vhost_user_construct
>   #4  netdev_open (name=0xa664b0 "vhost1")
>   [...]
>   #11 main
> 
> Fix that by setting non-blocking mode for client sockets for connection.
> 
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> 
> Signed-off-by: Ilya Maximets 

Acked-by: Yuanhan Liu 

One help I'd like to ask is that I'd appriciate if you could do the test
to make sure that your 2 (latest) patches fix the two issues you reported.

You might have already done that; I just want to make sure.

Thanks.

--yliu


[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets


On 21.07.2016 16:35, Yuanhan Liu wrote:
> On Thu, Jul 21, 2016 at 04:19:35PM +0300, Ilya Maximets wrote:
>> If something abnormal happened to QEMU, 'connect()' can block calling
>> thread (e.g. main thread of OVS) forever or for a really long time.
>> This can break whole application or block the reconnection thread.
>>
>> Example with OVS:
>>
>>  ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
>>  (gdb) bt
>>  #0  connect () from /lib64/libpthread.so.0
>>  #1  vhost_user_create_client (vsocket=0xa816e0)
>>  #2  rte_vhost_driver_register
>>  #3  netdev_dpdk_vhost_user_construct
>>  #4  netdev_open (name=0xa664b0 "vhost1")
>>  [...]
>>  #11 main
>>
>> Fix that by setting non-blocking mode for client sockets for connection.
>>
>> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
>>
>> Signed-off-by: Ilya Maximets 
> 
> Acked-by: Yuanhan Liu 
> 
> One help I'd like to ask is that I'd appriciate if you could do the test
> to make sure that your 2 (latest) patches fix the two issues you reported.
> 
> You might have already done that; I just want to make sure.

I've performed the test with 'ofport_request' script before sending patches.
And currently test still works. No leaks of descriptors, no hangs,
no QEMU crashes observed.
Sometimes network device breaks on QEMU side, but it's QEMU issue. In this
case I'm receiving following message from DPDK's vhost:

VHOST_CONFIG: vhost-user client: socket created, fd: 28
VHOST_CONFIG: failed to connect to /vhost1: Resource temporarily unavailable
VHOST_CONFIG: /vhost1: reconnecting...

Before the 'hang' patch there was hang of main thread.

After QEMU restart all works normally. OVS restart not required.

Best regards, Ilya Maximets.


[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets
If something abnormal happened to QEMU, 'connect()' can block calling
thread (e.g. main thread of OVS) forever or for a really long time.
This can break whole application or block the reconnection thread.

Example with OVS:

ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
(gdb) bt
#0  connect () from /lib64/libpthread.so.0
#1  vhost_user_create_client (vsocket=0xa816e0)
#2  rte_vhost_driver_register
#3  netdev_dpdk_vhost_user_construct
#4  netdev_open (name=0xa664b0 "vhost1")
[...]
#11 main

Fix that by setting non-blocking mode for client sockets for connection.

Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

Signed-off-by: Ilya Maximets 
---
This was reproduced with current QEMU master branch
(commit 1ecfb24da987b862f) + patch-set "vhost-user reconnect fixes"
(https://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg01547.html).

OVS was patched to support client mode:
http://openvswitch.org/pipermail/dev/2016-July/074972.html

Following script forces QEMU to fail to initialize vhost because
disconnection occures while device not fully configured:

while true
do
ovs-vsctl set Interface vhost1 ofport_request=125
ovs-vsctl set Interface vhost1 ofport_request=126
done

As a result: QEMU still works, network interface broken and OVS main
 thread stalled inside 'connect()'.

Version 3:
* Removed unnecessary 'getsockopt()' check.

Version 2:
* EINPROGRESS not checked. EISCONN checked instead on
  the next iteration of reconnection loop.



 lib/librte_vhost/vhost_user/vhost-net-user.c | 52 +---
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 62c5ec3..b35594d 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 #include 
@@ -449,6 +450,14 @@ create_unix_socket(const char *path, struct sockaddr_un 
*un, bool is_server)
RTE_LOG(INFO, VHOST_CONFIG, "vhost-user %s: socket created, fd: %d\n",
is_server ? "server" : "client", fd);

+   if (!is_server && fcntl(fd, F_SETFL, O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "vhost-user: can't set nonblocking mode for socket, fd: 
"
+   "%d (%s)\n", fd, strerror(errno));
+   close(fd);
+   return -1;
+   }
+
memset(un, 0, sizeof(*un));
un->sun_family = AF_UNIX;
strncpy(un->sun_path, path, sizeof(un->sun_path));
@@ -516,9 +525,33 @@ struct vhost_user_reconnect_list {
 static struct vhost_user_reconnect_list reconn_list;
 static pthread_t reconn_tid;

+static int
+vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
+{
+   int ret, flags;
+
+   ret = connect(fd, un, sz);
+   if (ret < 0 && errno != EISCONN)
+   return -1;
+
+   flags = fcntl(fd, F_GETFL, 0);
+   if (flags < 0) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't get flags for connfd %d\n", fd);
+   return -2;
+   }
+   if ((flags & O_NONBLOCK) && fcntl(fd, F_SETFL, flags & ~O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't disable nonblocking on fd %d\n", fd);
+   return -2;
+   }
+   return 0;
+}
+
 static void *
 vhost_user_client_reconnect(void *arg __rte_unused)
 {
+   int ret;
struct vhost_user_reconnect *reconn, *next;

while (1) {
@@ -532,13 +565,23 @@ vhost_user_client_reconnect(void *arg __rte_unused)
 reconn != NULL; reconn = next) {
next = TAILQ_NEXT(reconn, next);

-   if (connect(reconn->fd, (struct sockaddr *)>un,
-   sizeof(reconn->un)) < 0)
+   ret = vhost_user_connect_nonblock(reconn->fd,
+   (struct sockaddr *)>un,
+   sizeof(reconn->un));
+   if (ret == -2) {
+   close(reconn->fd);
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "reconnection for fd %d failed\n",
+   reconn->fd);
+   goto remove_fd;
+   }
+   if (ret == -1)
continue;

RTE_LOG(INFO, VHOST_CONFIG,
"%s: connected\n", reconn->vsocket->path);
vhost_user_add_connection(reconn->fd, reconn->vsocket);
+remove_fd: