Re: QAT intermittent healthcheck errors

2019-05-13 Thread Marcin Deranek

Hi Emeric,

On 5/13/19 11:06 AM, Emeric Brun wrote:


Just to let you know that I'm waiting for feedback from Intel's team, and I will
receive QAT 1.7 compliant hardware soon to run some tests here.


Thank you for the update.
Regards,

Marcin Deranek



Re: [External] Re: QAT intermittent healthcheck errors

2019-05-13 Thread Emeric Brun
Hi Marcin,

> 
> Thank you Marcin. It shows that haproxy is waiting for an event on all those
> fds because crypto jobs were launched on the engine
> and we can't free the session until those jobs complete (doing so would result
> in a segfault).
> 
> So the processes are stuck, unable to free the session, because the engine
> doesn't signal the end of those jobs via the async fd.
> 
> I didn't reproduce this issue on QAT 1.5, so I will try to discuss it with the
> intel guys to find out why this behavior changed in v1.7
> and what we can do.
> 
> R,
> Emeric
> 

Just to let you know that I'm waiting for feedback from Intel's team, and I will
receive QAT 1.7 compliant hardware soon to run some tests here.

R,
Emeric
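
For readers following the thread, the constraint described above maps onto the
OpenSSL async API roughly as follows. The sketch below is illustrative only and
is not haproxy's actual code: an SSL object with an in-flight engine job exposes
async fds via SSL_get_all_async_fds(), and SSL_free() has to be deferred until
those fds fire.

/* Illustrative sketch only (not haproxy's code): an SSL session whose crypto
 * job was offloaded to an async engine may only be freed once the engine
 * signals completion on its async fds, otherwise the engine would complete
 * into freed memory. */
#include <sys/epoll.h>
#include <openssl/ssl.h>
#include <openssl/async.h>

static void free_ssl_when_job_done(int epfd, SSL *ssl)
{
    size_t num = 0;

    /* how many async fds does this SSL object currently expose? */
    if (SSL_get_all_async_fds(ssl, NULL, &num) && num > 0 && num <= 32) {
        OSSL_ASYNC_FD fds[32];
        size_t i;

        SSL_get_all_async_fds(ssl, fds, &num);
        for (i = 0; i < num; i++) {
            /* keep polling; SSL_free() happens later from the fd handler,
             * once the engine reports the job as finished */
            struct epoll_event ev = { .events = EPOLLIN, .data.ptr = ssl };
            epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
        }
        return;
    }
    SSL_free(ssl);   /* no pending engine job: safe to free immediately */
}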




Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Emeric Brun
On 5/7/19 3:35 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 5/7/19 1:53 PM, Emeric Brun wrote:
>> On 5/7/19 1:24 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> On 5/7/19 11:44 AM, Emeric Brun wrote:
 Hi Marcin,
>> As I use HAProxy 1.8 I had to adjust the patch (see
>> attachment for end result). Unfortunately after applying the patch there
>> is no change in behavior: we still leak /dev/usdm_drv descriptors and have
>> "stuck" HAProxy instances after reload.
>>> Regards,
>>
>>

 Could you perform a test recompiling the usdm_drv and the engine with this 
 patch, it applies on QAT 1.7 but I've no hardware to test this version 
 here.

 It should fix the fd leak.
>>>
>>> It did fix fd leak:
>>>
>>> # ls -al /proc/2565/fd|fgrep dev
>>> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
>>> lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv
>>>
>>> # systemctl reload haproxy.service
>>> # ls -al /proc/2565/fd|fgrep dev
>>> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
>>> lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv
>>>
>>> # systemctl reload haproxy.service
>>> # ls -al /proc/2565/fd|fgrep dev
>>> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
>>> lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv
>>>
>>> But there are still stuck processes :-( This is with both patches included: 
>>> for QAT and HAProxy.
>>> Regards,
>>>
>>> Marcin Deranek
>>
>> Thank you Marcin! Anyway, it was also a bug.
>>
>> Could you run a 'show fd' command on a stuck process after applying the
>> patch in attachment.
> 
> I did apply this patch and all previous patches (QAT + HAProxy 
> ssl_free_engine). This is what I got after 1st reload:
> 
> show proc
> #<PID>          <type>          <relative PID>  <reloads>       <uptime>
> 8025    master  0   1   0d 00h03m25s
> # workers
> 31269   worker  1   0   0d 00h00m39s
> 31270   worker  2   0   0d 00h00m39s
> 31271   worker  3   0   0d 00h00m39s
> 31272   worker  4   0   0d 00h00m39s
> # old workers
> 9286    worker  [was: 1]    1   0d 00h03m25s
> 9287    worker  [was: 2]    1   0d 00h03m25s
> 9288    worker  [was: 3]    1   0d 00h03m25s
> 9289    worker  [was: 4]    1   0d 00h03m25s
> 
> @!9286 show fd
>  13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x23eaae0 
> iocb=0x4877c0(mworker_accept_wrapper) tmask=0x1 umask=0x0
>  16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e1ab0 
> iocb=0x4e1ab0(thread_sync_io_handler) tmask=0x umask=0x0
>  20 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1601b840 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>  21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x1f0ec4f0 
> iocb=0x4ce6e0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL 
> mux=PASS mux_ctx=0x22ad8630
>    1412 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bab1f30 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1413 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x247e5bc0 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1414 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x18883650 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1415 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x14476c10 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1416 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11a27850 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1418 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x12008230 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1419 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bb0a570 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1420 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c94790 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1421 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1449e050 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1422 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1f00c150 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1423 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x15f40550 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1424 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x124b6340 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1425 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11fe4500 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1426 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c70a60 
> 

Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Marcin Deranek

Hi Emeric,

On 5/7/19 1:53 PM, Emeric Brun wrote:

On 5/7/19 1:24 PM, Marcin Deranek wrote:

Hi Emeric,

On 5/7/19 11:44 AM, Emeric Brun wrote:

Hi Marcin,

As I use HAProxy 1.8 I had to adjust the patch (see attachment for end
result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv
descriptors and have "stuck" HAProxy instances after reload.

Regards,





Could you perform a test recompiling the usdm_drv and the engine with this 
patch, it applies on QAT 1.7 but I've no hardware to test this version here.

It should fix the fd leak.


It did fix fd leak:

# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv

But there are still stuck processes :-( This is with both patches included: for 
QAT and HAProxy.
Regards,

Marcin Deranek


Thank you Marcin! Anyway, it was also a bug.

Could you run a 'show fd' command on a stuck process after applying the patch in
attachment.


I did apply this patch and all previous patches (QAT + HAProxy 
ssl_free_engine). This is what I got after 1st reload:


show proc
#<PID>          <type>          <relative PID>  <reloads>       <uptime>
8025    master  0   1   0d 00h03m25s
# workers
31269   worker  1   0   0d 00h00m39s
31270   worker  2   0   0d 00h00m39s
31271   worker  3   0   0d 00h00m39s
31272   worker  4   0   0d 00h00m39s
# old workers
9286    worker  [was: 1]    1   0d 00h03m25s
9287    worker  [was: 2]    1   0d 00h03m25s
9288    worker  [was: 3]    1   0d 00h03m25s
9289    worker  [was: 4]    1   0d 00h03m25s

@!9286 show fd
 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 
owner=0x23eaae0 iocb=0x4877c0(mworker_accept_wrapper) tmask=0x1 umask=0x0
 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x4e1ab0 iocb=0x4e1ab0(thread_sync_io_handler) 
tmask=0x umask=0x0
 20 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1601b840 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
 21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 
owner=0x1f0ec4f0 iocb=0x4ce6e0(conn_fd_handler) tmask=0x1 umask=0x0 
cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x22ad8630
   1412 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1bab1f30 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1413 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x247e5bc0 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1414 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x18883650 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1415 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x14476c10 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1416 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11a27850 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1418 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x12008230 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1419 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1bb0a570 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1420 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11c94790 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1421 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1449e050 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1422 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1f00c150 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1423 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x15f40550 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1424 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x124b6340 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1425 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11fe4500 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1426 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11c70a60 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1427 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x12572540 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1428 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1249a420 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1430 : st=0x05(R:PrA 

Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Emeric Brun
On 5/7/19 1:24 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 5/7/19 11:44 AM, Emeric Brun wrote:
>> Hi Marcin,
> As I use HAProxy 1.8 I had to adjust the patch (see
> attachment for end result). Unfortunately after applying the patch there is
> no change in behavior: we still leak /dev/usdm_drv descriptors and have
> "stuck" HAProxy instances after reload.
> Regards,


>>
>> Could you perform a test recompiling the usdm_drv and the engine with this 
>> patch, it applies on QAT 1.7 but I've no hardware to test this version here.
>>
>> It should fix the fd leak.
> 
> It did fix fd leak:
> 
> # ls -al /proc/2565/fd|fgrep dev
> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv
> 
> # systemctl reload haproxy.service
> # ls -al /proc/2565/fd|fgrep dev
> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv
> 
> # systemctl reload haproxy.service
> # ls -al /proc/2565/fd|fgrep dev
> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv
> 
> But there are still stuck processes :-( This is with both patches included: 
> for QAT and HAProxy.
> Regards,
> 
> Marcin Deranek

Thank you Marcin! Anyway, it was also a bug.

Could you run a 'show fd' command on a stuck process after applying the patch in
attachment.

R,
Emeric

From d0e095c2aa54f020de8fc50db867eff1ef73350e Mon Sep 17 00:00:00 2001
From: Emeric Brun 
Date: Fri, 19 Apr 2019 17:15:28 +0200
Subject: [PATCH] MINOR: ssl/cli: async fd io-handlers printable on show fd

This patch exports the async fd io-handlers and makes them printable
when doing a 'show fd' on the cli.
---
 include/proto/ssl_sock.h | 4 ++++
 src/cli.c                | 9 +++++++++
 src/ssl_sock.c           | 4 ++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/proto/ssl_sock.h b/include/proto/ssl_sock.h
index 62ebcb87..ce52fb74 100644
--- a/include/proto/ssl_sock.h
+++ b/include/proto/ssl_sock.h
@@ -85,6 +85,10 @@ SSL_CTX *ssl_sock_get_generated_cert(unsigned int key, struct bind_conf *bind_co
 int ssl_sock_set_generated_cert(SSL_CTX *ctx, unsigned int key, struct bind_conf *bind_conf);
 unsigned int ssl_sock_generated_cert_key(const void *data, size_t len);
 
+#if (OPENSSL_VERSION_NUMBER >= 0x1010000fL) && !defined(OPENSSL_NO_ASYNC)
+void ssl_async_fd_handler(int fd);
+void ssl_async_fd_free(int fd);
+#endif
 
 /* ssl shctx macro */
 
diff --git a/src/cli.c b/src/cli.c
index 568ceba2..843c3d04 100644
--- a/src/cli.c
+++ b/src/cli.c
@@ -69,6 +69,9 @@
 #include 
 #include 
 #include 
+#ifdef USE_OPENSSL
+#include 
+#endif
 
 #define PAYLOAD_PATTERN "<<"
 
@@ -998,6 +1001,12 @@ static int cli_io_handler_show_fd(struct appctx *appctx)
 			 (fdt.iocb == listener_accept)  ? "listener_accept" :
 			 (fdt.iocb == poller_pipe_io_handler) ? "poller_pipe_io_handler" :
 			 (fdt.iocb == mworker_accept_wrapper) ? "mworker_accept_wrapper" :
+#ifdef USE_OPENSSL
+#if (OPENSSL_VERSION_NUMBER >= 0x1010000fL) && !defined(OPENSSL_NO_ASYNC)
+			 (fdt.iocb == ssl_async_fd_free) ? "ssl_async_fd_free" :
+			 (fdt.iocb == ssl_async_fd_handler) ? "ssl_async_fd_handler" :
+#endif
+#endif
 			 "unknown");
 
 		if (fdt.iocb == conn_fd_handler) {
diff --git a/src/ssl_sock.c b/src/ssl_sock.c
index 112520c8..58ae8a26 100644
--- a/src/ssl_sock.c
+++ b/src/ssl_sock.c
@@ -573,7 +573,7 @@ fail_get:
 /*
  * openssl async fd handler
  */
-static void ssl_async_fd_handler(int fd)
+void ssl_async_fd_handler(int fd)
 {
 	struct connection *conn = fdtab[fd].owner;
 
@@ -594,7 +594,7 @@ static void ssl_async_fd_handler(int fd)
 /*
  * openssl async delayed SSL_free handler
  */
-static void ssl_async_fd_free(int fd)
+void ssl_async_fd_free(int fd)
 {
 	SSL *ssl = fdtab[fd].owner;
 	OSSL_ASYNC_FD all_fd[32];
-- 
2.17.1



Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Marcin Deranek

Hi Emeric,

On 5/7/19 11:44 AM, Emeric Brun wrote:

Hi Marcin,

As I use HAProxy 1.8 I had to adjust the patch (see attachment for end
result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv
descriptors and have "stuck" HAProxy instances after reload.

Regards,





Could you perform a test recompiling the usdm_drv and the engine with this 
patch, it applies on QAT 1.7 but I've no hardware to test this version here.

It should fix the fd leak.


It did fix fd leak:

# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv

But there are still stuck processes :-( This is with both patches 
included: for QAT and HAProxy.

Regards,

Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-05-07 Thread Marcin Deranek

On 5/7/19 11:44 AM, Emeric Brun wrote:


Could you perform a test recompiling the usdm_drv and the engine with this 
patch, it applies on QAT 1.7 but I've no hardware to test this version here.

It should fix the fd leak.


Will do and report back.

Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-05-07 Thread Emeric Brun
Hi Marcin,

> As I use HAProxy 1.8 I had to adjust the patch (see attachment
> for end result). Unfortunately after applying the patch there is no change in
> behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy
> instances after reload.
> Regards,
>>
>>

Could you perform a test recompiling usdm_drv and the engine with this
patch? It applies to QAT 1.7, but I have no hardware to test this version here.

It should fix the fd leak.

R,
Emeric
diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c
--- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c	2019-05-07 11:35:15.654202291 +0200
+++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c	2019-05-07 11:35:44.302292417 +0200
@@ -104,7 +104,7 @@
 /* standard page size */
 page_size = getpagesize();
 
-fd = qae_open("/proc/self/pagemap", O_RDONLY);
+fd = qae_open("/proc/self/pagemap", O_RDONLY|O_CLOEXEC);
 if (fd < 0)
 {
 return 0;
diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c
--- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c	2019-03-15 15:23:43.0 +0100
+++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c	2019-05-07 11:24:08.755921241 +0200
@@ -745,7 +745,7 @@
 
 if (fd > 0)
 close(fd);
-fd = qae_open(QAE_MEM, O_RDWR);
+fd = qae_open(QAE_MEM, O_RDWR|O_CLOEXEC);
 if (fd < 0)
 {
 CMD_ERROR("%s:%d Unable to initialize memory file handle %s \n",

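
As background on why the one-flag change above is enough: a descriptor opened
with O_CLOEXEC is closed automatically across execve(), so the re-executed
master no longer inherits an extra /dev/usdm_drv handle on each reload. A
minimal, purely illustrative sketch (not QAT driver code):

/* Illustrative only: O_CLOEXEC marks the fd close-on-exec, so it does not
 * survive the execve() performed by the haproxy master on reload. */
#include <fcntl.h>

static int open_device_cloexec(const char *path)
{
    int fd = open(path, O_RDWR | O_CLOEXEC);

    /* Equivalent after the fact, when the open() call itself cannot be changed:
     *   fcntl(fd, F_SETFD, FD_CLOEXEC);
     */
    return fd;
}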

Re: QAT intermittent healthcheck errors

2019-05-06 Thread Emeric Brun
Hi Marcin,

On 5/6/19 3:31 PM, Emeric Brun wrote:
> Hi Marcin,
> 
> On 5/6/19 3:15 PM, Marcin Deranek wrote:
>> Hi Emeric,
>>
>> On 5/3/19 5:54 PM, Emeric Brun wrote:
>>> Hi Marcin,
>>>
>>> On 5/3/19 4:56 PM, Marcin Deranek wrote:
 Hi Emeric,

 On 5/3/19 4:50 PM, Emeric Brun wrote:

> I've a testing platform here but I don't use the usdm_drv but the 
> qat_contig_mem and I don't reproduce this issue (I'm using QAT 1.5, as 
> the doc says to use with my chip) .

 I see. I use qat 1.7 and qat-engine 0.5.40.

> Anyway, could you re-compile a haproxy's binary if I provide you a 
> testing patch?

 Sure, that should not be a problem.
>>>
>>> The patch in attachment.
>>
>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end 
>> result). Unfortunately after applying the patch there is no change in 
>> behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy 
>> instances after reload..
>> Regards,
> 
> 
> Ok, the patch adds a ENGINE_finish() call before the reload. I was supposing 
> that the ENGINE_finish would perform the close of the fd because on the 
> application side there is no different way to interact with the engine.
> 
> Unfortunately, this is not the case. So I will ask to the intel guys what we 
> supposed to do to close this fd.


I've just written to my contact at intel.

Just for the note: I'm using hardware supported by QAT 1.5; in this version
usdm_drv was not present, so I use the other option, qat_contig_mem, which does
not seem to cause such an fd leak.

Perhaps switching to this one would be a workaround if you want to continue
testing while waiting for the intel guys' reply.

R,
Emeric



Re: QAT intermittent healthcheck errors

2019-05-06 Thread Emeric Brun
Hi Marcin,

On 5/6/19 3:15 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 5/3/19 5:54 PM, Emeric Brun wrote:
>> Hi Marcin,
>>
>> On 5/3/19 4:56 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> On 5/3/19 4:50 PM, Emeric Brun wrote:
>>>
 I've a testing platform here but I don't use the usdm_drv but the 
 qat_contig_mem and I don't reproduce this issue (I'm using QAT 1.5, as the 
 doc says to use with my chip) .
>>>
>>> I see. I use qat 1.7 and qat-engine 0.5.40.
>>>
 Anyway, could you re-compile a haproxy's binary if I provide you a testing 
 patch?
>>>
>>> Sure, that should not be a problem.
>>
>> The patch in attachment.
> 
> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end 
> result). Unfortunately after applying the patch there is no change in 
> behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy 
> instances after reload..
> Regards,


OK, the patch adds an ENGINE_finish() call before the reload. I assumed
that ENGINE_finish() would close the fd, because on the application side
there is no other way to interact with the engine.

Unfortunately, this is not the case, so I will ask the intel guys what we are
supposed to do to close this fd.

R,
Emeric 
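
For context on the calls being discussed, here is a rough, purely illustrative
sketch of the OpenSSL ENGINE lifecycle (not haproxy's code; the engine id and
error handling are placeholders). ENGINE_finish() only releases the functional
reference taken by ENGINE_init(); whether the engine closes its device fds at
that point is up to the engine implementation, which is exactly the open
question here.

/* Illustrative ENGINE lifecycle sketch, not haproxy's code. */
#include <openssl/engine.h>

static int use_engine_once(const char *id)
{
    ENGINE *e = ENGINE_by_id(id);   /* structural reference (e.g. "qat") */

    if (!e)
        return -1;
    if (!ENGINE_init(e)) {          /* functional reference: the engine may open /dev nodes here */
        ENGINE_free(e);
        return -1;
    }
    /* ... offload crypto work through the engine ... */
    ENGINE_finish(e);               /* releases the functional reference only */
    ENGINE_free(e);                 /* releases the structural reference */
    return 0;
}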




Re: [External] Re: QAT intermittent healthcheck errors

2019-05-03 Thread Emeric Brun
Hi Marcin,

On 5/3/19 4:56 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 5/3/19 4:50 PM, Emeric Brun wrote:
> 
>> I've a testing platform here but I don't use the usdm_drv but the 
>> qat_contig_mem and I don't reproduce this issue (I'm using QAT 1.5, as the 
>> doc says to use with my chip) .
> 
> I see. I use qat 1.7 and qat-engine 0.5.40.
> 
>> Anyway, could you re-compile a haproxy's binary if I provide you a testing 
>> patch?
> 
> Sure, that should not be a problem.

The patch in attachment.
> 
>> The idea is to perform a deinit in the master to force a close of those 
>> '/dev's at each reload. Perhaps It won't fix our issue but this leak of fd 
>> should not be.
> 
> Hope this will give us at least some more insight..
> Regards,
> 
> Marcin Deranek

R,
Emeric
From ca57857a492e898759ef211a8fd9714d0f7dd7fa Mon Sep 17 00:00:00 2001
From: Emeric Brun 
Date: Fri, 3 May 2019 17:06:59 +0200
Subject: [PATCH] BUG/MEDIUM: ssl: fix ssl engine's open fds are leaking.

The master didn't call the engine deinit, resulting
in a leak of the fds opened by the engine during init. The
workers inherit these accumulated fds at each reload.

This patch adds a call to the engine deinit on the master just
before reloading with an exec.
---
 src/haproxy.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/src/haproxy.c b/src/haproxy.c
index 603f084c..f77eb1b4 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -588,6 +588,13 @@ void mworker_reload()
 	if (fdtab)
 		deinit_pollers();
 
+#if defined(USE_OPENSSL)
+#ifndef OPENSSL_NO_ENGINE
+	/* Engines may have opened fds and we must close them */
+	ssl_free_engines();
+#endif
+#endif
+
 	/* restore the initial FD limits */
 	limit.rlim_cur = rlim_fd_cur_at_boot;
 	limit.rlim_max = rlim_fd_max_at_boot;
-- 
2.17.1



Re: [External] Re: QAT intermittent healthcheck errors

2019-05-03 Thread Marcin Deranek

Hi Emeric,

On 5/3/19 4:50 PM, Emeric Brun wrote:


I've a testing platform here but I don't use the usdm_drv but the 
qat_contig_mem and I don't reproduce this issue (I'm using QAT 1.5, as the doc 
says to use with my chip) .


I see. I use qat 1.7 and qat-engine 0.5.40.


Anyway, could you re-compile a haproxy's binary if I provide you a testing 
patch?


Sure, that should not be a problem.



The idea is to perform a deinit in the master to force a close of those '/dev's 
at each reload. Perhaps It won't fix our issue but this leak of fd should not 
be.


Hope this will give us at least some more insight..
Regards,

Marcin Deranek


On 5/3/19 4:21 PM, Marcin Deranek wrote:

Hi Emeric,

It looks like on every reload master leaks /dev/usdm_drv device:

# systemctl restart haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv
lrwx-- 1 root root 64 May  3 15:40 9 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May  3 15:40 10 -> /dev/usdm_drv
lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv
lrwx-- 1 root root 64 May  3 15:40 9 -> /dev/usdm_drv

Obviously workers do inherit this from the master. Looking at workers I see the 
following:

* 1st gen:

# ls -al /proc/36083/fd|awk '/dev/ {print $NF}'|sort
/dev/null
/dev/null
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_dev_processes
/dev/uio19
/dev/uio3
/dev/uio35
/dev/usdm_drv

* 2nd gen:

# ls -al /proc/41637/fd|awk '/dev/ {print $NF}'|sort
/dev/null
/dev/null
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_dev_processes
/dev/uio23
/dev/uio39
/dev/uio7
/dev/usdm_drv
/dev/usdm_drv

Looks like only /dev/usdm_drv is leaked.

Cheers,

Marcin Deranek

On 5/3/19 2:22 PM, Emeric Brun wrote:

Hi Marcin,

On 4/29/19 6:41 PM, Marcin Deranek wrote:

Hi Emeric,

On 4/29/19 3:42 PM, Emeric Brun wrote:

Hi Marcin,




I've also a contact at intel who told me to try this option on the qat engine:


--disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
    Disable/Enable the engine from being initialized automatically 
following a
    fork operation. This is useful in a situation where you want to tightly
    control how many instances are being used for processes. For instance 
if an
    application forks to start a process that does not utilize QAT currently
    the default behaviour is for the engine to still automatically get 
started
    in the child using up an engine instance. After using this flag either 
the
    engine needs to be initialized manually using the engine message:
    INIT_ENGINE or will automatically get initialized on the first QAT 
crypto
    operation. The initialization on fork is enabled by default.


I tried to build QAT Engine with disabled auto init, but that did not help. Now 
I get the following during startup:

2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 Unable to 
initialize memory file handle /dev/usdm_drv
2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 
[29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure


" INIT_ENGINE or will automatically get initialized on the first QAT crypto 
operation"

Perhaps the init appears "with first qat crypto operation" and is delayed after 
the fork so if a chroot is configured, it doesn't allow some accesses
to /dev. Could you perform a test in that case without chroot enabled in the 
haproxy config ?


Removed chroot and now it initializes properly. Unfortunately reload still causes 
"stuck" HAProxy process :-(

Marcin Deranek


Could you check with "ls -l /proc/<pid>/fd" if the "/dev/<device>" is
open multiple times after a reload?

Emeric







Re: [External] Re: QAT intermittent healthcheck errors

2019-05-03 Thread Emeric Brun
Hi Marcin,

Good, so we are making progress!

I have a testing platform here, but I use qat_contig_mem rather than usdm_drv,
and I don't reproduce this issue (I'm using QAT 1.5, as the doc says to use
with my chip).

Anyway, could you recompile a haproxy binary if I provide you with a testing
patch?

The idea is to perform a deinit in the master to force a close of those '/dev's
at each reload. Perhaps it won't fix our issue, but this fd leak should not
happen regardless.

R,
Emeric

On 5/3/19 4:21 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> It looks like on every reload master leaks /dev/usdm_drv device:
> 
> # systemctl restart haproxy.service
> # ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
> lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
> lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv
> 
> # systemctl reload haproxy.service
> # ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
> lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
> lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv
> lrwx-- 1 root root 64 May  3 15:40 9 -> /dev/usdm_drv
> 
> # systemctl reload haproxy.service
> # ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
> lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
> lrwx-- 1 root root 64 May  3 15:40 10 -> /dev/usdm_drv
> lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv
> lrwx-- 1 root root 64 May  3 15:40 9 -> /dev/usdm_drv
> 
> Obviously workers do inherit this from the master. Looking at workers I see 
> the following:
> 
> * 1st gen:
> 
> # ls -al /proc/36083/fd|awk '/dev/ {print $NF}'|sort
> /dev/null
> /dev/null
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_dev_processes
> /dev/uio19
> /dev/uio3
> /dev/uio35
> /dev/usdm_drv
> 
> * 2nd gen:
> 
> # ls -al /proc/41637/fd|awk '/dev/ {print $NF}'|sort
> /dev/null
> /dev/null
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_dev_processes
> /dev/uio23
> /dev/uio39
> /dev/uio7
> /dev/usdm_drv
> /dev/usdm_drv
> 
> Looks like only /dev/usdm_drv is leaked.
> 
> Cheers,
> 
> Marcin Deranek
> 
> On 5/3/19 2:22 PM, Emeric Brun wrote:
>> Hi Marcin,
>>
>> On 4/29/19 6:41 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> On 4/29/19 3:42 PM, Emeric Brun wrote:
 Hi Marcin,

>
>> I've also a contact at intel who told me to try this option on the qat 
>> engine:
>>
>>> --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
>>>    Disable/Enable the engine from being initialized automatically 
>>> following a
>>>    fork operation. This is useful in a situation where you want to 
>>> tightly
>>>    control how many instances are being used for processes. For 
>>> instance if an
>>>    application forks to start a process that does not utilize QAT 
>>> currently
>>>    the default behaviour is for the engine to still automatically 
>>> get started
>>>    in the child using up an engine instance. After using this flag 
>>> either the
>>>    engine needs to be initialized manually using the engine message:
>>>    INIT_ENGINE or will automatically get initialized on the first 
>>> QAT crypto
>>>    operation. The initialization on fork is enabled by default.
>
> I tried to build QAT Engine with disabled auto init, but that did not 
> help. Now I get the following during startup:
>
> 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 
> Unable to initialize memory file handle /dev/usdm_drv
> 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 
> [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure

 " INIT_ENGINE or will automatically get initialized on the first QAT 
 crypto operation"

 Perhaps the init appears "with first qat crypto operation" and is delayed 
 after the fork so if a chroot is configured, it doesn't allow some accesses
 to /dev. Could you perform a test in that case without chroot enabled in 
 the haproxy config ?
>>>
>>> Removed chroot and now it initializes properly. Unfortunately reload still 
>>> causes "stuck" HAProxy process :-(
>>>
>>> Marcin Deranek
>>
>> Could you check with "ls -l /proc//fd" if the "/dev/" 
>> is open multiple times after a reload?
>>
>> Emeric
>>




Re: [External] Re: QAT intermittent healthcheck errors

2019-05-03 Thread Marcin Deranek

Hi Emeric,

It looks like on every reload the master leaks a /dev/usdm_drv descriptor:

# systemctl restart haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv
lrwx-- 1 root root 64 May  3 15:40 9 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd|fgrep dev
lr-x-- 1 root root 64 May  3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May  3 15:40 10 -> /dev/usdm_drv
lrwx-- 1 root root 64 May  3 15:40 7 -> /dev/usdm_drv
lrwx-- 1 root root 64 May  3 15:40 9 -> /dev/usdm_drv

Obviously workers do inherit this from the master. Looking at workers I 
see the following:


* 1st gen:

# ls -al /proc/36083/fd|awk '/dev/ {print $NF}'|sort
/dev/null
/dev/null
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_dev_processes
/dev/uio19
/dev/uio3
/dev/uio35
/dev/usdm_drv

* 2nd gen:

# ls -al /proc/41637/fd|awk '/dev/ {print $NF}'|sort
/dev/null
/dev/null
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_dev_processes
/dev/uio23
/dev/uio39
/dev/uio7
/dev/usdm_drv
/dev/usdm_drv

Looks like only /dev/usdm_drv is leaked.

Cheers,

Marcin Deranek

On 5/3/19 2:22 PM, Emeric Brun wrote:

Hi Marcin,

On 4/29/19 6:41 PM, Marcin Deranek wrote:

Hi Emeric,

On 4/29/19 3:42 PM, Emeric Brun wrote:

Hi Marcin,




I've also a contact at intel who told me to try this option on the qat engine:


--disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
   Disable/Enable the engine from being initialized automatically following 
a
   fork operation. This is useful in a situation where you want to tightly
   control how many instances are being used for processes. For instance if 
an
   application forks to start a process that does not utilize QAT currently
   the default behaviour is for the engine to still automatically get 
started
   in the child using up an engine instance. After using this flag either 
the
   engine needs to be initialized manually using the engine message:
   INIT_ENGINE or will automatically get initialized on the first QAT crypto
   operation. The initialization on fork is enabled by default.


I tried to build QAT Engine with disabled auto init, but that did not help. Now 
I get the following during startup:

2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 Unable to 
initialize memory file handle /dev/usdm_drv
2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 
[29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure


" INIT_ENGINE or will automatically get initialized on the first QAT crypto 
operation"

Perhaps the init appears "with first qat crypto operation" and is delayed after 
the fork so if a chroot is configured, it doesn't allow some accesses
to /dev. Could you perform a test in that case without chroot enabled in the 
haproxy config ?


Removed chroot and now it initializes properly. Unfortunately reload still causes 
"stuck" HAProxy process :-(

Marcin Deranek


Could you check with "ls -l /proc/<pid>/fd" if the "/dev/<device>" is
open multiple times after a reload?

Emeric





Re: [External] Re: QAT intermittent healthcheck errors

2019-05-03 Thread Emeric Brun
Hi Marcin,

On 4/29/19 6:41 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 4/29/19 3:42 PM, Emeric Brun wrote:
>> Hi Marcin,
>>
>>>
 I've also a contact at intel who told me to try this option on the qat 
 engine:

> --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
>   Disable/Enable the engine from being initialized automatically 
> following a
>   fork operation. This is useful in a situation where you want to 
> tightly
>   control how many instances are being used for processes. For 
> instance if an
>   application forks to start a process that does not utilize QAT 
> currently
>   the default behaviour is for the engine to still automatically get 
> started
>   in the child using up an engine instance. After using this flag 
> either the
>   engine needs to be initialized manually using the engine message:
>   INIT_ENGINE or will automatically get initialized on the first QAT 
> crypto
>   operation. The initialization on fork is enabled by default.
>>>
>>> I tried to build QAT Engine with disabled auto init, but that did not help. 
>>> Now I get the following during startup:
>>>
>>> 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 
>>> Unable to initialize memory file handle /dev/usdm_drv
>>> 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 
>>> [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure
>>
>> " INIT_ENGINE or will automatically get initialized on the first QAT crypto 
>> operation"
>>
>> Perhaps the init appears "with first qat crypto operation" and is delayed 
>> after the fork so if a chroot is configured, it doesn't allow some accesses
>> to /dev. Could you perform a test in that case without chroot enabled in the 
>> haproxy config ?
> 
> Removed chroot and now it initializes properly. Unfortunately reload still 
> causes "stuck" HAProxy process :-(
> 
> Marcin Deranek

Could you check with "ls -l /proc/<pid>/fd" if the "/dev/<device>" is
open multiple times after a reload?

Emeric



Re: [External] Re: QAT intermittent healthcheck errors

2019-04-29 Thread Marcin Deranek

Hi Emeric,

On 4/29/19 3:42 PM, Emeric Brun wrote:

Hi Marcin,




I've also a contact at intel who told me to try this option on the qat engine:


--disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
  Disable/Enable the engine from being initialized automatically following a
  fork operation. This is useful in a situation where you want to tightly
  control how many instances are being used for processes. For instance if 
an
  application forks to start a process that does not utilize QAT currently
  the default behaviour is for the engine to still automatically get started
  in the child using up an engine instance. After using this flag either the
  engine needs to be initialized manually using the engine message:
  INIT_ENGINE or will automatically get initialized on the first QAT crypto
  operation. The initialization on fork is enabled by default.


I tried to build QAT Engine with disabled auto init, but that did not help. Now 
I get the following during startup:

2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 Unable to 
initialize memory file handle /dev/usdm_drv
2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 
[29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure


" INIT_ENGINE or will automatically get initialized on the first QAT crypto 
operation"

Perhaps the init appears "with first qat crypto operation" and is delayed after 
the fork so if a chroot is configured, it doesn't allow some accesses
to /dev. Could you perform a test in that case without chroot enabled in the 
haproxy config ?


Removed chroot and now it initializes properly. Unfortunately reload 
still causes "stuck" HAProxy process :-(


Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-04-29 Thread Emeric Brun
Hi Marcin,

> 
>> I've also a contact at intel who told me to try this option on the qat 
>> engine:
>>
>>> --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
>>>  Disable/Enable the engine from being initialized automatically 
>>> following a
>>>  fork operation. This is useful in a situation where you want to tightly
>>>  control how many instances are being used for processes. For instance 
>>> if an
>>>  application forks to start a process that does not utilize QAT 
>>> currently
>>>  the default behaviour is for the engine to still automatically get 
>>> started
>>>  in the child using up an engine instance. After using this flag either 
>>> the
>>>  engine needs to be initialized manually using the engine message:
>>>  INIT_ENGINE or will automatically get initialized on the first QAT 
>>> crypto
>>>  operation. The initialization on fork is enabled by default.
> 
> I tried to build QAT Engine with disabled auto init, but that did not help. 
> Now I get the following during startup:
> 
> 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 Unable 
> to initialize memory file handle /dev/usdm_drv
> 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 
> [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure

" INIT_ENGINE or will automatically get initialized on the first QAT crypto 
operation"

Perhaps the init appears "with first qat crypto operation" and is delayed after 
the fork so if a chroot is configured, it doesn't allow some accesses
to /dev. Could you perform a test in that case without chroot enabled in the 
haproxy config ?

> 
> Probably engine is not manually initialized after forking.
> Regards,
> 
> Marcin Deranek

Emeric



Re: QAT intermittent healthcheck errors

2019-04-29 Thread Marcin Deranek

Hi Emeric,

On 4/29/19 2:47 PM, Emeric Brun wrote:

Hi Marcin,

On 4/19/19 3:26 PM, Marcin Deranek wrote:

Hi Emeric,

On 4/18/19 4:35 PM, Emeric Brun wrote:

An other interesting trace would be to perform a "show sess" command on a 
stucked process through the master cli.


And also the "show fd"


Here it is:

show proc
#<PID>          <type>          <relative PID>  <reloads>       <uptime>
13409   master  0   1   0d 00h03m30s
# workers
15084   worker  1   0   0d 00h03m20s
15085   worker  2   0   0d 00h03m20s
15086   worker  3   0   0d 00h03m20s
15087   worker  4   0   0d 00h03m20s
# old workers
13415   worker  [was: 1]    1   0d 00h03m30s
13416   worker  [was: 2]    1   0d 00h03m30s
13417   worker  [was: 3]    1   0d 00h03m30s
13418   worker  [was: 4]    1   0d 00h03m30s

@!13415 show sess
0x4eee9c0: proto=sockpair ts=0a age=0s calls=1 
rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] 
s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp=

@!13415 show fd
  13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x1a74ae0 
iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0
  16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 
iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0
  20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4fe1860 
iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL 
mux=PASS mux_ctx=0x47dfd50
  87 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3ec1150 
iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
  88 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3c237d0 
iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0

@!13416 show sess
0x48f2990: proto=sockpair ts=0a age=0s calls=1 
rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] 
s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp=

@!13416 show fd
  15 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x34c1540 
iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0
  16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 
iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0
  20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4b3cff0 
iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL 
mux=PASS mux_ctx=0x4f0e510
  75 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3a6b2f0 
iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
  76 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x43a34e0 
iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0

Marcin Deranek


87,88,75,76 appears to be async engine FDs and should be cleaned. I will dig 
for that.


Thank you.


I've also a contact at intel who told me to try this option on the qat engine:


--disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
 Disable/Enable the engine from being initialized automatically following a
 fork operation. This is useful in a situation where you want to tightly
 control how many instances are being used for processes. For instance if an
 application forks to start a process that does not utilize QAT currently
 the default behaviour is for the engine to still automatically get started
 in the child using up an engine instance. After using this flag either the
 engine needs to be initialized manually using the engine message:
 INIT_ENGINE or will automatically get initialized on the first QAT crypto
 operation. The initialization on fork is enabled by default.


I tried to build QAT Engine with disabled auto init, but that did not 
help. Now I get the following during startup:


2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 
Unable to initialize memory file handle /dev/usdm_drv
2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 
[29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure


Probably engine is not manually initialized after forking.
Regards,

Marcin Deranek



Re: [External] Re: QAT intermittent healthcheck errors

2019-04-29 Thread Emeric Brun
Hi Marcin,

On 4/19/19 3:26 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 4/18/19 4:35 PM, Emeric Brun wrote:
>>> An other interesting trace would be to perform a "show sess" command on a 
>>> stucked process through the master cli.
>>
>> And also the "show fd"
> 
> Here it is:
> 
> show proc
> #<PID>          <type>          <relative PID>  <reloads>       <uptime>
> 13409   master  0   1   0d 00h03m30s
> # workers
> 15084   worker  1   0   0d 00h03m20s
> 15085   worker  2   0   0d 00h03m20s
> 15086   worker  3   0   0d 00h03m20s
> 15087   worker  4   0   0d 00h03m20s
> # old workers
> 13415   worker  [was: 1]    1   0d 00h03m30s
> 13416   worker  [was: 2]    1   0d 00h03m30s
> 13417   worker  [was: 3]    1   0d 00h03m30s
> 13418   worker  [was: 4]    1   0d 00h03m30s
> 
> @!13415 show sess
> 0x4eee9c0: proto=sockpair ts=0a age=0s calls=1 
> rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] 
> s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp=
> 
> @!13415 show fd
>  13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x1a74ae0 
> iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0
>  16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 
> iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0
>  20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4fe1860 
> iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL 
> mux=PASS mux_ctx=0x47dfd50
>  87 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3ec1150 
> iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
>  88 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3c237d0 
> iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
> 
> @!13416 show sess
> 0x48f2990: proto=sockpair ts=0a age=0s calls=1 
> rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] 
> s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp=
> 
> @!13416 show fd
>  15 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x34c1540 
> iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0
>  16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 
> iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0
>  20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4b3cff0 
> iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL 
> mux=PASS mux_ctx=0x4f0e510
>  75 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3a6b2f0 
> iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
>  76 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x43a34e0 
> iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
> 
> Marcin Deranek

87, 88, 75 and 76 appear to be async engine FDs and should have been cleaned up. I will dig
into that.

I've also a contact at intel who told me to try this option on the qat engine:

> --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
> Disable/Enable the engine from being initialized automatically following a
> fork operation. This is useful in a situation where you want to tightly
> control how many instances are being used for processes. For instance if 
> an
> application forks to start a process that does not utilize QAT currently
> the default behaviour is for the engine to still automatically get started
> in the child using up an engine instance. After using this flag either the
> engine needs to be initialized manually using the engine message:
> INIT_ENGINE or will automatically get initialized on the first QAT crypto
> operation. The initialization on fork is enabled by default.


R,
Emeric



Re: [External] Re: QAT intermittent healthcheck errors

2019-04-19 Thread Marcin Deranek

Hi Emeric,

On 4/18/19 4:35 PM, Emeric Brun wrote:

An other interesting trace would be to perform a "show sess" command on a 
stucked process through the master cli.


And also the "show fd"


Here it is:

show proc
#<PID>          <type>          <relative PID>  <reloads>       <uptime>
13409   master  0   1   0d 00h03m30s
# workers
15084   worker  1   0   0d 00h03m20s
15085   worker  2   0   0d 00h03m20s
15086   worker  3   0   0d 00h03m20s
15087   worker  4   0   0d 00h03m20s
# old workers
13415   worker  [was: 1]    1   0d 00h03m30s
13416   worker  [was: 2]    1   0d 00h03m30s
13417   worker  [was: 3]    1   0d 00h03m30s
13418   worker  [was: 4]    1   0d 00h03m30s

@!13415 show sess
0x4eee9c0: proto=sockpair ts=0a age=0s calls=1 
rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] 
rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=20,ex=] 
s1=[7,4018h,fd=-1,ex=] exp=


@!13415 show fd
 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 
owner=0x1a74ae0 iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0
 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x4e19f0 iocb=0x4e19f0(thread_sync_io_handler) 
tmask=0x umask=0x0
 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 
owner=0x4fe1860 iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 
cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x47dfd50
 87 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x3ec1150 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
 88 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x3c237d0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0


@!13416 show sess
0x48f2990: proto=sockpair ts=0a age=0s calls=1 
rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] 
rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=20,ex=] 
s1=[7,4018h,fd=-1,ex=] exp=


@!13416 show fd
 15 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 
owner=0x34c1540 iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0
 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x4e19f0 iocb=0x4e19f0(thread_sync_io_handler) 
tmask=0x umask=0x0
 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 
owner=0x4b3cff0 iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 
cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x4f0e510
 75 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x3a6b2f0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0
 76 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x43a34e0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0


Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-04-19 Thread Marcin Deranek

On 4/18/19 11:06 AM, Emeric Brun wrote:

I think you can do that this way:

Remove the option httchk (or prefix it by "no": "no option httchk " if it is 
configured into the defaults section

and add the following 2 lines:

option tcp-check
tcp-check connect

This shouldn't perform the handshake but just validate that the port is open. 
The regular traffic will continue to use the ssl
on server side.


Enabling TCP checks has the very same effect as disabling them: reload 
works just fine.


Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-04-18 Thread Emeric Brun
On 4/18/19 11:06 AM, Emeric Brun wrote:
> Hi Marcin,
> 
> On 4/12/19 6:10 PM, Marcin Deranek wrote:
>> Hi Emeric,
>>
>> On 4/12/19 5:26 PM, Emeric Brun wrote:
>>
>>> Do you have ssl enabled on the server side?
>>
>> Yes, ssl is on frontend and backend with ssl checks enabled.
>>
>>> If it is the case could replace health check with a simple tcp check 
>>> (without ssl)?
>>
>> What I noticed before that if I (re)start HAProxy and reload immediately no 
>> stuck processes are present. If I wait before reloading stuck processes show 
>> up.
>> After disabling checks (I still keep ssl enabled for normal traffic) reloads 
>> work just fine (tried many time). Do you know how to enable TCP healthchecks 
>> while keeping SSL for non-healthcheck requests ?
> 
> I think you can do that this way:
> 
> Remove the option httchk (or prefix it by "no": "no option httchk " if it is 
> configured into the defaults section
> 
> and add the following 2 lines:
> 
> option tcp-check
> tcp-check connect
> 
> This shouldn't perform the handshake but just validate that the port is open. 
> The regular traffic will continue to use the ssl
> on server side.
> 
>  
>>> Regarding the show info/lsoff  it seems there is no more sessions on client 
>>> side but remaining ssl jobs (CurrSslConns) and I supsect the health checks 
>>> to miss a cleanup of their ssl sessions using the QAT. (this is just an 
>>> assumption)
>>
>> In general instance where I test QAT does not have any "real" client traffic 
>> except small amount of healtcheck requests per frontend which are internally 
>> handled by HAProxy itself. Still TLS handshake still needs to take place. 
>> There are many more backend healthchecks. Looks like your assumption was 
>> correct..
> 
> Good!, We continue to dig in that direction.
> 
> An other interesting trace would be to perform a "show sess" command on a 
> stucked process through the master cli.

And also the "show fd" 

R,
Emeric



Re: QAT intermittent healthcheck errors

2019-04-18 Thread Emeric Brun
Hi Marcin,

On 4/12/19 6:10 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 4/12/19 5:26 PM, Emeric Brun wrote:
> 
>> Do you have ssl enabled on the server side?
> 
> Yes, ssl is on frontend and backend with ssl checks enabled.
> 
>> If it is the case could replace health check with a simple tcp check 
>> (without ssl)?
> 
> What I noticed before that if I (re)start HAProxy and reload immediately no 
> stuck processes are present. If I wait before reloading stuck processes show 
> up.
> After disabling checks (I still keep ssl enabled for normal traffic) reloads 
> work just fine (tried many time). Do you know how to enable TCP healthchecks 
> while keeping SSL for non-healthcheck requests ?

I think you can do that this way:

Remove the option httpchk (or prefix it with "no": "no option httpchk") if it is
configured in the defaults section,

and add the following 2 lines:

option tcp-check
tcp-check connect

This shouldn't perform the handshake but just validate that the port is open. 
The regular traffic will continue to use the ssl
on server side.

 
>> Regarding the show info/lsoff  it seems there is no more sessions on client 
>> side but remaining ssl jobs (CurrSslConns) and I supsect the health checks 
>> to miss a cleanup of their ssl sessions using the QAT. (this is just an 
>> assumption)
> 
> In general instance where I test QAT does not have any "real" client traffic 
> except small amount of healtcheck requests per frontend which are internally 
> handled by HAProxy itself. Still TLS handshake still needs to take place. 
> There are many more backend healthchecks. Looks like your assumption was 
> correct..

Good! We continue to dig in that direction.

Another interesting trace would be to perform a "show sess" command on a
stuck process through the master cli.

R,
Emeric
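
To make the suggested change concrete, here is a minimal configuration sketch
(backend name, server name and address are placeholders, not taken from this
thread): the health check degrades to a plain TCP connect while regular traffic
keeps using ssl on the server side.

backend be_example                        # placeholder names/addresses
    option tcp-check                      # instead of "option httpchk" (and drop any check-ssl)
    tcp-check connect                     # plain TCP connect: no TLS handshake for the check
    server srv1 192.0.2.10:443 ssl verify none check   # regular traffic still uses ssl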



Re: QAT intermittent healthcheck errors

2019-04-12 Thread Marcin Deranek

Hi Emeric,

On 4/12/19 5:26 PM, Emeric Brun wrote:


Do you have ssl enabled on the server side?


Yes, ssl is on frontend and backend with ssl checks enabled.


If it is the case could replace health check with a simple tcp check (without 
ssl)?


What I noticed before is that if I (re)start HAProxy and reload immediately,
no stuck processes are present. If I wait before reloading, stuck
processes show up.
After disabling checks (I still keep ssl enabled for normal traffic)
reloads work just fine (tried many times). Do you know how to enable TCP
healthchecks while keeping SSL for non-healthcheck requests?



Regarding the show info/lsoff  it seems there is no more sessions on client 
side but remaining ssl jobs (CurrSslConns) and I supsect the health checks to 
miss a cleanup of their ssl sessions using the QAT. (this is just an assumption)


In general, the instance where I test QAT does not have any "real" client
traffic except a small amount of healthcheck requests per frontend, which
are internally handled by HAProxy itself. Still, the TLS handshake
needs to take place. There are many more backend healthchecks. Looks
like your assumption was correct.

Regards,

Marcin Deranek


On 4/12/19 4:43 PM, Marcin Deranek wrote:

Hi Emeric,

On 4/10/19 2:20 PM, Emeric Brun wrote:


On 4/10/19 1:02 PM, Marcin Deranek wrote:

Hi Emeric,

Our process limit in the QAT configuration is quite high (128) and I was able to
run 100+ openssl processes without a problem. According to Joel from Intel, the
problem is in the cleanup code - presumably when HAProxy exits and frees up QAT
resources. I will try to see if I can get more debug information.


I've just taken a look.

Engine deinit is called:

haproxy/src/ssl_sock.c
#ifndef OPENSSL_NO_ENGINE
void ssl_free_engines(void) {
  struct ssl_engine_list *wl, *wlb;
  /* free up engine list */
  list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
  ENGINE_finish(wl->e);
  ENGINE_free(wl->e);
  LIST_DEL(&wl->list);
  free(wl);
  }
}
#endif
...
#ifndef OPENSSL_NO_ENGINE
  hap_register_post_deinit(ssl_free_engines);
#endif

I don't know how many haproxy processes you are running but if I describe the 
complete scenario of processes you may note that we reach a limit:


It's very unlikely that it's the limit, as I lowered the number of HAProxy processes
(from 10 to 4) while keeping QAT NumProcesses equal to 32. HAProxy would have a
problem with this limit while spawning new instances, not while tearing down old
ones. In such a case QAT would not be initialized for some HAProxy instances
(you would see 1 thread vs 2 threads). About threads, read below.


- the master sends a signal to older processes; those processes will unbind and 
stop accepting new conns but continue to serve remaining sessions until the end.
- new processes are started immediately; they init the engine and accept new conns.
- when no more sessions remain on an old process, it calls the engine's deinit 
function before exiting


What I noticed is that each HAProxy with QAT enabled has 2 threads (LWP) - 
looks like QAT adds an extra thread to the process itself. Would adding an extra 
thread possibly mess up the HAProxy termination sequence?
Our setup is to run HAProxy in multi-process mode - no threads (or 1 thread per 
process if you wish).
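
For reference, the relevant part of our global section looks roughly like this 
(a sketch; only the process-related settings are shown):

global
    nbproc 4
    # no nbthread setting, so each worker runs a single haproxy thread;
    # the extra LWP per process appears to come from the QAT engine itself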


I also suppose that old processes are stuck because some sessions never ended. 
Perhaps I'm wrong, but an strace on an old process could be interesting to know 
why those processes are stuck.


strace only shows these:

[pid 11392] 23:24:43.164619 epoll_wait(4,  
[pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0
[pid 11392] 23:24:43.164761 epoll_wait(4,  
[pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0
[pid 11392] 23:24:43.953286 epoll_wait(4,  
[pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0
[pid 11392] 23:24:43.953419 epoll_wait(4,  
[pid 11392] 23:24:44.010508 <... epoll_wait resumed> [], 200, 57) = 0
[pid 11392] 23:24:44.010589 epoll_wait(4,  

There are no connections: the stuck process only has a UDP socket on a random port:

[root@externallb-124 ~]# lsof -p 6307|fgrep IPv4
hapee-lb 6307 lbengine   83u IPv4 3598779351  0t0 UDP *:19573



You can also use the 'master CLI' using '-S', and you could check whether sessions 
remain on those older processes (doc is available in management.txt)


Before reload
* systemd
  Main PID: 33515 (hapee-lb)
    Memory: 1.6G
    CGroup: /system.slice/hapee-1.8-lb.service
    ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
    ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
    ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
    ├─34860 

Re: [External] Re: QAT intermittent healthcheck errors

2019-04-12 Thread Emeric Brun
Hi Marcin,

Do you have ssl enabled on the server side? If that is the case, could you replace 
the health check with a simple tcp check (without ssl)?

Regarding the show info/lsof, it seems there are no more sessions on the client 
side but remaining ssl jobs (CurrSslConns), and I suspect the health checks miss 
a cleanup of their ssl sessions using the QAT. (this is just an 
assumption) 

R,
Emeric

On 4/12/19 4:43 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 4/10/19 2:20 PM, Emeric Brun wrote:
> 
>> On 4/10/19 1:02 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> Our process limit in QAT configuration is quite high (128) and I was able 
>>> to run 100+ openssl processes without a problem. According to Joel from 
>>> Intel problem is in cleanup code - presumably when HAProxy exits and frees 
>>> up QAT resources. Will try to see if I can get more debug information.
>>
>> I've just take a look.
>>
>> Engines deinit ar called:
>>
>> haproxy/src/ssl_sock.c
>> #ifndef OPENSSL_NO_ENGINE
>> void ssl_free_engines(void) {
>>  struct ssl_engine_list *wl, *wlb;
>>  /* free up engine list */
>>  list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
>>  ENGINE_finish(wl->e);
>>  ENGINE_free(wl->e);
>>  LIST_DEL(&wl->list);
>>  free(wl);
>>  }
>> }
>> #endif
>> ...
>> #ifndef OPENSSL_NO_ENGINE
>>  hap_register_post_deinit(ssl_free_engines);
>> #endif
>>
>> I don't know how many haproxy processes you are running but if I describe 
>> the complete scenario of processes you may note that we reach a limit:
> 
> It's very unlikely it's the limit as I lowered number of HAProxy processes 
> (from 10 to 4) while keeping QAT NumProcesses equal 32. HAProxy would have 
> problem with this limit while spawning new instances and not tearing down old 
> ones. In such a case QAT would not be initialized for some HAProxy instances 
> (you would see 1 thread vs 2 thread). About threads read below.
> 
>> - the master sends a signal to older processes, those process will unbind 
>> and stop to accept new conns but continue to serve remaining sessions until 
>> the end.
>> - new processes are started and immediately and init the engine and accept 
>> newconns.
>> - When no more sessions remains on an old process, it calls the deinit 
>> function of the engine before exiting
> 
> What I noticed is that each HAProxy with QAT enabled has 2 threads (LWP) - 
> looks like QAT adds extra thread to the process itself. Would adding extra 
> thread possibly mess up HAProxy termination sequence ?
> Our setup is to run HAProxy in multi process mode - no threads (or 1 thread 
> per process if you wish).
> 
>> I'm also supposed that old processes are stucked because there is some 
>> sessions which never ended, perhaps I'm wrong but a strace on an old process
>> could be interesting to know why those processes are stucked.
> 
> strace only shows these:
> 
> [pid 11392] 23:24:43.164619 epoll_wait(4,  
> [pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0
> [pid 11392] 23:24:43.164761 epoll_wait(4,  
> [pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0
> [pid 11392] 23:24:43.953286 epoll_wait(4,  
> [pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0
> [pid 11392] 23:24:43.953419 epoll_wait(4,  
> [pid 11392] 23:24:44.010508 <... epoll_wait resumed> [], 200, 57) = 0
> [pid 11392] 23:24:44.010589 epoll_wait(4,  
> 
> There are no connections: stucked process only has UDP socket on random port:
> 
> [root@externallb-124 ~]# lsof -p 6307|fgrep IPv4
> hapee-lb 6307 lbengine   83u IPv4 3598779351  0t0 UDP *:19573
> 
> 
>> You can also use the 'master CLI' using '-S' and you could check if it 
>> remains sessions on those older processes (doc is available in 
>> management.txt)
> 
> Before reload
> * systemd
>  Main PID: 33515 (hapee-lb)
>    Memory: 1.6G
>    CGroup: /system.slice/hapee-1.8-lb.service
>    ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
>    ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
>    ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
>    ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
>    └─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> * master CLI
> show proc
> #<PID>          <type>          <relative PID>  <reloads>       <uptime>
> 33515   master  0   0   0d 00h00m31s
> # workers
> 34858   worker  1   0   0d 00h00m31s
> 34859   worker  2   0   0d 00h00m31s
> 34860   worker  3   0   0d 00h00m31s
> 

Re: [External] Re: QAT intermittent healthcheck errors

2019-04-12 Thread Marcin Deranek

Hi Emeric,

On 4/10/19 2:20 PM, Emeric Brun wrote:


On 4/10/19 1:02 PM, Marcin Deranek wrote:

Hi Emeric,

Our process limit in the QAT configuration is quite high (128) and I was able to 
run 100+ openssl processes without a problem. According to Joel from Intel, the 
problem is in the cleanup code - presumably when HAProxy exits and frees up QAT 
resources. Will try to see if I can get more debug information.


I've just taken a look.

Engine deinit is called:

haproxy/src/ssl_sock.c
#ifndef OPENSSL_NO_ENGINE
void ssl_free_engines(void) {
 struct ssl_engine_list *wl, *wlb;
 /* free up engine list */
 list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
 ENGINE_finish(wl->e);
 ENGINE_free(wl->e);
 LIST_DEL(&wl->list);
 free(wl);
 }
}
#endif
...
#ifndef OPENSSL_NO_ENGINE
 hap_register_post_deinit(ssl_free_engines);
#endif

I don't know how many haproxy processes you are running but if I describe the 
complete scenario of processes you may note that we reach a limit:


It's very unlikely it's the limit, as I lowered the number of HAProxy 
processes (from 10 to 4) while keeping QAT NumProcesses equal to 32. 
HAProxy would have a problem with this limit while spawning new instances, 
not while tearing down old ones. In such a case QAT would not be 
initialized for some HAProxy instances (you would see 1 thread vs 2 
threads). About threads, read below.



- the master sends a signal to older processes; those processes will unbind and 
stop accepting new conns but continue to serve remaining sessions until the end.
- new processes are started immediately; they init the engine and accept new conns.
- when no more sessions remain on an old process, it calls the engine's deinit 
function before exiting


What I noticed is that each HAProxy with QAT enabled has 2 threads (LWP) 
- looks like QAT adds an extra thread to the process itself. Would adding 
an extra thread possibly mess up the HAProxy termination sequence?
Our setup is to run HAProxy in multi-process mode - no threads (or 1 
thread per process if you wish).



I also suppose that old processes are stuck because some sessions never ended. 
Perhaps I'm wrong, but an strace on an old process could be interesting to know 
why those processes are stuck.


strace only shows these:

[pid 11392] 23:24:43.164619 epoll_wait(4,  
[pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0
[pid 11392] 23:24:43.164761 epoll_wait(4,  
[pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0
[pid 11392] 23:24:43.953286 epoll_wait(4,  
[pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0
[pid 11392] 23:24:43.953419 epoll_wait(4,  
[pid 11392] 23:24:44.010508 <... epoll_wait resumed> [], 200, 57) = 0
[pid 11392] 23:24:44.010589 epoll_wait(4,  

There are no connections: the stuck process only has a UDP socket on a random 
port:


[root@externallb-124 ~]# lsof -p 6307|fgrep IPv4
hapee-lb 6307 lbengine   83u IPv4 3598779351  0t0 
UDP *:19573




You can also use the 'master CLI' using '-S', and you could check whether sessions 
remain on those older processes (doc is available in management.txt)


Before reload
* systemd
 Main PID: 33515 (hapee-lb)
   Memory: 1.6G
   CGroup: /system.slice/hapee-1.8-lb.service
   ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
   ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
   ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
   ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
   └─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234

* master CLI
show proc
#<PID>          <type>          <relative PID>  <reloads>       <uptime>
33515   master  0   0   0d 00h00m31s
# workers
34858   worker  1   0   0d 00h00m31s
34859   worker  2   0   0d 00h00m31s
34860   worker  3   0   0d 00h00m31s
34861   worker  4   0   0d 00h00m31s

After reload:
* systemd
 Main PID: 33515 (hapee-lb)
   Memory: 3.1G
   CGroup: /system.slice/hapee-1.8-lb.service
   ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 
34858 34859 34860 34861 -x /run/lb_engine/process-1.sock
   ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
   ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 
/etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
   ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f 

Re: [External] Re: QAT intermittent healthcheck errors

2019-04-10 Thread Emeric Brun
Hi Marcin,

> You can also use the 'master CLI' using '-S' and you could check if it 
> remains sessions on those older processes (doc is available in management.txt)
Here the doc:

https://cbonte.github.io/haproxy-dconv/1.9/management.html#9.4

Emeric



Re: [External] Re: QAT intermittent healthcheck errors

2019-04-10 Thread Emeric Brun
Hi Marcin,

On 4/10/19 1:02 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> Our process limit in QAT configuration is quite high (128) and I was able to 
> run 100+ openssl processes without a problem. According to Joel from Intel 
> problem is in cleanup code - presumably when HAProxy exits and frees up QAT 
> resources. Will try to see if I can get more debug information.

I've just taken a look.

Engine deinit is called:

haproxy/src/ssl_sock.c
#ifndef OPENSSL_NO_ENGINE
void ssl_free_engines(void) {
struct ssl_engine_list *wl, *wlb;
/* free up engine list */
list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
ENGINE_finish(wl->e);
ENGINE_free(wl->e);
LIST_DEL(&wl->list);
free(wl);
}
}
#endif
...
#ifndef OPENSSL_NO_ENGINE
hap_register_post_deinit(ssl_free_engines);
#endif

I don't know how many haproxy processes you are running but if I describe the 
complete scenario of processes you may note that we reach a limit:

- the master sends a signal to older processes; those processes will unbind and 
stop accepting new conns but continue to serve remaining sessions until the end.
- new processes are started immediately; they init the engine and accept new conns.
- when no more sessions remain on an old process, it calls the engine's deinit 
function before exiting

So there is a time window where you have 2x the number of processes configured 
in haproxy using the engine.

I also suppose that old processes are stuck because some sessions never ended. 
Perhaps I'm wrong, but an strace on an old process could be interesting to know 
why those processes are stuck.

You can also use the 'master CLI' using '-S', and you could check whether sessions 
remain on those older processes (doc is available in management.txt)


Emeric



Re: [External] Re: QAT intermittent healthcheck errors

2019-04-10 Thread Marcin Deranek

Hi Emeric,

Our process limit in the QAT configuration is quite high (128) and I was 
able to run 100+ openssl processes without a problem. According to Joel 
from Intel, the problem is in the cleanup code - presumably when HAProxy 
exits and frees up QAT resources. Will try to see if I can get more debug 
information.

Regards,

Marcin Deranek

On 4/9/19 5:17 PM, Emeric Brun wrote:

Hi Marcin,

On 4/9/19 3:07 PM, Marcin Deranek wrote:

Hi Emeric,

I have followed all instructions and I got to the point where HAProxy starts and does the job using 
QAT (backend healthchecks work and the frontend can provide content over HTTPS). The problems start 
when HAProxy gets reloaded. With our current configuration, on reload old HAProxy processes do not 
exit, so after a reload you end up with 2 generations of HAProxy processes: before reload and after 
reload. I tried to find out under what conditions HAProxy processes get "stuck" 
and I was not able to replicate it consistently. In one case it was related to the number of backend 
servers with 'ssl' on their line, but adding 'ssl' to some other servers in another place had 
no effect. Interestingly, in some cases, for example with a simple configuration (1 frontend + 1 
backend), HAProxy produced errors on reload (see attachment) - in those cases processes rarely got 
"stuck" even though the errors were present.
/dev/qat_adf_ctl is group writable for the group HAProxy runs on. Any help to 
get this fixed / resolved would be welcome.
Regards,

Marcin Deranek


I've checked the errors.txt and all the messages were written by the engine and 
are not part of the haproxy code. I can only speculate for now, but I think we 
face a first error due to a limit on the number of processes trying to access 
the engine: the reload will double the number of processes trying to attach to 
the engine. Perhaps this issue can be bypassed by tweaking the qat configuration 
file (some advice from Intel would be welcome).

For the old stuck processes: I think the growth in processes also triggers 
errors on already attached ones in the qat engine, but currently I don't know 
how these errors are/should be raised to the application. It appears that they 
are currently not handled, and that's why processes would be stuck (sessions may 
still appear valid to haproxy, so the old process continues to wait for their 
end). We expected they would be raised by the openssl API but it appears not to 
be the case. We have to check whether we miss handling an error when polling 
events on the file descriptor used to communicate with the engine.


So we have to dig deeper, and any help from Intel's guys or QAT-aware devs will 
be appreciated.

Emeric





Re: QAT intermittent healthcheck errors

2019-04-09 Thread Emeric Brun
Hi Marcin,

On 4/9/19 3:07 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> I have followed all instructions and I got to the point where HAProxy starts 
> and does the job using QAT (backend healthchecks work and I frontend can 
> provide content over HTTPS). The problems starts when HAProxy gets reloaded. 
> With our current configuration on reload old HAProxy processes do not exit, 
> so after reload you end up with 2 generations of HAProxy processes: before 
> reload and after reload. I tried to find out what are conditions in which 
> HAProxy processes get "stuck" and I was not able to replicate it 
> consistently. In one case it was related to amount of backend servers with 
> 'ssl' on their line, but trying to add 'ssl' to some other servers in other 
> place had no effect. Interestingly in some cases for example with simple 
> configuration (1 frontend + 1 backend) HAProxy produced errors on reload (see 
> attachment) - in those cases processes rarely got "stuck" even though errors 
> were present.
> /dev/qat_adf_ctl is group writable for the group HAProxy runs on. Any help to 
> get this fixed / resolved would be welcome.
> Regards,
> 
> Marcin Deranek

I've checked the errors.txt and all the messages were written by the engine and 
are not part of the haproxy code. I can only speculate for now, but I think we 
face a first error due to a limit on the number of processes trying to access 
the engine: the reload will double the number of processes trying to attach to 
the engine. Perhaps this issue can be bypassed by tweaking the qat configuration 
file (some advice from Intel would be welcome).
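
For what it's worth, a sketch of the kind of tweak meant here, based on the [SHIM] 
section from the first mail in this thread (the value is only an example; it has 
to cover the old and new process generations that coexist during a reload):

[SHIM]
NumberCyInstances = 1
NumberDcInstances = 0
# at least 2x the number of haproxy processes, so that both generations fit
NumProcesses = 32
LimitDevAccess = 0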

For the old stuck processes: I think the growth in processes also triggers 
errors on already attached ones in the qat engine, but currently I don't know 
how these errors are/should be raised to the application. It appears that they 
are currently not handled, and that's why processes would be stuck (sessions may 
still appear valid to haproxy, so the old process continues to wait for their 
end). We expected they would be raised by the openssl API but it appears not to 
be the case. We have to check whether we miss handling an error when polling 
events on the file descriptor used to communicate with the engine.
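
To illustrate what polling that file descriptor involves, here is a minimal sketch 
of the OpenSSL async API an application using SSL_MODE_ASYNC is expected to rely 
on (not haproxy code; error handling and poller registration are reduced to 
comments):

#include <openssl/async.h>
#include <openssl/ssl.h>

/* After an SSL operation returns SSL_ERROR_WANT_ASYNC, the engine's wait
 * fds must be fetched and watched; the operation is retried once one of
 * them becomes readable. If the engine never signals the fd, the job never
 * completes and the session cannot be freed. */
static int watch_async_fds(SSL *ssl)
{
    OSSL_ASYNC_FD fds[32];
    size_t numfds = 0;

    /* first call only reports how many wait fds the engine exposes */
    if (!SSL_get_all_async_fds(ssl, NULL, &numfds))
        return -1;
    if (numfds > sizeof(fds) / sizeof(fds[0]))
        return -1;
    /* second call fills the array with the actual fds */
    if (!SSL_get_all_async_fds(ssl, fds, &numfds))
        return -1;
    /* here the fds would be registered for read events with the poller
     * (e.g. epoll), and any error event on them must also be handled */
    return (int)numfds;
}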


So we have to dig deeper, and any help from Intel's guys or QAT-aware devs will 
be appreciated.

Emeric



Re: QAT intermittent healthcheck errors

2019-04-09 Thread Marcin Deranek

Hi Emeric,

I have followed all instructions and I got to the point where HAProxy 
starts and does the job using QAT (backend healthchecks work and the 
frontend can provide content over HTTPS). The problems start when 
HAProxy gets reloaded. With our current configuration, on reload old 
HAProxy processes do not exit, so after a reload you end up with 2 
generations of HAProxy processes: before reload and after reload. I 
tried to find out under what conditions HAProxy processes get "stuck" 
and I was not able to replicate it consistently. In one case it was 
related to the number of backend servers with 'ssl' on their line, but 
adding 'ssl' to some other servers in another place had no effect. 
Interestingly, in some cases, for example with a simple configuration 
(1 frontend + 1 backend), HAProxy produced errors on reload (see 
attachment) - in those cases processes rarely got "stuck" even though 
the errors were present.
/dev/qat_adf_ctl is group writable for the group HAProxy runs on. Any 
help to get this fixed / resolved would be welcome.

Regards,

Marcin Deranek

On 3/13/19 12:04 PM, Emeric Brun wrote:

Hi Marcin,

On 3/11/19 4:27 PM, Marcin Deranek wrote:

On 3/11/19 11:51 AM, Emeric Brun wrote:


Mode async is enabled on both sides, server and frontend side.

But on server side, haproxy is using session resuming, so there is a new key 
computation (full handshake with RSA/DSA computation) only every 5 minutes 
(openssl default value).

You can force to recompute each time setting "no-ssl-reuse" on server line, but 
it will add a heavy load for ssl computation on the server.


Indeed, setting no-ssl-reuse makes use of QAT for healthchecks.
Looks like finally we are ready for QAT testing.
Thank you Emeric.
Regards,

Marcin Deranek




I've just re-checked and I think you should also enable the 'PKEY_CRYPTO' algo on 
the engine:

ssl-engine qat algo RSA,DSA,EC,DH,PKEY_CRYPTO

It will enable the offloading of the TLS1-PRF you can see there:

# /opt/booking-openssl/bin/openssl engine -c qat
(qat) Reference implementation of QAT crypto engine
  [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, 
AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF]

R,
Emeric

2019-04-09T14:22:45.523342+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #1 (61249) forked
2019-04-09T14:22:45.523368+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #2 (61250) forked
2019-04-09T14:22:45.523393+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #3 (61251) forked
2019-04-09T14:22:45.523418+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #4 (61252) forked
2019-04-09T14:22:45.523444+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #5 (61253) forked
2019-04-09T14:22:45.523469+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #6 (61255) forked
2019-04-09T14:22:45.523493+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #7 (61258) forked
2019-04-09T14:22:45.523518+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #8 (61259) forked
2019-04-09T14:22:45.523548+02:00 externallb hapee-lb[60816]: [NOTICE] 
098/142244 (60816) : New worker #9 (61261) forked
2019-04-09T14:22:45.523596+02:00 externallb hapee-lb[60816]: [error] 
cpaCyStopInstance() - : Can not get instance info
2019-04-09T14:22:45.523623+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523649+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523674+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523699+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523724+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523749+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523774+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523799+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523823+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523848+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523874+02:00 externallb hapee-lb[60816]: [error] 
SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523899+02:00 

Re: [External] Re: QAT intermittent healthcheck errors

2019-03-13 Thread Emeric Brun
Hi Marcin,

On 3/11/19 4:27 PM, Marcin Deranek wrote:
> On 3/11/19 11:51 AM, Emeric Brun wrote:
> 
>> Mode async is enabled on both sides, server and frontend side.
>>
>> But on server side, haproxy is using session resuming, so there is a new key 
>> computation (full handshake with RSA/DSA computation) only every 5 minutes 
>> (openssl default value).
>>
>> You can force to recompute each time setting "no-ssl-reuse" on server line, 
>> but it will add a heavy load for ssl computation on the server.
> 
> Indeed, setting no-ssl-reuse makes use of QAT for healthchecks.
> Looks like finally we are ready for QAT testing.
> Thank you Emeric.
> Regards,
> 
> Marcin Deranek
> 


I've just re-checked and I think you should also enable the 'PKEY_CRYPTO' algo on 
the engine:

ssl-engine qat algo RSA,DSA,EC,DH,PKEY_CRYPTO

It will enable the offloading of the TLS1-PRF you can see there:

# /opt/booking-openssl/bin/openssl engine -c qat
(qat) Reference implementation of QAT crypto engine
 [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, 
AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF]

R,
Emeric



Re: [External] Re: QAT intermittent healthcheck errors

2019-03-11 Thread Marcin Deranek

Hi Emeric,

On 3/11/19 2:48 PM, Emeric Brun wrote:


Once again, you could add the "no-ssl-reuse" statement if you want to check if 
QAT offloads the backend side, but it is clearly not an optimal option for 
production because it will generate a heavy load on your servers and force them 
to recompute keys for each connection.
I just wanted to make sure that QAT is involved in both and does what it is 
supposed to do, based on data rather than hope or trust :-))
We won't be running it with no-ssl-reuse as, for obvious reasons, we don't 
want to generate more load than necessary.

Thank you once again for your help.
Regards,

Marcin Deranek



Re: [External] Re: QAT intermittent healthcheck errors

2019-03-11 Thread Marcin Deranek

On 3/11/19 11:51 AM, Emeric Brun wrote:


Mode async is enabled on both sides, server and frontend side.

But on server side, haproxy is using session resuming, so there is a new key 
computation (full handshake with RSA/DSA computation) only every 5 minutes 
(openssl default value).

You can force to recompute each time setting "no-ssl-reuse" on server line, but 
it will add a heavy load for ssl computation on the server.


Indeed, setting no-ssl-reuse makes use of QAT for healthchecks.
Looks like finally we are ready for QAT testing.
Thank you Emeric.
Regards,

Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-03-11 Thread Emeric Brun
On 3/11/19 11:51 AM, Emeric Brun wrote:
> On 3/11/19 11:06 AM, Marcin Deranek wrote:
>> Hi Emeric,
>>
>> On 3/8/19 11:24 AM, Emeric Brun wrote:
>>> Are you sure that servers won't use ECDSA certificates? Do you check that 
>>> conn are successful forcing 'ECDHE-RSA-AES256-GCM-SHA384'
>>
>> Backend servers only support TLS 1.2 and RSA certificates.
>>
>>> Could you check algo supported by QAT doing this ?:
>>> openssl  engine -c qat
>>
>> # /opt/booking-openssl/bin/openssl engine -c qat
>> (qat) Reference implementation of QAT crypto engine
>>  [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, 
>> AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF]
>>
>>> Could you retry with this config:
>>> ssl-engine qat algo RSA,DSA,EC,DH
>>
>> Just did that and experienced the very same effect: no QAT activity for 
>> backend server healthchecks :-( When I add frontend eg.
>>
>> frontend frontend1
>>     bind 127.0.0.1:8443 ssl crt 
>> /etc/lb_engine/data/generated/ssl/10.252.24.7:443
>>     default_backend pool_all
>>
>> and make some connections/requests (TLS1.2 and/or TLS/1.3) to the frontend I 
>> see QAT activity, but *NO* activity when HAProxy is "idle" (only doing 
>> healthchecks to backend servers: TLS 1.2 only).
>> This feels like healthchecks are not passing through QAT engine for whatever 
>> reason :-( Even enabling HTTP check for the backend (option httpchk) does 
>> not make any difference.
>> The question: Is SSL Async Mode actually supported on the backend side 
>> (either healthchecks and/or normal traffic) ?
>> Regards,
> 
> Mode async is enabled on both sides, server and frontend side.
> 
> But on server side, haproxy is using session resuming, so there is a new key 
> computation (full handshake with RSA/DSA computation) only every 5 minutes 
> (openssl default value).
> 
> You can force to recompute each time setting "no-ssl-reuse" on server line, 
> but it will add a heavy load for ssl computation on the server.
> 
> 
> R,
> Emeric
> 

I've just realized that what you observe is the expected behavior: QAT offloads 
on the frontend side, and this is what we want: to offload on QAT the heavy 
load of key computing on the frontend side (the
support of async engines in haproxy was added for this reason).

On the backend side, haproxy acts as a client, re-using sessions, and even if a 
key is re-computed by the server, the cost of processing on haproxy's backend 
side is much lower compared to the frontend side; perhaps it is not even 
implemented in QAT.

Once again, you could add the "no-ssl-reuse" statement if you want to check if 
QAT offloads the backend side, but it is clearly not an optimal option for 
production because it will generate a heavy load on your servers and force them 
to recompute keys for each connection.

R,
Emeric



Re: QAT intermittent healthcheck errors

2019-03-11 Thread Emeric Brun
On 3/11/19 11:06 AM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 3/8/19 11:24 AM, Emeric Brun wrote:
>> Are you sure that servers won't use ECDSA certificates? Do you check that 
>> conn are successful forcing 'ECDHE-RSA-AES256-GCM-SHA384'
> 
> Backend servers only support TLS 1.2 and RSA certificates.
> 
>> Could you check algo supported by QAT doing this ?:
>> openssl  engine -c qat
> 
> # /opt/booking-openssl/bin/openssl engine -c qat
> (qat) Reference implementation of QAT crypto engine
>  [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, 
> AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF]
> 
>> Could you retry with this config:
>> ssl-engine qat algo RSA,DSA,EC,DH
> 
> Just did that and experienced the very same effect: no QAT activity for 
> backend server healthchecks :-( When I add frontend eg.
> 
> frontend frontend1
>     bind 127.0.0.1:8443 ssl crt 
> /etc/lb_engine/data/generated/ssl/10.252.24.7:443
>     default_backend pool_all
> 
> and make some connections/requests (TLS1.2 and/or TLS/1.3) to the frontend I 
> see QAT activity, but *NO* activity when HAProxy is "idle" (only doing 
> healthchecks to backend servers: TLS 1.2 only).
> This feels like healthchecks are not passing through QAT engine for whatever 
> reason :-( Even enabling HTTP check for the backend (option httpchk) does not 
> make any difference.
> The question: Is SSL Async Mode actually supported on the backend side 
> (either healthchecks and/or normal traffic) ?
> Regards,

Mode async is enabled on both sides, server and frontend side.

But on server side, haproxy is using session resuming, so there is a new key 
computation (full handshake with RSA/DSA computation) only every 5 minutes 
(openssl default value).

You can force it to recompute each time by setting "no-ssl-reuse" on the server 
line, but it will add a heavy ssl computation load on the server.
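
On a server line that would look roughly like this (a sketch reusing the 
placeholder server from this thread; only useful for testing, for the reason 
above):

server server1 ip1:443 check ssl no-ssl-reuse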


R,
Emeric



Re: QAT intermittent healthcheck errors

2019-03-11 Thread Marcin Deranek

Hi Emeric,

On 3/8/19 11:24 AM, Emeric Brun wrote:

Are you sure that servers won't use ECDSA certificates? Do you check that conn 
are successful forcing 'ECDHE-RSA-AES256-GCM-SHA384'


Backend servers only support TLS 1.2 and RSA certificates.


Could you check algo supported by QAT doing this ?:
openssl  engine -c qat


# /opt/booking-openssl/bin/openssl engine -c qat
(qat) Reference implementation of QAT crypto engine
 [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, 
AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF]



Could you retry with this config:
ssl-engine qat algo RSA,DSA,EC,DH


Just did that and experienced the very same effect: no QAT activity for 
backend server healthchecks :-( When I add frontend eg.


frontend frontend1
bind 127.0.0.1:8443 ssl crt 
/etc/lb_engine/data/generated/ssl/10.252.24.7:443

default_backend pool_all

and make some connections/requests (TLS1.2 and/or TLS/1.3) to the 
frontend I see QAT activity, but *NO* activity when HAProxy is "idle" 
(only doing healthchecks to backend servers: TLS 1.2 only).
This feels like healthchecks are not passing through QAT engine for 
whatever reason :-( Even enabling HTTP check for the backend (option 
httpchk) does not make any difference.
The question: Is SSL Async Mode actually supported on the backend side 
(either healthchecks and/or normal traffic) ?

Regards,

Marcin Deranek



Re: [External] Re: QAT intermittent healthcheck errors

2019-03-11 Thread Marcin Deranek

Hi Emeric,

On 3/8/19 4:43 PM, Emeric Brun wrote:


I've just realized that if your servers are TLSv1.3, ssl-default-server-ciphers 
won't force anything (see the ssl-default-server-ciphersuites documentation)


Backend servers are 'only' TLS 1.2, so it should have desired effect.
Will test suggested configuration changes and report shortly.

Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-03-08 Thread Emeric Brun
Hi Marcin,

On 3/7/19 6:43 PM, Marcin Deranek wrote:
> Hi,
> 
> On 3/6/19 6:36 PM, Emeric Brun wrote:
>> According to the documentation:
>>
>> ssl-mode-async
>>    Adds SSL_MODE_ASYNC mode to the SSL context. This enables asynchronous TLS
>>    I/O operations if asynchronous capable SSL engines are used. The current
>>    implementation supports a maximum of 32 engines. The Openssl ASYNC API
>>    doesn't support moving read/write buffers and is not compliant with
>>    haproxy's buffer management. So the asynchronous mode is disabled on
>>    read/write  operations (it is only enabled during initial and reneg
>>    handshakes).
>>
>> Asynchronous mode is disabled on the read/write operation and is only 
>> enabled during handshake.
>>
>> It means that for the ciphering process the engine will be used in blocking 
>> mode (not async) which could result to
>> unpredictable behavior on timers because the haproxy process will 
>> sporadically fully blocked waiting for the engine.
>>
>> To avoid this issue, you should ensure to use QAT only for the asymmetric 
>> computing algorithm (such as RSA DSA ECDSA).
>> and not for ciphering ones (AES and everything else ...)
> 
> I did explicitly enabled RSA algos:
> 
> ssl-engine qat algo RSA
> 
> and errors were gone at that point. Unfortunately all QAT activity too as
> 
> /sys/kernel/debug/qat_c6xx_\:0*/fw_counters
> 
> were reporting identical values (previously they were incrementing).
> 
> I did explicitly enforce RSA:
> 
> ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384

I've just realized that if your servers are TLSv1.3, ssl-default-server-ciphers 
won't force anything (see the ssl-default-server-ciphersuites documentation)
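
In other words, with TLSv1.3 servers both directives would be needed, roughly like 
this (a sketch; the TLSv1.3 ciphersuite shown is only an example value):

# applies to TLS <= 1.2 connections to the servers
ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384
# applies to TLSv1.3 connections to the servers
ssl-default-server-ciphersuites TLS_AES_256_GCM_SHA384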

R,
Emeric



Re: QAT intermittent healthcheck errors

2019-03-08 Thread Emeric Brun


Hi Marcin,

On 3/7/19 6:43 PM, Marcin Deranek wrote:
> Hi,
> 
> On 3/6/19 6:36 PM, Emeric Brun wrote:
>> According to the documentation:
>>
>> ssl-mode-async
>>    Adds SSL_MODE_ASYNC mode to the SSL context. This enables asynchronous TLS
>>    I/O operations if asynchronous capable SSL engines are used. The current
>>    implementation supports a maximum of 32 engines. The Openssl ASYNC API
>>    doesn't support moving read/write buffers and is not compliant with
>>    haproxy's buffer management. So the asynchronous mode is disabled on
>>    read/write  operations (it is only enabled during initial and reneg
>>    handshakes).
>>
>> Asynchronous mode is disabled on the read/write operation and is only 
>> enabled during handshake.
>>
>> It means that for the ciphering process the engine will be used in blocking 
>> mode (not async) which could result to
>> unpredictable behavior on timers because the haproxy process will 
>> sporadically fully blocked waiting for the engine.
>>
>> To avoid this issue, you should ensure to use QAT only for the asymmetric 
>> computing algorithm (such as RSA DSA ECDSA).
>> and not for ciphering ones (AES and everything else ...)
> 
> I did explicitly enabled RSA algos:
> 
> ssl-engine qat algo RSA
> 
> and errors were gone at that point. Unfortunately all QAT activity too as
> 
> /sys/kernel/debug/qat_c6xx_\:0*/fw_counters
> 
> were reporting identical values (previously they were incrementing).
> 
> I did explicitly enforce RSA:
> 
> ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384
> 
> but that did not help. Do I miss something ?
> Regards,
> 
> Marcin Deranek
> 

Are you sure that servers won't use ECDSA certificates? Did you check that conns 
are successful when forcing 'ECDHE-RSA-AES256-GCM-SHA384'?

Could you check algo supported by QAT doing this ?:
openssl  engine -c qat

Could you retry with this config:
ssl-engine qat algo RSA,DSA,EC,DH


R,
Emeric





Re: QAT intermittent healthcheck errors

2019-03-07 Thread Marcin Deranek

Hi,

On 3/6/19 6:36 PM, Emeric Brun wrote:

According to the documentation:

ssl-mode-async
   Adds SSL_MODE_ASYNC mode to the SSL context. This enables asynchronous TLS
   I/O operations if asynchronous capable SSL engines are used. The current
   implementation supports a maximum of 32 engines. The Openssl ASYNC API
   doesn't support moving read/write buffers and is not compliant with
   haproxy's buffer management. So the asynchronous mode is disabled on
   read/write  operations (it is only enabled during initial and reneg
   handshakes).

Asynchronous mode is disabled on the read/write operation and is only enabled 
during handshake.

It means that for the ciphering process the engine will be used in blocking 
mode (not async), which could result in unpredictable behavior on timers 
because the haproxy process will sporadically be fully blocked waiting for the 
engine.

To avoid this issue, you should ensure that QAT is used only for the asymmetric 
computing algorithms (such as RSA, DSA, ECDSA) and not for the ciphering ones 
(AES and everything else ...)


I explicitly enabled the RSA algos:

ssl-engine qat algo RSA

and the errors were gone at that point. Unfortunately, so was all QAT activity, as

/sys/kernel/debug/qat_c6xx_\:0*/fw_counters

were reporting identical values (previously they were incrementing).

I did explicitly enforce RSA:

ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384

but that did not help. Am I missing something?
Regards,

Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-03-06 Thread Marcin Deranek

Hi,

On 3/6/19 6:36 PM, Emeric Brun wrote:


To avoid this issue, you should ensure that QAT is used only for the asymmetric 
computing algorithms (such as RSA, DSA, ECDSA) and not for the ciphering ones 
(AES and everything else ...)

The ssl engine statement allow you to filter such algos:

ssl-engine <name> [algo <comma-separated list of algorithms>]


I'm pretty sure I tried this, but I will try to re-test again with eg. 
RSA specified and see if that makes any difference.
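
Concretely, something like this in the global section (a sketch; the other global 
settings stay as they are):

global
    ssl-mode-async
    ssl-engine qat algo RSA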

Regards,

Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-03-06 Thread Emeric Brun
Hi Marcin,

On 3/6/19 3:23 PM, Marcin Deranek wrote:
> Hi,
> 
> In a process of evaluating performance of Intel Quick Assist Technology in 
> conjunction with HAProxy software I acquired Intel C62x Chipset card for 
> testing. I configured QAT engine in the following manner:
> 
> * /etc/qat/c6xx_dev[012].conf
> 
> [GENERAL]
> ServicesEnabled = cy
> ConfigVersion = 2
> CyNumConcurrentSymRequests = 512
> CyNumConcurrentAsymRequests = 64
> statsGeneral = 1
> statsDh = 1
> statsDrbg = 1
> statsDsa = 1
> statsEcc = 1
> statsKeyGen = 1
> statsDc = 1
> statsLn = 1
> statsPrime = 1
> statsRsa = 1
> statsSym = 1
> KptEnabled = 0
> StorageEnabled = 0
> PkeServiceDisabled = 0
> DcIntermediateBufferSizeInKB = 64
> 
> [KERNEL]
> NumberCyInstances = 0
> NumberDcInstances = 0
> 
> [SHIM]
> NumberCyInstances = 1
> NumberDcInstances = 0
> NumProcesses = 16
> LimitDevAccess = 0
> 
> Cy0Name = "UserCY0"
> Cy0IsPolled = 1
> Cy0CoreAffinity = 0
> 
> OpenSSL produces good results without warnings / errors:
> 
> * No QAT involved
> 
> $ openssl speed -elapsed rsa2048
> You have chosen to measure elapsed time instead of user CPU time.
> Doing 2048 bits private rsa's for 10s: 10858 2048 bits private RSA's in 10.00s
> Doing 2048 bits public rsa's for 10s: 361207 2048 bits public RSA's in 10.00s
> OpenSSL 1.1.1a FIPS  20 Nov 2018
> built on: Tue Jan 22 20:43:41 2019 UTC
> options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) 
> blowfish(ptr)
> compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g -pipe 
> -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
> --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic 
> -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC 
> -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT 
> -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM 
> -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM 
> -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM 
> -DPOLY1305_ASM -DZLIB -DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" 
> -DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"
>   sign    verify    sign/s verify/s
> rsa 2048 bits 0.000921s 0.28s   1085.8  36120.7
> 
> * QAT enabled
> 
> $ openssl speed -elapsed -engine qat -async_jobs 32 rsa2048
> engine "qat" set.
> You have chosen to measure elapsed time instead of user CPU time.
> Doing 2048 bits private rsa's for 10s: 205425 2048 bits private RSA's in 
> 10.00s
> Doing 2048 bits public rsa's for 10s: 2150270 2048 bits public RSA's in 10.00s
> OpenSSL 1.1.1a FIPS  20 Nov 2018
> built on: Tue Jan 22 20:43:41 2019 UTC
> options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) 
> blowfish(ptr)
> compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g -pipe 
> -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
> --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic 
> -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC 
> -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT 
> -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM 
> -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM 
> -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM 
> -DPOLY1305_ASM -DZLIB -DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" 
> -DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"
>   sign    verify    sign/s verify/s
> rsa 2048 bits 0.49s 0.05s  20542.5 215027.0
> 
> So far so good. Unfortunately HAProxy 1.8 iwth QAT engine enabled 
> periodically fail with SSL checks of backend servers. The simplest 
> configuration I could get to reproduce it:
> 
> * /etc/haproxy/haproxy.cfg
> 
> global
>     user lbengine
>     group lbengine
>     daemon
>     ssl-mode-async
>     ssl-engine qat
>     ssl-server-verify none
>     stats   socket /run/lb_engine/process-1.sock user lbengine group 
> lbengine mode 660 level admin expose-fd listeners process 1
> 
> defaults
>     mode http
>     timeout check 5s
>     timeout connect 4s
> 
> backend pool_all
>     default-server inter 5s
> 
>     server server1 ip1:443 check ssl
>     server server2 ip2:443 check ssl
>     ...
>     server serverN ipN:443 check ssl
> 
> Without QAT enabled everything works just fine - healthchecks do not flap. 
> With QAT engine enabled random server healtchecks flap: they fail and then 
> shortly after they recover eg.
> 
> 2019-03-06T15:06:22+01:00 localhost hapee-lb[1832]: Server pool_all/server1 
> is DOWN, reason: Layer6 timeout, check duration: 4000ms. 110 active and 0 
> backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> 2019-03-06T15:06:32+01:00 localhost hapee-lb[1832]: Server pool_all/server1 
> is UP, reason: Layer6 check passed, check duration: 13ms. 117 active and 0 
> backup servers 

QAT intermittent healthcheck errors

2019-03-06 Thread Marcin Deranek

Hi,

In the process of evaluating the performance of Intel Quick Assist Technology 
in conjunction with HAProxy software, I acquired an Intel C62x Chipset card 
for testing. I configured the QAT engine in the following manner:


* /etc/qat/c6xx_dev[012].conf

[GENERAL]
ServicesEnabled = cy
ConfigVersion = 2
CyNumConcurrentSymRequests = 512
CyNumConcurrentAsymRequests = 64
statsGeneral = 1
statsDh = 1
statsDrbg = 1
statsDsa = 1
statsEcc = 1
statsKeyGen = 1
statsDc = 1
statsLn = 1
statsPrime = 1
statsRsa = 1
statsSym = 1
KptEnabled = 0
StorageEnabled = 0
PkeServiceDisabled = 0
DcIntermediateBufferSizeInKB = 64

[KERNEL]
NumberCyInstances = 0
NumberDcInstances = 0

[SHIM]
NumberCyInstances = 1
NumberDcInstances = 0
NumProcesses = 16
LimitDevAccess = 0

Cy0Name = "UserCY0"
Cy0IsPolled = 1
Cy0CoreAffinity = 0

OpenSSL produces good results without warnings / errors:

* No QAT involved

$ openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 10858 2048 bits private RSA's in 
10.00s
Doing 2048 bits public rsa's for 10s: 361207 2048 bits public RSA's in 
10.00s

OpenSSL 1.1.1a FIPS  20 Nov 2018
built on: Tue Jan 22 20:43:41 2019 UTC
options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) 
blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g 
-pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches 
-m64 -mtune=generic -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN 
-DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 
-DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m 
-DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM 
-DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM 
-DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DZLIB 
-DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" 
-DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"

  signverifysign/s verify/s
rsa 2048 bits 0.000921s 0.28s   1085.8  36120.7

* QAT enabled

$ openssl speed -elapsed -engine qat -async_jobs 32 rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 205425 2048 bits private RSA's in 
10.00s
Doing 2048 bits public rsa's for 10s: 2150270 2048 bits public RSA's in 
10.00s

OpenSSL 1.1.1a FIPS  20 Nov 2018
built on: Tue Jan 22 20:43:41 2019 UTC
options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) 
blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g 
-pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches 
-m64 -mtune=generic -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN 
-DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 
-DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m 
-DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM 
-DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM 
-DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DZLIB 
-DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" 
-DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"

  signverifysign/s verify/s
rsa 2048 bits 0.49s 0.05s  20542.5 215027.0

So far so good. Unfortunately, HAProxy 1.8 with the QAT engine enabled 
periodically fails SSL checks of backend servers. The simplest 
configuration I could find to reproduce it:


* /etc/haproxy/haproxy.cfg

global
user lbengine
group lbengine
daemon
ssl-mode-async
ssl-engine qat
ssl-server-verify none
stats   socket /run/lb_engine/process-1.sock user lbengine 
group lbengine mode 660 level admin expose-fd listeners process 1


defaults
mode http
timeout check 5s
timeout connect 4s

backend pool_all
default-server inter 5s

server server1 ip1:443 check ssl
server server2 ip2:443 check ssl
...
server serverN ipN:443 check ssl

Without QAT enabled everything works just fine - healthchecks do not 
flap. With the QAT engine enabled, random server healthchecks flap: they 
fail and then recover shortly after, eg.


2019-03-06T15:06:22+01:00 localhost hapee-lb[1832]: Server 
pool_all/server1 is DOWN, reason: Layer6 timeout, check duration: 
4000ms. 110 active and 0 backup servers left. 0 sessions active, 0 
requeued, 0 remaining in queue.
2019-03-06T15:06:32+01:00 localhost hapee-lb[1832]: Server 
pool_all/server1 is UP, reason: Layer6 check passed, check duration: 
13ms. 117 active and 0 backup servers online. 0 sessions requeued, 0 
total in queue.


Increasing the check frequency (lowering the check interval) makes the problem 
occur more frequently. Does anybody have a clue why this is happening? Has 
anybody seen such behavior?

Regards,

Marcin Deranek