Re: QAT intermittent healthcheck errors
Hi Emeric,

On 5/13/19 11:06 AM, Emeric Brun wrote:
> Just to note that I'm waiting for feedback from Intel's team, and I will
> receive QAT 1.7 compliant hardware soon to run some tests here.

Thank you for the update.
Regards,

Marcin Deranek
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin,

> Thank you Marcin. It shows that haproxy is waiting for an event on all those
> fds because crypto jobs were launched on the engine, and we can't free the
> session until the end of those jobs (it would result in a segfault).
>
> So the processes are stuck, unable to free the session because the engine
> doesn't signal the end of those jobs via the async fd.
>
> I didn't reproduce this issue on QAT 1.5, so I will try to discuss it with
> the Intel guys to find out why the behavior changed in v1.7 and what we can
> do about it.
>
> R,
> Emeric

Just to note that I'm waiting for feedback from Intel's team, and I will receive QAT 1.7 compliant hardware soon to run some tests here.

R,
Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric,

On 5/7/19 1:53 PM, Emeric Brun wrote:
> On 5/7/19 1:24 PM, Marcin Deranek wrote:
>> Hi Emeric,
>>
>> On 5/7/19 11:44 AM, Emeric Brun wrote:
>>> Hi Marcin,
>>>
>>>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for the
>>>> end result). Unfortunately, after applying the patch there is no change
>>>> in behavior: we still leak /dev/usdm_drv descriptors and have "stuck"
>>>> HAProxy instances after reload.
>>>
>>> Could you perform a test recompiling the usdm_drv and the engine with
>>> this patch? It applies on QAT 1.7, but I have no hardware to test this
>>> version here. It should fix the fd leak.
>>
>> It did fix the fd leak:
>>
>> # ls -al /proc/2565/fd | fgrep dev
>> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
>> lrwx-- 1 root root 64 May 7 13:15 7 -> /dev/usdm_drv
>>
>> # systemctl reload haproxy.service
>> # ls -al /proc/2565/fd | fgrep dev
>> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
>> lrwx-- 1 root root 64 May 7 13:15 8 -> /dev/usdm_drv
>>
>> # systemctl reload haproxy.service
>> # ls -al /proc/2565/fd | fgrep dev
>> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
>> lrwx-- 1 root root 64 May 7 13:15 9 -> /dev/usdm_drv
>>
>> But there are still stuck processes :-( This is with both patches
>> included: for QAT and HAProxy.
>> Regards,
>>
>> Marcin Deranek
>
> Thank you Marcin! Anyway, it was also a bug.
>
> Could you run a 'show fd' command on a stuck process after adding the
> patch in attachment?

I did apply this patch and all previous patches (QAT + HAProxy ssl_free_engine). This is what I got after the 1st reload:

show proc
#
8025 master 0 1 0d 00h03m25s
# workers
31269 worker 1 0 0d 00h00m39s
31270 worker 2 0 0d 00h00m39s
31271 worker 3 0 0d 00h00m39s
31272 worker 4 0 0d 00h00m39s
# old workers
9286 worker [was: 1] 1 0d 00h03m25s
9287 worker [was: 2] 1 0d 00h03m25s
9288 worker [was: 3] 1 0d 00h03m25s
9289 worker [was: 4] 1 0d 00h03m25s

@!9286 show fd
     13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x23eaae0 iocb=0x4877c0(mworker_accept_wrapper) tmask=0x1 umask=0x0
     16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e1ab0 iocb=0x4e1ab0(thread_sync_io_handler) tmask=0x umask=0x0
     20 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1601b840 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
     21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x1f0ec4f0 iocb=0x4ce6e0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x22ad8630
   1412 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bab1f30 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1413 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x247e5bc0 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1414 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x18883650 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1415 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x14476c10 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1416 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11a27850 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1418 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x12008230 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1419 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bb0a570 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1420 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c94790 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1421 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1449e050 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1422 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1f00c150 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1423 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x15f40550 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1424 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x124b6340 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1425 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11fe4500 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1426 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c70a60 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1427 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x12572540 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1428 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1249a420 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1430 : st=0x05(R:PrA
Re: [External] Re: QAT intermittent healthcheck errors
On 5/7/19 1:24 PM, Marcin Deranek wrote:
> Hi Emeric,
>
> On 5/7/19 11:44 AM, Emeric Brun wrote:
>> Hi Marcin,
>>
>>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for the
>>> end result). Unfortunately, after applying the patch there is no change
>>> in behavior: we still leak /dev/usdm_drv descriptors and have "stuck"
>>> HAProxy instances after reload.
>>> Regards,
>>
>> Could you perform a test recompiling the usdm_drv and the engine with
>> this patch? It applies on QAT 1.7, but I have no hardware to test this
>> version here. It should fix the fd leak.
>
> It did fix the fd leak:
>
> # ls -al /proc/2565/fd | fgrep dev
> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May 7 13:15 7 -> /dev/usdm_drv
>
> # systemctl reload haproxy.service
> # ls -al /proc/2565/fd | fgrep dev
> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May 7 13:15 8 -> /dev/usdm_drv
>
> # systemctl reload haproxy.service
> # ls -al /proc/2565/fd | fgrep dev
> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May 7 13:15 9 -> /dev/usdm_drv
>
> But there are still stuck processes :-( This is with both patches
> included: for QAT and HAProxy.
> Regards,
>
> Marcin Deranek

Thank you Marcin! Anyway, it was also a bug.

Could you run a 'show fd' command on a stuck process after adding the patch in attachment?

R,
Emeric

From d0e095c2aa54f020de8fc50db867eff1ef73350e Mon Sep 17 00:00:00 2001
From: Emeric Brun
Date: Fri, 19 Apr 2019 17:15:28 +0200
Subject: [PATCH] MINOR: ssl/cli: async fd io-handlers printable on show fd

This patch exports the async fd io-handlers and makes them printable
by a 'show fd' on the CLI.
---
 include/proto/ssl_sock.h | 4 ++++
 src/cli.c                | 9 +++++++++
 src/ssl_sock.c           | 4 ++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/proto/ssl_sock.h b/include/proto/ssl_sock.h
index 62ebcb87..ce52fb74 100644
--- a/include/proto/ssl_sock.h
+++ b/include/proto/ssl_sock.h
@@ -85,6 +85,10 @@ SSL_CTX *ssl_sock_get_generated_cert(unsigned int key, struct bind_conf *bind_co
 int ssl_sock_set_generated_cert(SSL_CTX *ctx, unsigned int key, struct bind_conf *bind_conf);
 unsigned int ssl_sock_generated_cert_key(const void *data, size_t len);
 
+#if (OPENSSL_VERSION_NUMBER >= 0x1010000fL) && !defined(OPENSSL_NO_ASYNC)
+void ssl_async_fd_handler(int fd);
+void ssl_async_fd_free(int fd);
+#endif
 
 /* ssl shctx macro */

diff --git a/src/cli.c b/src/cli.c
index 568ceba2..843c3d04 100644
--- a/src/cli.c
+++ b/src/cli.c
@@ -69,6 +69,9 @@
 #include
 #include
 #include
+#ifdef USE_OPENSSL
+#include
+#endif
 
 #define PAYLOAD_PATTERN "<<"

@@ -998,6 +1001,12 @@ static int cli_io_handler_show_fd(struct appctx *appctx)
 	             (fdt.iocb == listener_accept) ? "listener_accept" :
 	             (fdt.iocb == poller_pipe_io_handler) ? "poller_pipe_io_handler" :
 	             (fdt.iocb == mworker_accept_wrapper) ? "mworker_accept_wrapper" :
+#ifdef USE_OPENSSL
+#if (OPENSSL_VERSION_NUMBER >= 0x1010000fL) && !defined(OPENSSL_NO_ASYNC)
+	             (fdt.iocb == ssl_async_fd_free) ? "ssl_async_fd_free" :
+	             (fdt.iocb == ssl_async_fd_handler) ? "ssl_async_fd_handler" :
+#endif
+#endif
 	             "unknown");
 
 	if (fdt.iocb == conn_fd_handler) {

diff --git a/src/ssl_sock.c b/src/ssl_sock.c
index 112520c8..58ae8a26 100644
--- a/src/ssl_sock.c
+++ b/src/ssl_sock.c
@@ -573,7 +573,7 @@ fail_get:
 /*
  * openssl async fd handler
  */
-static void ssl_async_fd_handler(int fd)
+void ssl_async_fd_handler(int fd)
 {
 	struct connection *conn = fdtab[fd].owner;
 
@@ -594,7 +594,7 @@ static void ssl_async_fd_handler(int fd)
 /*
  * openssl async delayed SSL_free handler
  */
-static void ssl_async_fd_free(int fd)
+void ssl_async_fd_free(int fd)
 {
 	SSL *ssl = fdtab[fd].owner;
 	OSSL_ASYNC_FD all_fd[32];
-- 
2.17.1
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric,

On 5/7/19 11:44 AM, Emeric Brun wrote:
> Hi Marcin,
>
>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for the
>> end result). Unfortunately, after applying the patch there is no change
>> in behavior: we still leak /dev/usdm_drv descriptors and have "stuck"
>> HAProxy instances after reload.
>> Regards,
>
> Could you perform a test recompiling the usdm_drv and the engine with
> this patch? It applies on QAT 1.7, but I have no hardware to test this
> version here. It should fix the fd leak.

It did fix the fd leak:

# ls -al /proc/2565/fd | fgrep dev
lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May 7 13:15 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd | fgrep dev
lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May 7 13:15 8 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd | fgrep dev
lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May 7 13:15 9 -> /dev/usdm_drv

But there are still stuck processes :-( This is with both patches included: for QAT and HAProxy.
Regards,

Marcin Deranek
Re: QAT intermittent healthcheck errors
On 5/7/19 11:44 AM, Emeric Brun wrote:
> Could you perform a test recompiling the usdm_drv and the engine with
> this patch? It applies on QAT 1.7, but I have no hardware to test this
> version here. It should fix the fd leak.

Will do and report back.

Marcin Deranek
Re: QAT intermittent healthcheck errors
Hi Marcin,

>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for the
>> end result). Unfortunately, after applying the patch there is no change
>> in behavior: we still leak /dev/usdm_drv descriptors and have "stuck"
>> HAProxy instances after reload.
>> Regards,

Could you perform a test recompiling the usdm_drv and the engine with this patch? It applies on QAT 1.7, but I have no hardware to test this version here. It should fix the fd leak.

R,
Emeric

diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c
--- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c	2019-05-07 11:35:15.654202291 +0200
+++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c	2019-05-07 11:35:44.302292417 +0200
@@ -104,7 +104,7 @@
     /* standard page size */
     page_size = getpagesize();
 
-    fd = qae_open("/proc/self/pagemap", O_RDONLY);
+    fd = qae_open("/proc/self/pagemap", O_RDONLY|O_CLOEXEC);
     if (fd < 0)
     {
         return 0;
diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c
--- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c	2019-03-15 15:23:43.0 +0100
+++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c	2019-05-07 11:24:08.755921241 +0200
@@ -745,7 +745,7 @@
     if (fd > 0)
         close(fd);
 
-    fd = qae_open(QAE_MEM, O_RDWR);
+    fd = qae_open(QAE_MEM, O_RDWR|O_CLOEXEC);
     if (fd < 0)
     {
         CMD_ERROR("%s:%d Unable to initialize memory file handle %s \n",
Re: QAT intermittent healthcheck errors
Hi Marcin,

On 5/6/19 3:31 PM, Emeric Brun wrote:
> Hi Marcin,
>
> On 5/6/19 3:15 PM, Marcin Deranek wrote:
>> Hi Emeric,
>>
>> On 5/3/19 5:54 PM, Emeric Brun wrote:
>>> Hi Marcin,
>>>
>>> On 5/3/19 4:56 PM, Marcin Deranek wrote:
>>>> Hi Emeric,
>>>>
>>>> On 5/3/19 4:50 PM, Emeric Brun wrote:
>>>>> I've a testing platform here but I don't use the usdm_drv but the
>>>>> qat_contig_mem, and I don't reproduce this issue (I'm using QAT 1.5,
>>>>> as the doc says to use with my chip).
>>>>
>>>> I see. I use QAT 1.7 and QAT Engine 0.5.40.
>>>>
>>>>> Anyway, could you re-compile a haproxy binary if I provide you a
>>>>> testing patch?
>>>>
>>>> Sure, that should not be a problem.
>>>
>>> The patch in attachment.
>>
>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for the
>> end result). Unfortunately, after applying the patch there is no change
>> in behavior: we still leak /dev/usdm_drv descriptors and have "stuck"
>> HAProxy instances after reload.
>> Regards,
>
> Ok, the patch adds an ENGINE_finish() call before the reload. I assumed
> that ENGINE_finish() would close the fd, because on the application side
> there is no other way to interact with the engine.
>
> Unfortunately, this is not the case, so I will ask the Intel guys what
> we are supposed to do to close this fd.

I've just written to my contact at Intel.

Just for the record: I'm using hardware supported by QAT 1.5; in this version the usdm_drv was not present, and I use the other option, qat_contig_mem, which does not seem to cause such an fd leak. Switching to it could be a workaround if you want to continue testing while waiting for Intel's reply.

R,
Emeric
Re: QAT intermittent healthcheck errors
Hi Marcin,

On 5/6/19 3:15 PM, Marcin Deranek wrote:
> Hi Emeric,
>
> On 5/3/19 5:54 PM, Emeric Brun wrote:
>> Hi Marcin,
>>
>> On 5/3/19 4:56 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> On 5/3/19 4:50 PM, Emeric Brun wrote:
>>>> I've a testing platform here but I don't use the usdm_drv but the
>>>> qat_contig_mem, and I don't reproduce this issue (I'm using QAT 1.5,
>>>> as the doc says to use with my chip).
>>>
>>> I see. I use QAT 1.7 and QAT Engine 0.5.40.
>>>
>>>> Anyway, could you re-compile a haproxy binary if I provide you a
>>>> testing patch?
>>>
>>> Sure, that should not be a problem.
>>
>> The patch in attachment.
>
> As I use HAProxy 1.8 I had to adjust the patch (see attachment for the
> end result). Unfortunately, after applying the patch there is no change
> in behavior: we still leak /dev/usdm_drv descriptors and have "stuck"
> HAProxy instances after reload.
> Regards,

Ok, the patch adds an ENGINE_finish() call before the reload. I assumed that ENGINE_finish() would close the fd, because on the application side there is no other way to interact with the engine.

Unfortunately, this is not the case, so I will ask the Intel guys what we are supposed to do to close this fd.

R,
Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin,

On 5/3/19 4:56 PM, Marcin Deranek wrote:
> Hi Emeric,
>
> On 5/3/19 4:50 PM, Emeric Brun wrote:
>
>> I've a testing platform here but I don't use the usdm_drv but the
>> qat_contig_mem, and I don't reproduce this issue (I'm using QAT 1.5, as
>> the doc says to use with my chip).
>
> I see. I use QAT 1.7 and QAT Engine 0.5.40.
>
>> Anyway, could you re-compile a haproxy binary if I provide you a testing
>> patch?
>
> Sure, that should not be a problem.

The patch in attachment.

>> The idea is to perform a deinit in the master to force a close of those
>> '/dev's at each reload. Perhaps it won't fix our issue, but this fd leak
>> should not happen.
>
> Hope this will give us at least some more insight.
> Regards,
>
> Marcin Deranek

R,
Emeric

From ca57857a492e898759ef211a8fd9714d0f7dd7fa Mon Sep 17 00:00:00 2001
From: Emeric Brun
Date: Fri, 3 May 2019 17:06:59 +0200
Subject: [PATCH] BUG/MEDIUM: ssl: fix ssl engine's open fds are leaking.

The master didn't call the engine deinit, resulting in a leak of fds
opened by the engine during init. The workers inherit these accumulated
fds at each reload.

This patch adds a call to the engine deinit on the master just before
reloading with an exec.
---
 src/haproxy.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/haproxy.c b/src/haproxy.c
index 603f084c..f77eb1b4 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -588,6 +588,13 @@ void mworker_reload()
 	if (fdtab)
 		deinit_pollers();
 
+#if defined(USE_OPENSSL)
+#ifndef OPENSSL_NO_ENGINE
+	/* Engines may have opened fds and we must close them */
+	ssl_free_engines();
+#endif
+#endif
+
 	/* restore the initial FD limits */
 	limit.rlim_cur = rlim_fd_cur_at_boot;
 	limit.rlim_max = rlim_fd_max_at_boot;
-- 
2.17.1
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric,

It looks like on every reload the master leaks a /dev/usdm_drv descriptor:

# systemctl restart haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd | fgrep dev
lr-x-- 1 root root 64 May 3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May 3 15:40 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd | fgrep dev
lr-x-- 1 root root 64 May 3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May 3 15:40 7 -> /dev/usdm_drv
lrwx-- 1 root root 64 May 3 15:40 9 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -la /proc/$(cat haproxy.pid)/fd | fgrep dev
lr-x-- 1 root root 64 May 3 15:40 0 -> /dev/null
lrwx-- 1 root root 64 May 3 15:40 10 -> /dev/usdm_drv
lrwx-- 1 root root 64 May 3 15:40 7 -> /dev/usdm_drv
lrwx-- 1 root root 64 May 3 15:40 9 -> /dev/usdm_drv

Obviously workers inherit this from the master. Looking at workers I see the following:

* 1st gen:

# ls -al /proc/36083/fd | awk '/dev/ {print $NF}' | sort
/dev/null
/dev/null
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_dev_processes
/dev/uio19
/dev/uio3
/dev/uio35
/dev/usdm_drv

* 2nd gen:

# ls -al /proc/41637/fd | awk '/dev/ {print $NF}' | sort
/dev/null
/dev/null
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_adf_ctl
/dev/qat_dev_processes
/dev/uio23
/dev/uio39
/dev/uio7
/dev/usdm_drv
/dev/usdm_drv

Looks like only /dev/usdm_drv is leaked.

Cheers,

Marcin Deranek

On 5/3/19 2:22 PM, Emeric Brun wrote:
> Hi Marcin,
>
> On 4/29/19 6:41 PM, Marcin Deranek wrote:
>> Hi Emeric,
>>
>> On 4/29/19 3:42 PM, Emeric Brun wrote:
>>> Hi Marcin,
>>>
>>>> I've also a contact at intel who told me to try this option on the
>>>> qat engine:
>>>>
>>>>> --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
>>>>> Disable/Enable the engine from being initialized automatically
>>>>> following a fork operation. This is useful in a situation where you
>>>>> want to tightly control how many instances are being used for
>>>>> processes. For instance, if an application forks to start a process
>>>>> that does not utilize QAT, currently the default behaviour is for
>>>>> the engine to still automatically get started in the child, using
>>>>> up an engine instance. After using this flag either the engine
>>>>> needs to be initialized manually using the engine message
>>>>> INIT_ENGINE, or it will automatically get initialized on the first
>>>>> QAT crypto operation. The initialization on fork is enabled by
>>>>> default.
>>>>
>>>> I tried to build QAT Engine with disabled auto init, but that did
>>>> not help. Now I get the following during startup:
>>>>
>>>> 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753
>>>> Unable to initialize memory file handle /dev/usdm_drv
>>>> 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512
>>>> [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure
>>>
>>> "INIT_ENGINE or will automatically get initialized on the first QAT
>>> crypto operation"
>>>
>>> Perhaps the init happens "with the first QAT crypto operation" and is
>>> delayed until after the fork, so if a chroot is configured it doesn't
>>> allow some accesses to /dev. Could you perform a test in that case
>>> without chroot enabled in the haproxy config?
>>
>> Removed chroot and now it initializes properly. Unfortunately reload
>> still causes a "stuck" HAProxy process :-(
>>
>> Marcin Deranek
>
> Could you check with "ls -l /proc/<pid>/fd" if the "/dev/<device>" is
> open multiple times after a reload?
>
> Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin,

Good, so we progress!

I've a testing platform here but I don't use the usdm_drv but the qat_contig_mem, and I don't reproduce this issue (I'm using QAT 1.5, as the doc says to use with my chip).

Anyway, could you re-compile a haproxy binary if I provide you a testing patch?

The idea is to perform a deinit in the master to force a close of those '/dev's at each reload. Perhaps it won't fix our issue, but this fd leak should not happen.

R,
Emeric

On 5/3/19 4:21 PM, Marcin Deranek wrote:
> Hi Emeric,
>
> It looks like on every reload the master leaks a /dev/usdm_drv descriptor:
>
> # systemctl restart haproxy.service
> # ls -la /proc/$(cat haproxy.pid)/fd | fgrep dev
> lr-x-- 1 root root 64 May 3 15:40 0 -> /dev/null
> lrwx-- 1 root root 64 May 3 15:40 7 -> /dev/usdm_drv
>
> # systemctl reload haproxy.service
> # ls -la /proc/$(cat haproxy.pid)/fd | fgrep dev
> lr-x-- 1 root root 64 May 3 15:40 0 -> /dev/null
> lrwx-- 1 root root 64 May 3 15:40 7 -> /dev/usdm_drv
> lrwx-- 1 root root 64 May 3 15:40 9 -> /dev/usdm_drv
>
> # systemctl reload haproxy.service
> # ls -la /proc/$(cat haproxy.pid)/fd | fgrep dev
> lr-x-- 1 root root 64 May 3 15:40 0 -> /dev/null
> lrwx-- 1 root root 64 May 3 15:40 10 -> /dev/usdm_drv
> lrwx-- 1 root root 64 May 3 15:40 7 -> /dev/usdm_drv
> lrwx-- 1 root root 64 May 3 15:40 9 -> /dev/usdm_drv
>
> Obviously workers inherit this from the master. Looking at workers I see
> the following:
>
> * 1st gen:
>
> # ls -al /proc/36083/fd | awk '/dev/ {print $NF}' | sort
> /dev/null
> /dev/null
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_dev_processes
> /dev/uio19
> /dev/uio3
> /dev/uio35
> /dev/usdm_drv
>
> * 2nd gen:
>
> # ls -al /proc/41637/fd | awk '/dev/ {print $NF}' | sort
> /dev/null
> /dev/null
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_adf_ctl
> /dev/qat_dev_processes
> /dev/uio23
> /dev/uio39
> /dev/uio7
> /dev/usdm_drv
> /dev/usdm_drv
>
> Looks like only /dev/usdm_drv is leaked.
>
> Cheers,
>
> Marcin Deranek
>
> On 5/3/19 2:22 PM, Emeric Brun wrote:
>> Hi Marcin,
>>
>> On 4/29/19 6:41 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> On 4/29/19 3:42 PM, Emeric Brun wrote:
>>>> Hi Marcin,
>>>>
>>>>> I've also a contact at intel who told me to try this option on the
>>>>> qat engine:
>>>>>
>>>>>> --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork
>>>>>> Disable/Enable the engine from being initialized automatically
>>>>>> following a fork operation. This is useful in a situation where
>>>>>> you want to tightly control how many instances are being used for
>>>>>> processes. For instance, if an application forks to start a
>>>>>> process that does not utilize QAT, currently the default behaviour
>>>>>> is for the engine to still automatically get started in the child,
>>>>>> using up an engine instance. After using this flag either the
>>>>>> engine needs to be initialized manually using the engine message
>>>>>> INIT_ENGINE, or it will automatically get initialized on the first
>>>>>> QAT crypto operation. The initialization on fork is enabled by
>>>>>> default.
>>>>>
>>>>> I tried to build QAT Engine with disabled auto init, but that did
>>>>> not help. Now I get the following during startup:
>>>>>
>>>>> 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753
>>>>> Unable to initialize memory file handle /dev/usdm_drv
>>>>> 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512
>>>>> [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure
>>>>
>>>> "INIT_ENGINE or will automatically get initialized on the first QAT
>>>> crypto operation"
>>>>
>>>> Perhaps the init happens "with the first QAT crypto operation" and
>>>> is delayed until after the fork, so if a chroot is configured it
>>>> doesn't allow some accesses to /dev. Could you perform a test in
>>>> that case without chroot enabled in the haproxy config?
>>>
>>> Removed chroot and now it initializes properly. Unfortunately reload
>>> still causes a "stuck" HAProxy process :-(
>>>
>>> Marcin Deranek
>>
>> Could you check with "ls -l /proc/<pid>/fd" if the "/dev/<device>" is
>> open multiple times after a reload?
>>
>> Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin, On 4/29/19 6:41 PM, Marcin Deranek wrote: > Hi Emeric, > > On 4/29/19 3:42 PM, Emeric Brun wrote: >> Hi Marcin, >> >>> I've also a contact at intel who told me to try this option on the qat engine: > --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork > Disable/Enable the engine from being initialized automatically > following a > fork operation. This is useful in a situation where you want to > tightly > control how many instances are being used for processes. For > instance if an > application forks to start a process that does not utilize QAT > currently > the default behaviour is for the engine to still automatically get > started > in the child using up an engine instance. After using this flag > either the > engine needs to be initialized manually using the engine message: > INIT_ENGINE or will automatically get initialized on the first QAT > crypto > operation. The initialization on fork is enabled by default. >>> >>> I tried to build QAT Engine with disabled auto init, but that did not help. >>> Now I get the following during startup: >>> >>> 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 >>> Unable to initialize memory file handle /dev/usdm_drv >>> 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 >>> [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure >> >> " INIT_ENGINE or will automatically get initialized on the first QAT crypto >> operation" >> >> Perhaps the init appears "with first qat crypto operation" and is delayed >> after the fork so if a chroot is configured, it doesn't allow some accesses >> to /dev. Could you perform a test in that case without chroot enabled in the >> haproxy config ? > > Removed chroot and now it initializes properly. Unfortunately reload still > causes "stuck" HAProxy process :-( > > Marcin Deranek Could you check with "ls -l /proc//fd" if the "/dev/" is open multiple times after a reload? Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, On 4/29/19 3:42 PM, Emeric Brun wrote: Hi Marcin, I've also a contact at intel who told me to try this option on the qat engine: --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork Disable/Enable the engine from being initialized automatically following a fork operation. This is useful in a situation where you want to tightly control how many instances are being used for processes. For instance if an application forks to start a process that does not utilize QAT currently the default behaviour is for the engine to still automatically get started in the child using up an engine instance. After using this flag either the engine needs to be initialized manually using the engine message: INIT_ENGINE or will automatically get initialized on the first QAT crypto operation. The initialization on fork is enabled by default. I tried to build QAT Engine with disabled auto init, but that did not help. Now I get the following during startup: 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 Unable to initialize memory file handle /dev/usdm_drv 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure " INIT_ENGINE or will automatically get initialized on the first QAT crypto operation" Perhaps the init appears "with first qat crypto operation" and is delayed after the fork so if a chroot is configured, it doesn't allow some accesses to /dev. Could you perform a test in that case without chroot enabled in the haproxy config ? Removed chroot and now it initializes properly. Unfortunately reload still causes "stuck" HAProxy process :-( Marcin Deranek
Re: QAT intermittent healthcheck errors
Hi Marcin, > >> I've also a contact at intel who told me to try this option on the qat >> engine: >> >>> --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork >>> Disable/Enable the engine from being initialized automatically >>> following a >>> fork operation. This is useful in a situation where you want to tightly >>> control how many instances are being used for processes. For instance >>> if an >>> application forks to start a process that does not utilize QAT >>> currently >>> the default behaviour is for the engine to still automatically get >>> started >>> in the child using up an engine instance. After using this flag either >>> the >>> engine needs to be initialized manually using the engine message: >>> INIT_ENGINE or will automatically get initialized on the first QAT >>> crypto >>> operation. The initialization on fork is enabled by default. > > I tried to build QAT Engine with disabled auto init, but that did not help. > Now I get the following during startup: > > 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 Unable > to initialize memory file handle /dev/usdm_drv > 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 > [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure " INIT_ENGINE or will automatically get initialized on the first QAT crypto operation" Perhaps the init appears "with first qat crypto operation" and is delayed after the fork so if a chroot is configured, it doesn't allow some accesses to /dev. Could you perform a test in that case without chroot enabled in the haproxy config ? > > Probably engine is not manually initialized after forking. > Regards, > > Marcin Deranek Emeric
Re: QAT intermittent healthcheck errors
Hi Emeric, On 4/29/19 2:47 PM, Emeric Brun wrote: Hi Marcin, On 4/19/19 3:26 PM, Marcin Deranek wrote: Hi Emeric, On 4/18/19 4:35 PM, Emeric Brun wrote: An other interesting trace would be to perform a "show sess" command on a stucked process through the master cli. And also the "show fd" Here it is: show proc # 13409 master 0 1 0d 00h03m30s # workers 15084 worker 1 0 0d 00h03m20s 15085 worker 2 0 0d 00h03m20s 15086 worker 3 0 0d 00h03m20s 15087 worker 4 0 0d 00h03m20s # old workers 13415 worker [was: 1] 1 0d 00h03m30s 13416 worker [was: 2] 1 0d 00h03m30s 13417 worker [was: 3] 1 0d 00h03m30s 13418 worker [was: 4] 1 0d 00h03m30s @!13415 show sess 0x4eee9c0: proto=sockpair ts=0a age=0s calls=1 rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp= @!13415 show fd 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x1a74ae0 iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4fe1860 iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x47dfd50 87 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3ec1150 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 88 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3c237d0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 @!13416 show sess 0x48f2990: proto=sockpair ts=0a age=0s calls=1 rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp= @!13416 show fd 15 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x34c1540 iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) 
[lc] cache=0 owner=0x4b3cff0 iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x4f0e510 75 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3a6b2f0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 76 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x43a34e0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 Marcin Deranek 87,88,75,76 appears to be async engine FDs and should be cleaned. I will dig for that. Thank you. I've also a contact at intel who told me to try this option on the qat engine: --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork Disable/Enable the engine from being initialized automatically following a fork operation. This is useful in a situation where you want to tightly control how many instances are being used for processes. For instance if an application forks to start a process that does not utilize QAT currently the default behaviour is for the engine to still automatically get started in the child using up an engine instance. After using this flag either the engine needs to be initialized manually using the engine message: INIT_ENGINE or will automatically get initialized on the first QAT crypto operation. The initialization on fork is enabled by default. I tried to build QAT Engine with disabled auto init, but that did not help. Now I get the following during startup: 2019-04-29T15:13:47.142297+02:00 host1 hapee-lb[16604]: qaeOpenFd:753 Unable to initialize memory file handle /dev/usdm_drv 2019-04-29T15:13:47+02:00 localhost hapee-lb[16611]: 127.0.0.1:60512 [29/Apr/2019:15:13:47.139] vip1/23: SSL handshake failure Probably engine is not manually initialized after forking. Regards, Marcin Deranek
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin, On 4/19/19 3:26 PM, Marcin Deranek wrote: > Hi Emeric, > > On 4/18/19 4:35 PM, Emeric Brun wrote: >>> An other interesting trace would be to perform a "show sess" command on a >>> stucked process through the master cli. >> >> And also the "show fd" > > Here it is: > > show proc > # > 13409 master 0 1 0d 00h03m30s > # workers > 15084 worker 1 0 0d 00h03m20s > 15085 worker 2 0 0d 00h03m20s > 15086 worker 3 0 0d 00h03m20s > 15087 worker 4 0 0d 00h03m20s > # old workers > 13415 worker [was: 1] 1 0d 00h03m30s > 13416 worker [was: 2] 1 0d 00h03m30s > 13417 worker [was: 3] 1 0d 00h03m30s > 13418 worker [was: 4] 1 0d 00h03m30s > > @!13415 show sess > 0x4eee9c0: proto=sockpair ts=0a age=0s calls=1 > rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] > s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp= > > @!13415 show fd > 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x1a74ae0 > iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0 > 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 > iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0 > 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4fe1860 > iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL > mux=PASS mux_ctx=0x47dfd50 > 87 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3ec1150 > iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 > 88 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3c237d0 > iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 > > @!13416 show sess > 0x48f2990: proto=sockpair ts=0a age=0s calls=1 > rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] > s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp= > > @!13416 show fd > 15 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x34c1540 > iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0 > 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 > iocb=0x4e19f0(thread_sync_io_handler) 
tmask=0x umask=0x0 > 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4b3cff0 > iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL > mux=PASS mux_ctx=0x4f0e510 > 75 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3a6b2f0 > iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 > 76 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x43a34e0 > iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 > > Marcin Deranek 87,88,75,76 appears to be async engine FDs and should be cleaned. I will dig for that. I've also a contact at intel who told me to try this option on the qat engine: > --disable-qat_auto_engine_init_on_fork/--enable-qat_auto_engine_init_on_fork > Disable/Enable the engine from being initialized automatically following a > fork operation. This is useful in a situation where you want to tightly > control how many instances are being used for processes. For instance if > an > application forks to start a process that does not utilize QAT currently > the default behaviour is for the engine to still automatically get started > in the child using up an engine instance. After using this flag either the > engine needs to be initialized manually using the engine message: > INIT_ENGINE or will automatically get initialized on the first QAT crypto > operation. The initialization on fork is enabled by default. R, Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, On 4/18/19 4:35 PM, Emeric Brun wrote: An other interesting trace would be to perform a "show sess" command on a stucked process through the master cli. And also the "show fd" Here it is: show proc # 13409 master 0 1 0d 00h03m30s # workers 15084 worker 1 0 0d 00h03m20s 15085 worker 2 0 0d 00h03m20s 15086 worker 3 0 0d 00h03m20s 15087 worker 4 0 0d 00h03m20s # old workers 13415 worker [was: 1]1 0d 00h03m30s 13416 worker [was: 2]1 0d 00h03m30s 13417 worker [was: 3]1 0d 00h03m30s 13418 worker [was: 4]1 0d 00h03m30s @!13415 show sess 0x4eee9c0: proto=sockpair ts=0a age=0s calls=1 rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp= @!13415 show fd 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x1a74ae0 iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4fe1860 iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x47dfd50 87 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3ec1150 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 88 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3c237d0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 @!13416 show sess 0x48f2990: proto=sockpair ts=0a age=0s calls=1 rq[f=40c0c220h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=20,ex=] s1=[7,4018h,fd=-1,ex=] exp= @!13416 show fd 15 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x34c1540 iocb=0x487760(mworker_accept_wrapper) tmask=0x1 umask=0x0 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e19f0 iocb=0x4e19f0(thread_sync_io_handler) tmask=0x umask=0x0 20 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x4b3cff0 iocb=0x4ce620(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL 
mux=PASS mux_ctx=0x4f0e510 75 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x3a6b2f0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 76 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x43a34e0 iocb=0x4f5d30(unknown) tmask=0x1 umask=0x0 Marcin Deranek
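The suspicious descriptors in the `show fd` dumps above are the ones whose iocb does not resolve to a known handler (fds 87/88 and 75/76, later identified as async engine fds). A small sketch that filters a dump down to those fds, using trimmed copies of the lines above as canned input:

```shell
# Pull out fd numbers whose iocb printed as "(unknown)" in a
# `show fd` dump; sample lines are shortened copies of the dump above.
show_fd='13 : st=0x05(R:PrA W:pra) iocb=0x487760(mworker_accept_wrapper)
87 : st=0x05(R:PrA W:pra) iocb=0x4f5d30(unknown)
88 : st=0x05(R:PrA W:pra) iocb=0x4f5d30(unknown)'
printf '%s\n' "$show_fd" | awk '/\(unknown\)/ {print $1}'
```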
Re: QAT intermittent healthcheck errors
On 4/18/19 11:06 AM, Emeric Brun wrote:
> I think you can do that this way:
>
> Remove the option httpchk (or prefix it with "no": "no option httpchk") if it is configured in the defaults section, and add the following 2 lines:
>
> option tcp-check
> tcp-check connect
>
> This shouldn't perform the handshake but just validate that the port is open. The regular traffic will continue to use ssl on the server side.

Enabling TCP checks has the very same effect as disabling them: reload works just fine.

Marcin Deranek
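Put together as a config fragment, the suggestion looks roughly like the following. The backend and server names, address, and TLS options are placeholders, not taken from the actual configuration discussed here:

```haproxy
backend be_app
    # plain TCP connect check instead of an SSL-wrapped HTTP check
    no option httpchk
    option tcp-check
    tcp-check connect
    # regular traffic still negotiates TLS to the server
    server srv1 10.0.0.1:443 ssl verify none check
```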
Re: QAT intermittent healthcheck errors
On 4/18/19 11:06 AM, Emeric Brun wrote: > Hi Marcin, > > On 4/12/19 6:10 PM, Marcin Deranek wrote: >> Hi Emeric, >> >> On 4/12/19 5:26 PM, Emeric Brun wrote: >> >>> Do you have ssl enabled on the server side? >> >> Yes, ssl is on frontend and backend with ssl checks enabled. >> >>> If it is the case could replace health check with a simple tcp check >>> (without ssl)? >> >> What I noticed before that if I (re)start HAProxy and reload immediately no >> stuck processes are present. If I wait before reloading stuck processes show >> up. >> After disabling checks (I still keep ssl enabled for normal traffic) reloads >> work just fine (tried many time). Do you know how to enable TCP healthchecks >> while keeping SSL for non-healthcheck requests ? > > I think you can do that this way: > > Remove the option httchk (or prefix it by "no": "no option httchk " if it is > configured into the defaults section > > and add the following 2 lines: > > option tcp-check > tcp-check connect > > This shouldn't perform the handshake but just validate that the port is open. > The regular traffic will continue to use the ssl > on server side. > > >>> Regarding the show info/lsoff it seems there is no more sessions on client >>> side but remaining ssl jobs (CurrSslConns) and I supsect the health checks >>> to miss a cleanup of their ssl sessions using the QAT. (this is just an >>> assumption) >> >> In general instance where I test QAT does not have any "real" client traffic >> except small amount of healtcheck requests per frontend which are internally >> handled by HAProxy itself. Still TLS handshake still needs to take place. >> There are many more backend healthchecks. Looks like your assumption was >> correct.. > > Good!, We continue to dig in that direction. > > An other interesting trace would be to perform a "show sess" command on a > stucked process through the master cli. And also the "show fd" R, Emeric
Re: QAT intermittent healthcheck errors
Hi Marcin,

On 4/12/19 6:10 PM, Marcin Deranek wrote:
> Hi Emeric,
>
> On 4/12/19 5:26 PM, Emeric Brun wrote:
>
>> Do you have ssl enabled on the server side?
>
> Yes, ssl is on frontend and backend with ssl checks enabled.
>
>> If it is the case could you replace the health check with a simple tcp check
>> (without ssl)?
>
> What I noticed before is that if I (re)start HAProxy and reload immediately, no
> stuck processes are present. If I wait before reloading, stuck processes show
> up.
> After disabling checks (I still keep ssl enabled for normal traffic) reloads
> work just fine (tried many times). Do you know how to enable TCP healthchecks
> while keeping SSL for non-healthcheck requests?

I think you can do that this way:

Remove the option httpchk (or prefix it with "no": "no option httpchk") if it is configured in the defaults section, and add the following 2 lines:

option tcp-check
tcp-check connect

This shouldn't perform the handshake but just validate that the port is open. The regular traffic will continue to use ssl on the server side.

>> Regarding the show info/lsof it seems there are no more sessions on the client
>> side but remaining ssl jobs (CurrSslConns), and I suspect the health checks
>> miss a cleanup of their ssl sessions using the QAT. (this is just an
>> assumption)
>
> In general the instance where I test QAT does not have any "real" client traffic
> except a small amount of healthcheck requests per frontend, which are internally
> handled by HAProxy itself. Still, the TLS handshake needs to take place.
> There are many more backend healthchecks. Looks like your assumption was
> correct..

Good! We'll continue to dig in that direction.

Another interesting trace would be to perform a "show sess" command on a stuck process through the master cli.

R,
Emeric
Re: QAT intermittent healthcheck errors
Hi Emeric, On 4/12/19 5:26 PM, Emeric Brun wrote: Do you have ssl enabled on the server side? Yes, ssl is on frontend and backend with ssl checks enabled. If it is the case could replace health check with a simple tcp check (without ssl)? What I noticed before that if I (re)start HAProxy and reload immediately no stuck processes are present. If I wait before reloading stuck processes show up. After disabling checks (I still keep ssl enabled for normal traffic) reloads work just fine (tried many time). Do you know how to enable TCP healthchecks while keeping SSL for non-healthcheck requests ? Regarding the show info/lsoff it seems there is no more sessions on client side but remaining ssl jobs (CurrSslConns) and I supsect the health checks to miss a cleanup of their ssl sessions using the QAT. (this is just an assumption) In general instance where I test QAT does not have any "real" client traffic except small amount of healtcheck requests per frontend which are internally handled by HAProxy itself. Still TLS handshake still needs to take place. There are many more backend healthchecks. Looks like your assumption was correct.. Regards, Marcin Deranek On 4/12/19 4:43 PM, Marcin Deranek wrote: Hi Emeric, On 4/10/19 2:20 PM, Emeric Brun wrote: On 4/10/19 1:02 PM, Marcin Deranek wrote: Hi Emeric, Our process limit in QAT configuration is quite high (128) and I was able to run 100+ openssl processes without a problem. According to Joel from Intel problem is in cleanup code - presumably when HAProxy exits and frees up QAT resources. Will try to see if I can get more debug information. I've just take a look. Engines deinit ar called: haproxy/src/ssl_sock.c #ifndef OPENSSL_NO_ENGINE void ssl_free_engines(void) { struct ssl_engine_list *wl, *wlb; /* free up engine list */ list_for_each_entry_safe(wl, wlb, _engines, list) { ENGINE_finish(wl->e); ENGINE_free(wl->e); LIST_DEL(>list); free(wl); } } #endif ... 
#ifndef OPENSSL_NO_ENGINE hap_register_post_deinit(ssl_free_engines); #endif I don't know how many haproxy processes you are running but if I describe the complete scenario of processes you may note that we reach a limit: It's very unlikely it's the limit as I lowered number of HAProxy processes (from 10 to 4) while keeping QAT NumProcesses equal 32. HAProxy would have problem with this limit while spawning new instances and not tearing down old ones. In such a case QAT would not be initialized for some HAProxy instances (you would see 1 thread vs 2 thread). About threads read below. - the master sends a signal to older processes, those process will unbind and stop to accept new conns but continue to serve remaining sessions until the end. - new processes are started and immediately and init the engine and accept newconns. - When no more sessions remains on an old process, it calls the deinit function of the engine before exiting What I noticed is that each HAProxy with QAT enabled has 2 threads (LWP) - looks like QAT adds extra thread to the process itself. Would adding extra thread possibly mess up HAProxy termination sequence ? Our setup is to run HAProxy in multi process mode - no threads (or 1 thread per process if you wish). I'm also supposed that old processes are stucked because there is some sessions which never ended, perhaps I'm wrong but a strace on an old process could be interesting to know why those processes are stucked. strace only shows these: [pid 11392] 23:24:43.164619 epoll_wait(4, [pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0 [pid 11392] 23:24:43.164761 epoll_wait(4, [pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0 [pid 11392] 23:24:43.953286 epoll_wait(4, [pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0 [pid 11392] 23:24:43.953419 epoll_wait(4, [pid 11392] 23:24:44.010508 <... 
epoll_wait resumed> [], 200, 57) = 0 [pid 11392] 23:24:44.010589 epoll_wait(4, There are no connections: stucked process only has UDP socket on random port: [root@externallb-124 ~]# lsof -p 6307|fgrep IPv4 hapee-lb 6307 lbengine 83u IPv4 3598779351 0t0 UDP *:19573 You can also use the 'master CLI' using '-S' and you could check if it remains sessions on those older processes (doc is available in management.txt) Before reload * systemd Main PID: 33515 (hapee-lb) Memory: 1.6G CGroup: /system.slice/hapee-1.8-lb.service ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 ├─34860
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin, Do you have ssl enabled on the server side? If it is the case could replace health check with a simple tcp check (without ssl)? Regarding the show info/lsoff it seems there is no more sessions on client side but remaining ssl jobs (CurrSslConns) and I supsect the health checks to miss a cleanup of their ssl sessions using the QAT. (this is just an assumption) R, Emeric On 4/12/19 4:43 PM, Marcin Deranek wrote: > Hi Emeric, > > On 4/10/19 2:20 PM, Emeric Brun wrote: > >> On 4/10/19 1:02 PM, Marcin Deranek wrote: >>> Hi Emeric, >>> >>> Our process limit in QAT configuration is quite high (128) and I was able >>> to run 100+ openssl processes without a problem. According to Joel from >>> Intel problem is in cleanup code - presumably when HAProxy exits and frees >>> up QAT resources. Will try to see if I can get more debug information. >> >> I've just take a look. >> >> Engines deinit ar called: >> >> haproxy/src/ssl_sock.c >> #ifndef OPENSSL_NO_ENGINE >> void ssl_free_engines(void) { >> struct ssl_engine_list *wl, *wlb; >> /* free up engine list */ >> list_for_each_entry_safe(wl, wlb, _engines, list) { >> ENGINE_finish(wl->e); >> ENGINE_free(wl->e); >> LIST_DEL(>list); >> free(wl); >> } >> } >> #endif >> ... >> #ifndef OPENSSL_NO_ENGINE >> hap_register_post_deinit(ssl_free_engines); >> #endif >> >> I don't know how many haproxy processes you are running but if I describe >> the complete scenario of processes you may note that we reach a limit: > > It's very unlikely it's the limit as I lowered number of HAProxy processes > (from 10 to 4) while keeping QAT NumProcesses equal 32. HAProxy would have > problem with this limit while spawning new instances and not tearing down old > ones. In such a case QAT would not be initialized for some HAProxy instances > (you would see 1 thread vs 2 thread). About threads read below. 
> >> - the master sends a signal to older processes, those process will unbind >> and stop to accept new conns but continue to serve remaining sessions until >> the end. >> - new processes are started and immediately and init the engine and accept >> newconns. >> - When no more sessions remains on an old process, it calls the deinit >> function of the engine before exiting > > What I noticed is that each HAProxy with QAT enabled has 2 threads (LWP) - > looks like QAT adds extra thread to the process itself. Would adding extra > thread possibly mess up HAProxy termination sequence ? > Our setup is to run HAProxy in multi process mode - no threads (or 1 thread > per process if you wish). > >> I'm also supposed that old processes are stucked because there is some >> sessions which never ended, perhaps I'm wrong but a strace on an old process >> could be interesting to know why those processes are stucked. > > strace only shows these: > > [pid 11392] 23:24:43.164619 epoll_wait(4, > [pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0 > [pid 11392] 23:24:43.164761 epoll_wait(4, > [pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0 > [pid 11392] 23:24:43.953286 epoll_wait(4, > [pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0 > [pid 11392] 23:24:43.953419 epoll_wait(4, > [pid 11392] 23:24:44.010508 <... 
epoll_wait resumed> [], 200, 57) = 0 > [pid 11392] 23:24:44.010589 epoll_wait(4, > > There are no connections: stucked process only has UDP socket on random port: > > [root@externallb-124 ~]# lsof -p 6307|fgrep IPv4 > hapee-lb 6307 lbengine 83u IPv4 3598779351 0t0 UDP *:19573 > > >> You can also use the 'master CLI' using '-S' and you could check if it >> remains sessions on those older processes (doc is available in >> management.txt) > > Before reload > * systemd > Main PID: 33515 (hapee-lb) > Memory: 1.6G > CGroup: /system.slice/hapee-1.8-lb.service > ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f > /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 > ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f > /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 > ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f > /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 > ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f > /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 > └─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f > /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 > * master CLI > show proc > # > 33515 master 0 0 0d 00h00m31s > # workers > 34858 worker 1 0 0d 00h00m31s > 34859 worker 2 0 0d 00h00m31s > 34860 worker 3 0 0d 00h00m31s >
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, On 4/10/19 2:20 PM, Emeric Brun wrote: On 4/10/19 1:02 PM, Marcin Deranek wrote: Hi Emeric, Our process limit in QAT configuration is quite high (128) and I was able to run 100+ openssl processes without a problem. According to Joel from Intel problem is in cleanup code - presumably when HAProxy exits and frees up QAT resources. Will try to see if I can get more debug information. I've just take a look. Engines deinit ar called: haproxy/src/ssl_sock.c #ifndef OPENSSL_NO_ENGINE void ssl_free_engines(void) { struct ssl_engine_list *wl, *wlb; /* free up engine list */ list_for_each_entry_safe(wl, wlb, _engines, list) { ENGINE_finish(wl->e); ENGINE_free(wl->e); LIST_DEL(>list); free(wl); } } #endif ... #ifndef OPENSSL_NO_ENGINE hap_register_post_deinit(ssl_free_engines); #endif I don't know how many haproxy processes you are running but if I describe the complete scenario of processes you may note that we reach a limit: It's very unlikely it's the limit as I lowered number of HAProxy processes (from 10 to 4) while keeping QAT NumProcesses equal 32. HAProxy would have problem with this limit while spawning new instances and not tearing down old ones. In such a case QAT would not be initialized for some HAProxy instances (you would see 1 thread vs 2 thread). About threads read below. - the master sends a signal to older processes, those process will unbind and stop to accept new conns but continue to serve remaining sessions until the end. - new processes are started and immediately and init the engine and accept newconns. - When no more sessions remains on an old process, it calls the deinit function of the engine before exiting What I noticed is that each HAProxy with QAT enabled has 2 threads (LWP) - looks like QAT adds extra thread to the process itself. Would adding extra thread possibly mess up HAProxy termination sequence ? Our setup is to run HAProxy in multi process mode - no threads (or 1 thread per process if you wish). 
I'm also supposed that old processes are stucked because there is some sessions which never ended, perhaps I'm wrong but a strace on an old process could be interesting to know why those processes are stucked. strace only shows these: [pid 11392] 23:24:43.164619 epoll_wait(4, [pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0 [pid 11392] 23:24:43.164761 epoll_wait(4, [pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0 [pid 11392] 23:24:43.953286 epoll_wait(4, [pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0 [pid 11392] 23:24:43.953419 epoll_wait(4, [pid 11392] 23:24:44.010508 <... epoll_wait resumed> [], 200, 57) = 0 [pid 11392] 23:24:44.010589 epoll_wait(4, There are no connections: stucked process only has UDP socket on random port: [root@externallb-124 ~]# lsof -p 6307|fgrep IPv4 hapee-lb 6307 lbengine 83u IPv4 3598779351 0t0 UDP *:19573 You can also use the 'master CLI' using '-S' and you could check if it remains sessions on those older processes (doc is available in management.txt) Before reload * systemd Main PID: 33515 (hapee-lb) Memory: 1.6G CGroup: /system.slice/hapee-1.8-lb.service ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 └─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 * master CLI show proc # 33515 master 0 0 0d 00h00m31s # workers 34858 worker 1 0 0d 00h00m31s 34859 worker 2 0 0d 00h00m31s 34860 worker 3 0 0d 00h00m31s 34861 worker 4 0 0d 00h00m31s After reload: * systemd Main PID: 33515 (hapee-lb) Memory: 3.1G CGroup: 
/system.slice/hapee-1.8-lb.service
├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858 34859 34860 34861 -x /run/lb_engine/process-1.sock
├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
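[Editor's note] The inspection steps used throughout this thread (fd table, open sockets, syscall loop) can be collected into one small script. The pid argument is a placeholder for an old haproxy worker's pid; the sketch defaults to the current shell's own pid so it can be tried on any Linux box, and it only assumes the standard ls/lsof/strace tools shown in the thread:

```shell
#!/bin/sh
# Inspect what a (possibly stuck) worker still holds. PID is a
# placeholder: point it at an old haproxy worker; we default to our
# own shell's pid so the script is runnable anywhere on Linux.
PID=${1:-$$}

echo "open fds for pid $PID:"
ls -l "/proc/$PID/fd"

# Engine fds such as /dev/usdm_drv would show up here across reloads:
ls -l "/proc/$PID/fd" | grep dev || echo "no /dev fds held"

# A stuck worker loops in epoll_wait with no ready events; peek at it
# briefly if strace is available (may need root / ptrace permission):
if command -v strace >/dev/null 2>&1; then
    timeout 2 strace -p "$PID" -e trace=epoll_wait 2>&1 | head -n 5
fi
```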
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin, > You can also use the 'master CLI' using '-S' and you could check if it > remains sessions on those older processes (doc is available in management.txt) Here is the doc: https://cbonte.github.io/haproxy-dconv/1.9/management.html#9.4 Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin, On 4/10/19 1:02 PM, Marcin Deranek wrote: > Hi Emeric, > > Our process limit in QAT configuration is quite high (128) and I was able to > run 100+ openssl processes without a problem. According to Joel from Intel > problem is in cleanup code - presumably when HAProxy exits and frees up QAT > resources. Will try to see if I can get more debug information. I've just taken a look. Engine deinits are called: haproxy/src/ssl_sock.c

#ifndef OPENSSL_NO_ENGINE
void ssl_free_engines(void) {
	struct ssl_engine_list *wl, *wlb;
	/* free up engine list */
	list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
		ENGINE_finish(wl->e);
		ENGINE_free(wl->e);
		LIST_DEL(&wl->list);
		free(wl);
	}
}
#endif
...
#ifndef OPENSSL_NO_ENGINE
	hap_register_post_deinit(ssl_free_engines);
#endif

I don't know how many haproxy processes you are running, but if I describe the complete scenario of processes you may note that we reach a limit:
- the master sends a signal to older processes; those processes will unbind and stop accepting new conns, but continue to serve remaining sessions until the end.
- new processes are started, immediately init the engine and accept new conns.
- when no more sessions remain on an old process, it calls the deinit function of the engine before exiting.
So there is a time window where you have 2x the number of processes configured in haproxy using the engine. I also suppose that old processes are stuck because some sessions never ended; perhaps I'm wrong, but a strace on an old process could be interesting to know why those processes are stuck. You can also use the 'master CLI' using '-S' and you could check if any sessions remain on those older processes (doc is available in management.txt) Emeric
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, Our process limit in QAT configuration is quite high (128) and I was able to run 100+ openssl processes without a problem. According to Joel from Intel the problem is in the cleanup code - presumably when HAProxy exits and frees up QAT resources. Will try to see if I can get more debug information. Regards, Marcin Deranek On 4/9/19 5:17 PM, Emeric Brun wrote: Hi Marcin, On 4/9/19 3:07 PM, Marcin Deranek wrote: Hi Emeric, I have followed all instructions and I got to the point where HAProxy starts and does the job using QAT (backend healthchecks work and the frontend can provide content over HTTPS). The problems start when HAProxy gets reloaded. With our current configuration, on reload old HAProxy processes do not exit, so after a reload you end up with 2 generations of HAProxy processes: before reload and after reload. I tried to find out under what conditions HAProxy processes get "stuck" and I was not able to replicate it consistently. In one case it was related to the number of backend servers with 'ssl' on their line, but trying to add 'ssl' to some other servers in another place had no effect. Interestingly, in some cases, for example with a simple configuration (1 frontend + 1 backend), HAProxy produced errors on reload (see attachment) - in those cases processes rarely got "stuck" even though errors were present. /dev/qat_adf_ctl is group writable for the group HAProxy runs as. Any help to get this fixed / resolved would be welcome. Regards, Marcin Deranek I've checked the errors.txt and all the messages were written by the engine and are not part of the haproxy code. I can only do supposition for now, but I think we face a first error due to a limitation on the number of processes trying to access the engine: the reload will double the number of processes trying to attach the engine. Perhaps this issue can be bypassed by tweaking the qat configuration file (some advice from Intel would be welcome). 
For the old stuck processes: I think the growth in processes also triggers errors on already-attached ones in the qat engine, but currently I don't know how these errors are/should be raised to the application; it appears that they are currently not handled, and that's why processes would be stuck (sessions may still appear valid to haproxy, so the old process continues to wait for their end). We expected them to be raised by the openssl API, but it appears not to be the case. We have to check whether we miss handling an error when polling events on the file descriptor used to communicate with the engine. So we have to dig deeper, and any help from Intel's guys or QAT-aware devs will be appreciated. Emeric
Re: QAT intermittent healthcheck errors
Hi Marcin, On 4/9/19 3:07 PM, Marcin Deranek wrote: > Hi Emeric, > > I have followed all instructions and I got to the point where HAProxy starts > and does the job using QAT (backend healthchecks work and the frontend can > provide content over HTTPS). The problems start when HAProxy gets reloaded. > With our current configuration on reload old HAProxy processes do not exit, > so after reload you end up with 2 generations of HAProxy processes: before > reload and after reload. I tried to find out what are conditions in which > HAProxy processes get "stuck" and I was not able to replicate it > consistently. In one case it was related to amount of backend servers with > 'ssl' on their line, but trying to add 'ssl' to some other servers in other > place had no effect. Interestingly in some cases for example with simple > configuration (1 frontend + 1 backend) HAProxy produced errors on reload (see > attachment) - in those cases processes rarely got "stuck" even though errors > were present. > /dev/qat_adf_ctl is group writable for the group HAProxy runs on. Any help to > get this fixed / resolved would be welcome. > Regards, > > Marcin Deranek I've checked the errors.txt and all the messages were written by the engine and are not part of the haproxy code. I can only do supposition for now, but I think we face a first error due to a limitation on the number of processes trying to access the engine: the reload will double the number of processes trying to attach the engine. Perhaps this issue can be bypassed by tweaking the qat configuration file (some advice from Intel would be welcome). 
For the old stuck processes: I think the growth in processes also triggers errors on already-attached ones in the qat engine, but currently I don't know how these errors are/should be raised to the application; it appears that they are currently not handled, and that's why processes would be stuck (sessions may still appear valid to haproxy, so the old process continues to wait for their end). We expected them to be raised by the openssl API, but it appears not to be the case. We have to check whether we miss handling an error when polling events on the file descriptor used to communicate with the engine. So we have to dig deeper, and any help from Intel's guys or QAT-aware devs will be appreciated. Emeric
Re: QAT intermittent healthcheck errors
Hi Emeric, I have followed all instructions and I got to the point where HAProxy starts and does the job using QAT (backend healthchecks work and the frontend can provide content over HTTPS). The problems start when HAProxy gets reloaded. With our current configuration, on reload old HAProxy processes do not exit, so after a reload you end up with 2 generations of HAProxy processes: before reload and after reload. I tried to find out under what conditions HAProxy processes get "stuck" and I was not able to replicate it consistently. In one case it was related to the number of backend servers with 'ssl' on their line, but trying to add 'ssl' to some other servers in another place had no effect. Interestingly, in some cases, for example with a simple configuration (1 frontend + 1 backend), HAProxy produced errors on reload (see attachment) - in those cases processes rarely got "stuck" even though errors were present. /dev/qat_adf_ctl is group writable for the group HAProxy runs as. Any help to get this fixed / resolved would be welcome. Regards, Marcin Deranek On 3/13/19 12:04 PM, Emeric Brun wrote: Hi Marcin, On 3/11/19 4:27 PM, Marcin Deranek wrote: On 3/11/19 11:51 AM, Emeric Brun wrote: Mode async is enabled on both sides, server and frontend side. But on server side, haproxy is using session resuming, so there is a new key computation (full handshake with RSA/DSA computation) only every 5 minutes (openssl default value). You can force a recompute each time by setting "no-ssl-reuse" on the server line, but it will add a heavy load for ssl computation on the server. Indeed, setting no-ssl-reuse makes use of QAT for healthchecks. Looks like finally we are ready for QAT testing. Thank you Emeric. 
Regards, Marcin Deranek I've just re-checked and I think you should also enable the 'PKEY_CRYPTO' algo on the engine:

ssl-engine qat algo RSA,DSA,EC,DH,PKEY_CRYPTO

It will enable the offloading of the TLS1-PRF you can see there:

# /opt/booking-openssl/bin/openssl engine -c qat
(qat) Reference implementation of QAT crypto engine
[RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF]

R, Emeric

2019-04-09T14:22:45.523342+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #1 (61249) forked
2019-04-09T14:22:45.523368+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #2 (61250) forked
2019-04-09T14:22:45.523393+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #3 (61251) forked
2019-04-09T14:22:45.523418+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #4 (61252) forked
2019-04-09T14:22:45.523444+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #5 (61253) forked
2019-04-09T14:22:45.523469+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #6 (61255) forked
2019-04-09T14:22:45.523493+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #7 (61258) forked
2019-04-09T14:22:45.523518+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #8 (61259) forked
2019-04-09T14:22:45.523548+02:00 externallb hapee-lb[60816]: [NOTICE] 098/142244 (60816) : New worker #9 (61261) forked
2019-04-09T14:22:45.523596+02:00 externallb hapee-lb[60816]: [error] cpaCyStopInstance() - : Can not get instance info
2019-04-09T14:22:45.523623+02:00 externallb hapee-lb[60816]: [error] SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523649+02:00 externallb hapee-lb[60816]: [error] SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523674+02:00 externallb hapee-lb[60816]: 
[error] SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523699+02:00 externallb hapee-lb[60816]: [error] SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523724+02:00 externallb hapee-lb[60816]: [error] SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523749+02:00 externallb hapee-lb[60816]: [error] SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523774+02:00 externallb hapee-lb[60816]: [error] SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523799+02:00 externallb hapee-lb[60816]: [error] SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523823+02:00 externallb hapee-lb[60816]: [error] SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523848+02:00 externallb hapee-lb[60816]: [error] SalCtrl_ServiceEventHandler() - : Failed to get enabled services
2019-04-09T14:22:45.523874+02:00 externallb hapee-lb[60816]: [error] SalCtrl_GetEnabledServices() - : Failed to get enabled services from ADF
2019-04-09T14:22:45.523899+02:00
Re: [External] Re: QAT intermittent healthcheck errors
Hi Marcin, On 3/11/19 4:27 PM, Marcin Deranek wrote: > On 3/11/19 11:51 AM, Emeric Brun wrote: > >> Mode async is enabled on both sides, server and frontend side. >> >> But on server side, haproxy is using session resuming, so there is a new key >> computation (full handshake with RSA/DSA computation) only every 5 minutes >> (openssl default value). >> >> You can force to recompute each time setting "no-ssl-reuse" on server line, >> but it will add a heavy load for ssl computation on the server. > > Indeed, setting no-ssl-reuse makes use of QAT for healthchecks. > Looks like finally we are ready for QAT testing. > Thank you Emeric. > Regards, > > Marcin Deranek > I've just re-checked and I think you should also enable the 'PKEY_CRYPTO' algo on the engine:

ssl-engine qat algo RSA,DSA,EC,DH,PKEY_CRYPTO

It will enable the offloading of the TLS1-PRF you can see there:

# /opt/booking-openssl/bin/openssl engine -c qat
(qat) Reference implementation of QAT crypto engine
[RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF]

R, Emeric
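[Editor's note] Pulling together the engine directives discussed in this thread, a global section enabling QAT with the full suggested algo list might look like the sketch below. This is an illustrative fragment for haproxy 1.8+ built against an async-capable OpenSSL 1.1.x, not a verified production configuration:

```
global
    # enable SSL_MODE_ASYNC so handshakes can be offloaded asynchronously
    ssl-mode-async
    # offload asymmetric algos plus TLS1-PRF via PKEY_CRYPTO;
    # ciphering algos (AES...) are deliberately left to software
    ssl-engine qat algo RSA,DSA,EC,DH,PKEY_CRYPTO
```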
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, On 3/11/19 2:48 PM, Emeric Brun wrote: Once again, you could add the "no-ssl-reuse" statement if you want to check if QAT offloads the backend side, but it is clearly not an optimal option for production because it will generate a heavy load on your servers and force them to recompute keys for each connection. I just wanted to make sure that QAT is involved in both and does what it is supposed to do, based on data rather than hope or trust :-)) We won't be running it with no-ssl-reuse as, for obvious reasons, we don't want to generate more load than necessary. Thank you once again for your help. Regards, Marcin Deranek
Re: [External] Re: QAT intermittent healthcheck errors
On 3/11/19 11:51 AM, Emeric Brun wrote: Mode async is enabled on both sides, server and frontend side. But on server side, haproxy is using session resuming, so there is a new key computation (full handshake with RSA/DSA computation) only every 5 minutes (openssl default value). You can force to recompute each time setting "no-ssl-reuse" on server line, but it will add a heavy load for ssl computation on the server. Indeed, setting no-ssl-reuse makes use of QAT for healthchecks. Looks like finally we are ready for QAT testing. Thank you Emeric. Regards, Marcin Deranek
Re: QAT intermittent healthcheck errors
On 3/11/19 11:51 AM, Emeric Brun wrote: > On 3/11/19 11:06 AM, Marcin Deranek wrote: >> Hi Emeric, >> >> On 3/8/19 11:24 AM, Emeric Brun wrote: >>> Are you sure that servers won't use ECDSA certificates? Do you check that >>> conn are successful forcing 'ECDHE-RSA-AES256-GCM-SHA384' >> >> Backend servers only support TLS 1.2 and RSA certificates. >> >>> Could you check algo supported by QAT doing this ?: >>> openssl engine -c qat >> >> # /opt/booking-openssl/bin/openssl engine -c qat >> (qat) Reference implementation of QAT crypto engine >> [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, >> AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF] >> >>> Could you retry with this config: >>> ssl-engine qat algo RSA,DSA,EC,DH >> >> Just did that and experienced the very same effect: no QAT activity for >> backend server healthchecks :-( When I add frontend eg. >> >> frontend frontend1 >> bind 127.0.0.1:8443 ssl crt >> /etc/lb_engine/data/generated/ssl/10.252.24.7:443 >> default_backend pool_all >> >> and make some connections/requests (TLS1.2 and/or TLS/1.3) to the frontend I >> see QAT activity, but *NO* activity when HAProxy is "idle" (only doing >> healthchecks to backend servers: TLS 1.2 only). >> This feels like healthchecks are not passing through QAT engine for whatever >> reason :-( Even enabling HTTP check for the backend (option httpchk) does >> not make any difference. >> The question: Is SSL Async Mode actually supported on the backend side >> (either healthchecks and/or normal traffic) ? >> Regards, > > Mode async is enabled on both sides, server and frontend side. > > But on server side, haproxy is using session resuming, so there is a new key > computation (full handshake with RSA/DSA computation) only every 5 minutes > (openssl default value). > > You can force to recompute each time setting "no-ssl-reuse" on server line, > but it will add a heavy load for ssl computation on the server. 
> R,
> Emeric

I've just realized that what you observe is the expected behavior: QAT offloads on the frontend side, and this is what we want: to offload onto QAT the heavy load of key computation on the frontend side (the support of async engines in haproxy was added for this reason). On the backend side, haproxy acts as a client, re-using sessions, and even if a key is re-computed by the server, the cost of processing on haproxy's backend side is much lower compared to the frontend side; perhaps it is not even implemented in QAT. Once again, you could add the "no-ssl-reuse" statement if you want to check whether QAT offloads the backend side, but it is clearly not an optimal option for production because it will generate a heavy load on your servers and force them to recompute keys for each connection. R, Emeric
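[Editor's note] The verification trick Emeric describes is a one-line change on the server line. A hedged sketch (server name and address are placeholders from the earlier test config):

```
backend pool_all
    default-server inter 5s
    # no-ssl-reuse disables session resumption, so every health check
    # performs a full handshake (and thus an RSA key computation).
    # Useful only to confirm QAT offload; too expensive for production.
    server server1 ip1:443 check ssl no-ssl-reuse
```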
Re: QAT intermittent healthcheck errors
On 3/11/19 11:06 AM, Marcin Deranek wrote: > Hi Emeric, > > On 3/8/19 11:24 AM, Emeric Brun wrote: >> Are you sure that servers won't use ECDSA certificates? Do you check that >> conn are successful forcing 'ECDHE-RSA-AES256-GCM-SHA384' > > Backend servers only support TLS 1.2 and RSA certificates. > >> Could you check algo supported by QAT doing this ?: >> openssl engine -c qat > > # /opt/booking-openssl/bin/openssl engine -c qat > (qat) Reference implementation of QAT crypto engine > [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, > AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF] > >> Could you retry with this config: >> ssl-engine qat algo RSA,DSA,EC,DH > > Just did that and experienced the very same effect: no QAT activity for > backend server healthchecks :-( When I add frontend eg. > > frontend frontend1 > bind 127.0.0.1:8443 ssl crt > /etc/lb_engine/data/generated/ssl/10.252.24.7:443 > default_backend pool_all > > and make some connections/requests (TLS1.2 and/or TLS/1.3) to the frontend I > see QAT activity, but *NO* activity when HAProxy is "idle" (only doing > healthchecks to backend servers: TLS 1.2 only). > This feels like healthchecks are not passing through QAT engine for whatever > reason :-( Even enabling HTTP check for the backend (option httpchk) does not > make any difference. > The question: Is SSL Async Mode actually supported on the backend side > (either healthchecks and/or normal traffic) ? > Regards, Mode async is enabled on both sides, server and frontend side. But on server side, haproxy is using session resuming, so there is a new key computation (full handshake with RSA/DSA computation) only every 5 minutes (openssl default value). You can force to recompute each time setting "no-ssl-reuse" on server line, but it will add a heavy load for ssl computation on the server. R, Emeric
Re: QAT intermittent healthcheck errors
Hi Emeric, On 3/8/19 11:24 AM, Emeric Brun wrote: Are you sure that servers won't use ECDSA certificates? Do you check that conn are successful forcing 'ECDHE-RSA-AES256-GCM-SHA384' Backend servers only support TLS 1.2 and RSA certificates. Could you check algo supported by QAT doing this ?: openssl engine -c qat # /opt/booking-openssl/bin/openssl engine -c qat (qat) Reference implementation of QAT crypto engine [RSA, DSA, DH, AES-128-CBC-HMAC-SHA1, AES-128-CBC-HMAC-SHA256, AES-256-CBC-HMAC-SHA1, AES-256-CBC-HMAC-SHA256, TLS1-PRF] Could you retry with this config: ssl-engine qat algo RSA,DSA,EC,DH Just did that and experienced the very same effect: no QAT activity for backend server healthchecks :-( When I add frontend eg. frontend frontend1 bind 127.0.0.1:8443 ssl crt /etc/lb_engine/data/generated/ssl/10.252.24.7:443 default_backend pool_all and make some connections/requests (TLS1.2 and/or TLS/1.3) to the frontend I see QAT activity, but *NO* activity when HAProxy is "idle" (only doing healthchecks to backend servers: TLS 1.2 only). This feels like healthchecks are not passing through QAT engine for whatever reason :-( Even enabling HTTP check for the backend (option httpchk) does not make any difference. The question: Is SSL Async Mode actually supported on the backend side (either healthchecks and/or normal traffic) ? Regards, Marcin Deranek
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, On 3/8/19 4:43 PM, Emeric Brun wrote: I've just realized that if your servers are TLSv1.3, ssl-default-server-ciphers won't force anything (see the ssl-default-server-ciphersuites documentation) Backend servers are 'only' TLS 1.2, so it should have the desired effect. Will test the suggested configuration changes and report shortly. Marcin Deranek
Re: QAT intermittent healthcheck errors
Hi Marcin, On 3/7/19 6:43 PM, Marcin Deranek wrote: > Hi, > > On 3/6/19 6:36 PM, Emeric Brun wrote: >> According to the documentation: >> >> ssl-mode-async >> Adds SSL_MODE_ASYNC mode to the SSL context. This enables asynchronous TLS >> I/O operations if asynchronous capable SSL engines are used. The current >> implementation supports a maximum of 32 engines. The Openssl ASYNC API >> doesn't support moving read/write buffers and is not compliant with >> haproxy's buffer management. So the asynchronous mode is disabled on >> read/write operations (it is only enabled during initial and reneg >> handshakes). >> >> Asynchronous mode is disabled on the read/write operation and is only >> enabled during handshake. >> >> It means that for the ciphering process the engine will be used in blocking >> mode (not async) which could result to >> unpredictable behavior on timers because the haproxy process will >> sporadically fully blocked waiting for the engine. >> >> To avoid this issue, you should ensure to use QAT only for the asymmetric >> computing algorithm (such as RSA DSA ECDSA). >> and not for ciphering ones (AES and everything else ...) > > I did explicitly enabled RSA algos: > > ssl-engine qat algo RSA > > and errors were gone at that point. Unfortunately all QAT activity too as > > /sys/kernel/debug/qat_c6xx_\:0*/fw_counters > > were reporting identical values (previously they were incrementing). > > I did explicitly enforce RSA: > > ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384 I've just realized that if your server are TLSv1.3 ssl-default-server-ciphers won't force anything (see ssl-default-server-ciphersuites documentation) R, Emeric
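[Editor's note] The distinction Emeric points at can be made concrete: haproxy uses separate directives for TLS <= 1.2 cipher lists and TLS 1.3 ciphersuites. A sketch (the TLSv1.3 suite name is illustrative):

```
global
    # applies to TLSv1.2 and below only
    ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384
    # TLSv1.3 negotiates from this separate list; without it the
    # directive above does not constrain TLSv1.3 connections
    ssl-default-server-ciphersuites TLS_AES_256_GCM_SHA384
```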
Re: QAT intermittent healthcheck errors
Hi Marcin, On 3/7/19 6:43 PM, Marcin Deranek wrote: > Hi, > > On 3/6/19 6:36 PM, Emeric Brun wrote: >> According to the documentation: >> >> ssl-mode-async >> Adds SSL_MODE_ASYNC mode to the SSL context. This enables asynchronous TLS >> I/O operations if asynchronous capable SSL engines are used. The current >> implementation supports a maximum of 32 engines. The Openssl ASYNC API >> doesn't support moving read/write buffers and is not compliant with >> haproxy's buffer management. So the asynchronous mode is disabled on >> read/write operations (it is only enabled during initial and reneg >> handshakes). >> >> Asynchronous mode is disabled on the read/write operation and is only >> enabled during handshake. >> >> It means that for the ciphering process the engine will be used in blocking >> mode (not async) which could result to >> unpredictable behavior on timers because the haproxy process will >> sporadically fully blocked waiting for the engine. >> >> To avoid this issue, you should ensure to use QAT only for the asymmetric >> computing algorithm (such as RSA DSA ECDSA). >> and not for ciphering ones (AES and everything else ...) > > I did explicitly enabled RSA algos: > > ssl-engine qat algo RSA > > and errors were gone at that point. Unfortunately all QAT activity too as > > /sys/kernel/debug/qat_c6xx_\:0*/fw_counters > > were reporting identical values (previously they were incrementing). > > I did explicitly enforce RSA: > > ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384 > > but that did not help. Do I miss something ? > Regards, > > Marcin Deranek > Are you sure that servers won't use ECDSA certificates? Do you check that conn are successful forcing 'ECDHE-RSA-AES256-GCM-SHA384' Could you check algo supported by QAT doing this ?: openssl engine -c qat Could you retry with this config: ssl-engine qat algo RSA,DSA,EC,DH R, Emeric
Re: QAT intermittent healthcheck errors
Hi, On 3/6/19 6:36 PM, Emeric Brun wrote: According to the documentation:

ssl-mode-async
Adds SSL_MODE_ASYNC mode to the SSL context. This enables asynchronous TLS I/O operations if asynchronous capable SSL engines are used. The current implementation supports a maximum of 32 engines. The Openssl ASYNC API doesn't support moving read/write buffers and is not compliant with haproxy's buffer management. So the asynchronous mode is disabled on read/write operations (it is only enabled during initial and reneg handshakes).

Asynchronous mode is disabled on the read/write operation and is only enabled during handshake. It means that for the ciphering process the engine will be used in blocking mode (not async), which could result in unpredictable behavior on timers because the haproxy process will sporadically be fully blocked waiting for the engine. To avoid this issue, you should ensure to use QAT only for the asymmetric computing algorithms (such as RSA DSA ECDSA) and not for ciphering ones (AES and everything else ...) I did explicitly enable RSA algos:

ssl-engine qat algo RSA

and the errors were gone at that point. Unfortunately, so was all QAT activity, as

/sys/kernel/debug/qat_c6xx_\:0*/fw_counters

reported identical values (previously they were incrementing). I did explicitly enforce RSA:

ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA384

but that did not help. Am I missing something? Regards, Marcin Deranek
Re: QAT intermittent healthcheck errors
Hi, On 3/6/19 6:36 PM, Emeric Brun wrote: To avoid this issue, you should ensure to use QAT only for the asymmetric computing algorithm (such as RSA DSA ECDSA). and not for ciphering ones (AES and everything else ...) The ssl engine statement allow you to filter such algos: ssl-engine [algo ] I'm pretty sure I tried this, but I will try to re-test again with eg. RSA specified and see if that makes any difference. Regards, Marcin Deranek
Re: QAT intermittent healthcheck errors
Hi Marcin, On 3/6/19 3:23 PM, Marcin Deranek wrote:
> Hi,
>
> In a process of evaluating performance of Intel Quick Assist Technology in
> conjunction with HAProxy software I acquired Intel C62x Chipset card for
> testing. I configured QAT engine in the following manner:
>
> * /etc/qat/c6xx_dev[012].conf
>
> [GENERAL]
> ServicesEnabled = cy
> ConfigVersion = 2
> CyNumConcurrentSymRequests = 512
> CyNumConcurrentAsymRequests = 64
> statsGeneral = 1
> statsDh = 1
> statsDrbg = 1
> statsDsa = 1
> statsEcc = 1
> statsKeyGen = 1
> statsDc = 1
> statsLn = 1
> statsPrime = 1
> statsRsa = 1
> statsSym = 1
> KptEnabled = 0
> StorageEnabled = 0
> PkeServiceDisabled = 0
> DcIntermediateBufferSizeInKB = 64
>
> [KERNEL]
> NumberCyInstances = 0
> NumberDcInstances = 0
>
> [SHIM]
> NumberCyInstances = 1
> NumberDcInstances = 0
> NumProcesses = 16
> LimitDevAccess = 0
>
> Cy0Name = "UserCY0"
> Cy0IsPolled = 1
> Cy0CoreAffinity = 0
>
> OpenSSL produces good results without warnings / errors:
>
> * No QAT involved
>
> $ openssl speed -elapsed rsa2048
> You have chosen to measure elapsed time instead of user CPU time. 
> Doing 2048 bits private rsa's for 10s: 10858 2048 bits private RSA's in 10.00s
> Doing 2048 bits public rsa's for 10s: 361207 2048 bits public RSA's in 10.00s
> OpenSSL 1.1.1a FIPS 20 Nov 2018
> built on: Tue Jan 22 20:43:41 2019 UTC
> options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr)
> compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DZLIB -DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" -DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"
>                     sign     verify     sign/s  verify/s
> rsa 2048 bits  0.000921s  0.000028s    1085.8   36120.7
>
> * QAT enabled
>
> $ openssl speed -elapsed -engine qat -async_jobs 32 rsa2048
> engine "qat" set.
> You have chosen to measure elapsed time instead of user CPU time. 
> Doing 2048 bits private rsa's for 10s: 205425 2048 bits private RSA's in 10.00s
> Doing 2048 bits public rsa's for 10s: 2150270 2048 bits public RSA's in 10.00s
> OpenSSL 1.1.1a FIPS 20 Nov 2018
> built on: Tue Jan 22 20:43:41 2019 UTC
> options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr)
> compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DZLIB -DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" -DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"
>                     sign     verify     sign/s  verify/s
> rsa 2048 bits  0.000049s  0.000005s   20542.5  215027.0
>
> So far so good. Unfortunately HAProxy 1.8 with the QAT engine enabled periodically fails SSL checks of backend servers. The simplest configuration I could get to reproduce it:
>
> * /etc/haproxy/haproxy.cfg
>
> global
>     user lbengine
>     group lbengine
>     daemon
>     ssl-mode-async
>     ssl-engine qat
>     ssl-server-verify none
>     stats socket /run/lb_engine/process-1.sock user lbengine group lbengine mode 660 level admin expose-fd listeners process 1
>
> defaults
>     mode http
>     timeout check 5s
>     timeout connect 4s
>
> backend pool_all
>     default-server inter 5s
>     server server1 ip1:443 check ssl
>     server server2 ip2:443 check ssl
>     ...
>     server serverN ipN:443 check ssl
>
> Without QAT enabled everything works just fine - healthchecks do not flap. With the QAT engine enabled, random server healthchecks flap: they fail and then shortly after they recover eg. 
> > 2019-03-06T15:06:22+01:00 localhost hapee-lb[1832]: Server pool_all/server1 > is DOWN, reason: Layer6 timeout, check duration: 4000ms. 110 active and 0 > backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue. > 2019-03-06T15:06:32+01:00 localhost hapee-lb[1832]: Server pool_all/server1 > is UP, reason: Layer6 check passed, check duration: 13ms. 117 active and 0 > backup servers
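[Editor's note] The speed counts quoted earlier in this message imply roughly a 19x sign and 6x verify speedup from QAT; a quick back-of-the-envelope check (operation counts copied verbatim from the output above, each run lasting 10 seconds):

```python
# Derive ops/s, per-op latency, and QAT speedup from the openssl
# speed counts quoted in this thread (N operations in 10 seconds).
DURATION = 10.0
runs = {
    "sw sign":    10858,    # 2048-bit private RSA ops, no engine
    "sw verify":  361207,   # 2048-bit public RSA ops, no engine
    "qat sign":   205425,   # same, with -engine qat -async_jobs 32
    "qat verify": 2150270,
}

for name, count in runs.items():
    print(f"{name:10s}: {count / DURATION:9.1f} ops/s "
          f"({DURATION / count:.6f} s/op)")

print(f"QAT sign speedup:   {runs['qat sign'] / runs['sw sign']:.1f}x")
print(f"QAT verify speedup: {runs['qat verify'] / runs['sw verify']:.1f}x")
```

This reproduces the sign/s and verify/s columns of the tables above (e.g. 10858 / 10 = 1085.8 sign/s) and makes the relative gain explicit.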
QAT intermittent healthcheck errors
Hi,

In the process of evaluating the performance of Intel Quick Assist Technology in conjunction with HAProxy software I acquired an Intel C62x Chipset card for testing. I configured the QAT engine in the following manner:

* /etc/qat/c6xx_dev[012].conf

[GENERAL]
ServicesEnabled = cy
ConfigVersion = 2
CyNumConcurrentSymRequests = 512
CyNumConcurrentAsymRequests = 64
statsGeneral = 1
statsDh = 1
statsDrbg = 1
statsDsa = 1
statsEcc = 1
statsKeyGen = 1
statsDc = 1
statsLn = 1
statsPrime = 1
statsRsa = 1
statsSym = 1
KptEnabled = 0
StorageEnabled = 0
PkeServiceDisabled = 0
DcIntermediateBufferSizeInKB = 64

[KERNEL]
NumberCyInstances = 0
NumberDcInstances = 0

[SHIM]
NumberCyInstances = 1
NumberDcInstances = 0
NumProcesses = 16
LimitDevAccess = 0
Cy0Name = "UserCY0"
Cy0IsPolled = 1
Cy0CoreAffinity = 0

OpenSSL produces good results without warnings / errors:

* No QAT involved

$ openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 10858 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 361207 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1a FIPS  20 Nov 2018
built on: Tue Jan 22 20:43:41 2019 UTC
options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DZLIB -DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" -DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000921s 0.000028s   1085.8  36120.7

* QAT enabled

$ openssl speed -elapsed -engine qat -async_jobs 32 rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 205425 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 2150270 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1a FIPS  20 Nov 2018
built on: Tue Jan 22 20:43:41 2019 UTC
options:bn(64,64) md2(char) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wa,--noexecstack -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DZLIB -DNDEBUG -DPURIFY -DDEVRANDOM="\"/dev/urandom\"" -DSYSTEM_CIPHERS_FILE="/opt/openssl/etc/crypto-policies/back-ends/openssl.config"
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000049s 0.000005s  20542.5 215027.0

So far so good. Unfortunately HAProxy 1.8 with the QAT engine enabled periodically fails SSL checks of backend servers. The simplest configuration I could get to reproduce it:

* /etc/haproxy/haproxy.cfg

global
    user lbengine
    group lbengine
    daemon
    ssl-mode-async
    ssl-engine qat
    ssl-server-verify none
    stats socket /run/lb_engine/process-1.sock user lbengine group lbengine mode 660 level admin expose-fd listeners process 1

defaults
    mode http
    timeout check 5s
    timeout connect 4s

backend pool_all
    default-server inter 5s

    server server1 ip1:443 check ssl
    server server2 ip2:443 check ssl
    ...
    server serverN ipN:443 check ssl

Without QAT enabled everything works just fine - healthchecks do not flap. With the QAT engine enabled random server healthchecks flap: they fail and then shortly after they recover, e.g.

2019-03-06T15:06:22+01:00 localhost hapee-lb[1832]: Server pool_all/server1 is DOWN, reason: Layer6 timeout, check duration: 4000ms. 110 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2019-03-06T15:06:32+01:00 localhost hapee-lb[1832]: Server pool_all/server1 is UP, reason: Layer6 check passed, check duration: 13ms. 117 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

Increasing the check frequency (lowering the check interval) makes the problem occur more frequently. Does anybody have a clue why this is happening? Has anybody seen such behavior?

Regards,

Marcin Deranek
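PS: for reference, the sign/s figures from the two speed runs above work out to roughly a 19x RSA-2048 signing throughput gain with the QAT engine (just arithmetic on the numbers already quoted):

```shell
# RSA-2048 sign/s: 20542.5 with QAT vs 1085.8 in software, from the runs above
awk 'BEGIN { printf "sign speedup: %.1fx\n", 20542.5 / 1085.8 }'
# prints: sign speedup: 18.9x
```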
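In case it helps anyone trying to reproduce this, a quick way to tally the flaps per server is to count the DOWN/UP transitions in the log. A sketch (the heredoc stands in for the real log file; the path and sample lines are illustrative, not from a live system):

```shell
# Count DOWN/UP transitions per server. The heredoc below substitutes for
# the actual haproxy log file on a real setup.
cat <<'EOF' > /tmp/haproxy-sample.log
2019-03-06T15:06:22+01:00 localhost hapee-lb[1832]: Server pool_all/server1 is DOWN, reason: Layer6 timeout, check duration: 4000ms.
2019-03-06T15:06:32+01:00 localhost hapee-lb[1832]: Server pool_all/server1 is UP, reason: Layer6 check passed, check duration: 13ms.
EOF
# Extract just "Server <name> is DOWN|UP" and tally occurrences per server.
grep -Eo 'Server [^ ]+ is (DOWN|UP)' /tmp/haproxy-sample.log | sort | uniq -c
```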