Hi Marcin,
Do you have SSL enabled on the server side? If so, could you replace the
health check with a simple TCP check (without SSL)?
Regarding the show info/lsof output, it seems there are no more sessions on
the client side, but some SSL jobs remain (CurrSslConns), and I suspect the
health checks fail to clean up their SSL sessions when using QAT (this is
just an assumption).
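To make that comparison repeatable across the old workers, here is a minimal
sketch (my own illustration, not something from the HAProxy tree) that parses
the "show info" text and flags a stopping process which still counts SSL
connections while CurrConns is zero. The function names and the SAMPLE string
are hypothetical; the field names are the ones from the output quoted below.

```python
def parse_show_info(text):
    """Parse 'Key: value' lines from HAProxy 'show info' output into a dict."""
    info = {}
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of each line.
        line = line.lstrip("> ").strip()
        key, sep, value = line.partition(":")
        if sep and key:
            info[key.strip()] = value.strip()
    return info


def leaked_ssl_conns(info):
    """Return the number of SSL connections that look orphaned, else 0.

    Heuristic only (an assumption): a stopping process with CurrConns == 0
    but CurrSslConns > 0 suggests SSL sessions that were never released.
    """
    stopping = int(info.get("Stopping", 0))
    curr = int(info.get("CurrConns", 0))
    ssl = int(info.get("CurrSslConns", 0))
    if stopping and curr == 0 and ssl > 0:
        return ssl
    return 0


# Trimmed-down sample using the values reported for PID 34858.
SAMPLE = """\
CurrConns: 0
CurrSslConns: 20
Stopping: 1
Jobs: 25
"""

if __name__ == "__main__":
    print(leaked_ssl_conns(parse_show_info(SAMPLE)))  # prints 20
```

Feeding it the text returned by "@!<pid> show info" on the master CLI for each
old worker should show whether the leftover count follows the SSL health
checks once they are switched to plain TCP.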
R,
Emeric
On 4/12/19 4:43 PM, Marcin Deranek wrote:
> Hi Emeric,
>
> On 4/10/19 2:20 PM, Emeric Brun wrote:
>
>> On 4/10/19 1:02 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> Our process limit in the QAT configuration is quite high (128) and I was
>>> able to run 100+ openssl processes without a problem. According to Joel
>>> from Intel, the problem is in the cleanup code - presumably when HAProxy
>>> exits and frees up QAT resources. I will try to get more debug information.
>>
>> I've just taken a look.
>>
>> The engines' deinit function is called:
>>
>> haproxy/src/ssl_sock.c
>> #ifndef OPENSSL_NO_ENGINE
>> void ssl_free_engines(void) {
>>     struct ssl_engine_list *wl, *wlb;
>>     /* free up engine list */
>>     list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
>>         ENGINE_finish(wl->e);
>>         ENGINE_free(wl->e);
>>         LIST_DEL(&wl->list);
>>         free(wl);
>>     }
>> }
>> #endif
>> ...
>> #ifndef OPENSSL_NO_ENGINE
>> hap_register_post_deinit(ssl_free_engines);
>> #endif
>>
>> I don't know how many haproxy processes you are running but if I describe
>> the complete scenario of processes you may note that we reach a limit:
>
> It's very unlikely to be the limit, as I lowered the number of HAProxy
> processes (from 10 to 4) while keeping QAT NumProcesses equal to 32. HAProxy
> would hit this limit while spawning new instances, not while tearing down
> old ones. In such a case QAT would not be initialized for some HAProxy
> instances (you would see 1 thread vs 2 threads). More about threads below.
>
>> - the master sends a signal to the older processes; those processes will
>> unbind and stop accepting new conns but continue to serve the remaining
>> sessions until the end.
>> - new processes are started and immediately init the engine and accept
>> new conns.
>> - When no more sessions remain on an old process, it calls the deinit
>> function of the engine before exiting
>
> What I noticed is that each HAProxy process with QAT enabled has 2 threads
> (LWP) - it looks like QAT adds an extra thread to the process itself. Could
> that extra thread possibly mess up the HAProxy termination sequence?
> Our setup is to run HAProxy in multi-process mode - no threads (or 1 thread
> per process if you wish).
>
>> I also suppose that the old processes are stuck because there are some
>> sessions which never ended. Perhaps I'm wrong, but an strace on an old
>> process could be interesting to find out why those processes are stuck.
>
> strace only shows these:
>
> [pid 11392] 23:24:43.164619 epoll_wait(4, <unfinished ...>
> [pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0
> [pid 11392] 23:24:43.164761 epoll_wait(4, <unfinished ...>
> [pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0
> [pid 11392] 23:24:43.953286 epoll_wait(4, <unfinished ...>
> [pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0
> [pid 11392] 23:24:43.953419 epoll_wait(4, <unfinished ...>
> [pid 11392] 23:24:44.010508 <... epoll_wait resumed> [], 200, 57) = 0
> [pid 11392] 23:24:44.010589 epoll_wait(4, <unfinished ...>
>
> There are no connections: the stuck process only has a UDP socket on a
> random port:
>
> [root@externallb-124 ~]# lsof -p 6307|fgrep IPv4
> hapee-lb 6307 lbengine 83u IPv4 3598779351 0t0 UDP *:19573
>
>
>> You can also use the 'master CLI' using '-S' to check whether any sessions
>> remain on those older processes (doc is available in management.txt)
>
> Before reload
> * systemd
> Main PID: 33515 (hapee-lb)
> Memory: 1.6G
> CGroup: /system.slice/hapee-1.8-lb.service
> ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> └─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> * master CLI
> show proc
> #<PID> <type> <relative PID> <reloads> <uptime>
> 33515 master 0 0 0d 00h00m31s
> # workers
> 34858 worker 1 0 0d 00h00m31s
> 34859 worker 2 0 0d 00h00m31s
> 34860 worker 3 0 0d 00h00m31s
> 34861 worker 4 0 0d 00h00m31s
>
> After reload:
> * systemd
> Main PID: 33515 (hapee-lb)
> Memory: 3.1G
> CGroup: /system.slice/hapee-1.8-lb.service
> ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858
> 34859 34860 34861 -x /run/lb_engine/process-1.sock
> ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> ├─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
> ├─41871 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858
> 34859 34860 34861 -x /run/lb_engine/process-1.sock
> ├─41872 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858
> 34859 34860 34861 -x /run/lb_engine/process-1.sock
> ├─41873 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858
> 34859 34860 34861 -x /run/lb_engine/process-1.sock
> └─41874 /opt/hapee-1.8/sbin/hapee-lb -Ws -f
> /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858
> 34859 34860 34861 -x /run/lb_engine/process-1.sock
> * master CLI
> show proc
> #<PID> <type> <relative PID> <reloads> <uptime>
> 33515 master 0 1 0d 00h01m33s
> # workers
> 41871 worker 1 0 0d 00h00m45s
> 41872 worker 2 0 0d 00h00m45s
> 41873 worker 3 0 0d 00h00m45s
> 41874 worker 4 0 0d 00h00m45s
> # old workers
> 34858 worker [was: 1] 1 0d 00h01m33s
> 34859 worker [was: 2] 1 0d 00h01m33s
> 34860 worker [was: 3] 1 0d 00h01m33s
> 34861 worker [was: 4] 1 0d 00h01m33s
>
> and
>
> @!34858 show info
> Name: HAProxy
> Version: 1.8.0-2.0.0-195.793
> Release_date: 2019/03/19
> Nbthread: 1
> Nbproc: 4
> Process_num: 1
> Pid: 34858
> Uptime: 0d 0h03m24s
> Uptime_sec: 204
> Memmax_MB: 0
> PoolAlloc_MB: 1
> PoolUsed_MB: 1
> PoolFailed: 0
> Ulimit-n: 2006423
> CurrConns: 0
> CumConns: 354
> CumReq: 342
> CurrSslConns: 20
> CumSslConns: 35928
> Maxpipes: 0
> PipesUsed: 0
> PipesFree: 0
> ConnRate: 0
> ConnRateLimit: 0
> MaxConnRate: 65
> SessRate: 0
> SessRateLimit: 0
> MaxSessRate: 62
> SslRate: 0
> SslRateLimit: 0
> MaxSslRate: 52
> SslFrontendKeyRate: 0
> SslFrontendMaxKeyRate: 52
> SslFrontendSessionReuse_pct: 0
> SslBackendKeyRate: 0
> SslBackendMaxKeyRate: 2988
> SslCacheLookups: 0
> SslCacheMisses: 0
> CompressBpsIn: 0
> CompressBpsOut: 0
> CompressBpsRateLim: 0
> Tasks: 5849
> Run_queue: 1
> Idle_pct: 100
> Stopping: 1
> Jobs: 25
> Unstoppable Jobs: 4
> Listeners: 4
> DroppedLogs: 0
>
> Regards,
>
> Marcin Deranek