If I were facing this symptom, I'd capture a couple of pstack <slapd PID> 
outputs while the problem is occurring (and maybe correlate them with 
perf top -p <slapd PID> if the pstacks are not enough).
That should help avoid guesswork.
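
A minimal sketch of that capture step, in case it helps. It assumes pstack is 
installed and that the script runs as a user allowed to attach to slapd. The 
function name capture_stacks and its parameters are made up for illustration. 
Where pstack is not available, gdb -p <slapd PID> -batch -ex "thread apply all bt" 
is the usual substitute.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def capture_stacks(pid, count=3, interval=2.0, outdir=".", run=subprocess.run):
    """Run `pstack <pid>` `count` times, `interval` seconds apart.

    Each dump is written to its own timestamped file so that consecutive
    dumps can be compared. `run` is injectable only to make testing easy.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
        # Assumption: `pstack` is on PATH and prints all thread stacks to stdout.
        result = run(["pstack", str(pid)], capture_output=True, text=True)
        path = out / f"pstack-{pid}-{stamp}-{i}.txt"
        path.write_text(result.stdout)
        written.append(path)
        if i + 1 < count:
            time.sleep(interval)
    return written
```

Calling capture_stacks(<slapd PID>, count=3, interval=2) while the stall is in 
progress gives three dumps. A frame that stays the same in the busy thread 
across all of them is the place to look first.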

++Cyrille

-----Original Message-----
From: Simone Piccardi [mailto:[email protected]] 
Sent: Tuesday, November 3, 2020 6:41 PM
To: [email protected]
Subject: Connections blocked for some tens of seconds while a single slapd 
thread running 100%

Hi,

we are seeing a quite strange behaviour in which a slapd server stops 
processing connections for some tens of seconds while a single thread is 
running at 100% on a single CPU and all other CPUs are almost idle.
When the problem arises there is no significant iowait or disk I/O (and no 
swapping, which is disabled). Context switches drop to near zero (from some 
tens of thousands to some hundreds). Load average is almost always under 2.

The server has 32G of RAM and 4 HT processors, and is running
openldap-2.4.54 in mirror mode (but no delta replication) using the mdb 
backend. The same behaviour was also found with 2.4.53. OpenLDAP is the only 
service running on it, apart from SSH and some monitoring tools.
Database maxsize is 25G, of which around 17G are used.

I'm attaching a redacted configuration of the main server (the secondary one is 
the same, with the IDs swapped for mirror mode use).

Most of the time it works just fine, processing up to a few thousand read 
queries per second along with some tens of writes per second.
Connections are managed by HAProxy, which sends them to this server by default 
(it is used as the main node). Many times these stops are short (around 10 
seconds) and we don't lose connections, but when the problem arises and lasts 
long enough, HAProxy switches to the second node and we get downtime. Staying 
on the secondary node, we see the same behaviour.

The problem manifests itself without any periodicity, and looking at the number 
of connections before it occurs we could not see any usage peak. We tried to 
strace the slapd threads during the problem, and they seem to be blocked on a 
mutex, waiting for the one running at 100% (on a single CPU, user time).
I'm attaching top output taken during one of these events.
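
For reference, the busy thread can also be identified by TID straight from 
/proc, so that it can be matched against the LWP numbers in a stack dump. This 
is only a sketch, assuming Linux and the /proc/<pid>/task/<tid>/stat layout 
described in proc(5). The function names are made up for illustration.

```python
import os
import time
from pathlib import Path


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one task stat line."""
    # The comm field is parenthesised and may contain spaces,
    # so split on the last ')' before splitting on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def sample(pid):
    """Return {tid: cpu ticks} for every thread of the process."""
    ticks = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            ticks[int(task.name)] = cpu_ticks((task / "stat").read_text())
        except (FileNotFoundError, ProcessLookupError):
            pass  # the thread exited between listing and reading
    return ticks


def busiest_thread(before, after, elapsed, hz):
    """Return (tid, cpu_percent) of the thread that used most CPU, or None."""
    usage = {
        tid: (after[tid] - before[tid]) / hz / elapsed * 100.0
        for tid in after
        if tid in before
    }
    if not usage:
        return None
    tid = max(usage, key=usage.get)
    return tid, usage[tid]


def hot_thread(pid, interval=1.0):
    """Sample the process twice, `interval` seconds apart, and report the busiest thread."""
    before = sample(pid)
    time.sleep(interval)
    after = sample(pid)
    return busiest_thread(before, after, interval, os.sysconf("SC_CLK_TCK"))
```

Running hot_thread(<slapd PID>) during a stall should return the TID of the 
thread sitting near 100%, which is the one whose stack is worth reading.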

From the behaviour I was suspecting (just a wild and uninformed guess) some 
indexing issue, blocking all access.

We tried to change tool-threads to 4 because I found it cited in some examples 
as related to the threads used for indexing, but the change had no effect. 
Re-reading the latest version of the man page, if I understand it correctly, it 
is effective only for slapadd etc.

So a first question is: is there any other configuration parameter about 
indexing that I can try?

Anyway, I'm not sure whether there is actually an indexing issue (the indexes 
are quite basic). I was suspecting this because there are a lot of writes, and 
there is no strace activity during the stop. Should I look somewhere else?

Any suggestion on further checks or configuration changes will be more than 
appreciated.

Regards
Simone
