Hi,
we are seeing quite strange behaviour in which a slapd server stops
processing connections for some tens of seconds while a single thread
runs at 100% on one CPU and all the other CPUs are almost idle.
When the problem arises there is no significant iowait or disk I/O (and
no swapping, which is disabled). Context switches drop to nearly zero
(from some tens of thousands to a few hundred). Load average is almost
always under 2.
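The drop in context switches could be logged with a timestamp so that it can be lined up with the slapd log afterwards. What follows is only a minimal sketch, assuming a Linux /proc filesystem; the function names and the threshold of 1000 switches per second are illustrative choices, not measured values.

```python
import time


def parse_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def ctxt_rate(before, after, seconds):
    """Context switches per second between two cumulative samples."""
    return (after - before) / seconds


def read_ctxt(path="/proc/stat"):
    """Read the current system-wide context-switch counter."""
    with open(path) as handle:
        return parse_ctxt(handle.read())


def watch(samples, interval=1.0, threshold=1000):
    """Print a timestamped line whenever the rate falls below `threshold`.

    `threshold` is a hypothetical cut-off; pick one well below the normal
    rate seen on the server.
    """
    previous = read_ctxt()
    for _ in range(samples):
        time.sleep(interval)
        current = read_ctxt()
        rate = ctxt_rate(previous, current, interval)
        if rate < threshold:
            print(time.strftime("%H:%M:%S"), "low context-switch rate:", int(rate))
        previous = current
```

Calling `watch(3600)` would sample once per second for an hour. The counter is system-wide, so it only makes sense on a host where slapd is the dominant workload, as it is here.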
The server has 32G of RAM and 4 HT processors, and is running
openldap-2.4.54 in mirror mode (but without delta replication) using the
mdb backend. The same behaviour was also seen with 2.4.53. OpenLDAP is
the only service running on it, apart from SSH and some monitoring tools.
Database maxsize is 25G, of which around 17G is used.
I'm attaching a redacted configuration of the main server (the secondary
one is the same, with the IDs swapped for mirror mode use).
Most of the time it works just fine, processing up to a few thousand
read queries per second along with some tens of writes per second.
Connections are managed by HAProxy, which sends them to this server by
default (it is used as the main node). Many of these stops are short
(around 10 seconds) and we don't lose connections, but when the problem
lasts long enough, HAProxy switches to the second node and we get
downtime. While running on the secondary node we see the same behaviour.
The problem shows no periodicity, and looking at the number of
connections before each event we could not see any usage peak. We tried
to strace the slapd threads during the problem, and they seem to be
blocked on a mutex, waiting for the one thread running at 100% (on a
single CPU, in user time).
I'm attaching top output from one of these events.
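To pin down which slapd thread is the one spinning, the per-thread CPU counters under /proc can be sampled twice and compared. This is a minimal sketch, assuming Linux and the field layout documented in proc(5); all function names are made up for illustration.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/task/<tid>/stat line."""
    # The command name is in parentheses and may itself contain spaces or
    # parentheses, so cut at the last ')' instead of splitting everything.
    fields = stat_line[stat_line.rindex(")") + 1:].split()
    return int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)


def sample_threads(pid):
    """Map thread id -> (utime, stime) for every thread of the process."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as handle:
                ticks[int(tid)] = parse_cpu_ticks(handle.read())
        except FileNotFoundError:
            pass  # thread exited between listdir() and open()
    return ticks


def busiest_thread(before, after):
    """Return (tid, user_delta, sys_delta) for the thread that used most CPU."""
    best = None
    for tid, (utime, stime) in after.items():
        if tid not in before:
            continue  # thread started after the first sample
        user = utime - before[tid][0]
        system = stime - before[tid][1]
        if best is None or user + system > best[1] + best[2]:
            best = (tid, user, system)
    return best


def find_hot_thread(pid, interval=1.0):
    """Sample twice, `interval` seconds apart, and report the hottest thread."""
    before = sample_threads(pid)
    time.sleep(interval)
    return busiest_thread(before, sample_threads(pid))
```

For the process in the attached top output this would be called as `find_hot_thread(21439)`. The returned thread id is the same number that `top -H` shows and that `strace -p` accepts, and a large user delta with a near-zero system delta would match the "100% user time" picture described above.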
From the behaviour I suspected (just a wild and uninformed guess)
some indexing issue blocking all access.
We tried changing tool-threads to 4 because I found it cited in some
examples as related to the threads used for indexing, but the change had
no effect. Re-reading the latest version of the man page, if I understand
it correctly, it only affects slapadd etc.
So a first question is: is there any other configuration parameter
related to indexing that I can try?
Anyway, I'm not sure there really is an indexing issue (the indexes are
quite basic). I suspected it because there are a lot of writes, and
strace shows no activity during the stop. Should I look somewhere else?
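Since the busy thread makes no system calls, strace cannot show what it is doing. A user-space stack dump taken during a stall might. The sketch below is one possible way to collect it, assuming gdb is installed, the caller is allowed to attach to slapd, and debug symbols are available for readable frames; the helper names are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb invocation that prints every thread's stack and exits."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # run the -ex commands, then detach and exit
        "-ex", "thread apply all bt",  # backtrace of all threads
    ]


def dump_stacks(pid, timeout=60):
    """Run gdb against `pid` and return its textual output."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Note that attaching gdb pauses slapd while the stacks are collected, so on a production node this is best done only while a stall is already in progress.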
Any suggestions on further checks or configuration changes would be more
than appreciated.
Regards
Simone
#
# See slapd.conf(5) for details on configuration options.
# This file should NOT be world readable.
#
include /usr/local/openldap/etc/openldap/schema/corba.schema
include /usr/local/openldap/etc/openldap/schema/core.schema
include /usr/local/openldap/etc/openldap/schema/cosine.schema
include /usr/local/openldap/etc/openldap/schema/duaconf.schema
include /usr/local/openldap/etc/openldap/schema/dyngroup.schema
include /usr/local/openldap/etc/openldap/schema/inetorgperson.schema
include /usr/local/openldap/etc/openldap/schema/java.schema
include /usr/local/openldap/etc/openldap/schema/misc.schema
include /usr/local/openldap/etc/openldap/schema/nis.schema
include /usr/local/openldap/etc/openldap/schema/openldap.schema
include /usr/local/openldap/etc/openldap/schema/ppolicy.schema
include /usr/local/openldap/etc/openldap/schema/collective.schema
#add OurOrganization schema
include /usr/local/openldap/etc/openldap/schema/OurOrganization.schema
# Allow LDAPv2 client connections. This is NOT the default.
allow bind_v2
# This is for mirrormode replication
serverID 11
# Global ACLs
include /usr/local/openldap/etc/openldap/acls/global.acl
# Do not enable referrals until AFTER you have a working directory
# service AND an understanding of referrals.
#referral ldap://root.openldap.org
pidfile /usr/local/openldap/var/run/slapd.pid
argsfile /usr/local/openldap/var/run/slapd.args
# options: none sync parse shell stats2 stats ACL config filter BER conns args
# packets trace any
# https://www.openldap.org/doc/admin24/slapdconfig.html
#loglevel none
#loglevel stats sync
loglevel stats
#loglevel none
#loglevel any
# The next three lines allow use of TLS for encrypting connections using a
# dummy test certificate which you can generate by running
# /usr/libexec/openldap/generate-server-cert.sh. Your client software may balk
# at self-signed certificates, however.
TLSCACertificatePath /usr/local/openldap/etc/openldap/certs
TLSCACertificateFile /usr/local/openldap/etc/openldap/certs/rootCA.pem
TLSCertificateFile /usr/local/openldap/etc/openldap/certs/server.crt
TLSCertificateKeyFile /usr/local/openldap/etc/openldap/certs/server.key
#TLSCertificateFile /etc/pki/tls/certs/ldap1_pubkey.pem
#TLSCertificateKeyFile /etc/pki/tls/certs/ldap1_privkey.pem
sizelimit 250000
# Setup the idle timeout to prevent app servers from taking down ldap.
# log out idle clients after 10 seconds
idletimeout 10
#######################################################################
# database definitions
#######################################################################
#######################################################################
# Monitor
#######################################################################
database monitor
include /usr/local/openldap/etc/openldap/acls/monitor.acl
rootdn "uid=monitor,cn=Monitor"
rootpw ZZZ
#######################################################################
# Database specific directives apply to this database until another
# 'database' directive occurs
#######################################################################
database mdb
suffix "o=ourorg"
# Where the database files are physically stored
#directory /usr/local/openldap/var/openldap-data
directory /data/openldap-data
rootdn "uid=root,cn=special,o=ourorg"
rootpw {SSHA}XXX
monitoring on
maxsize 25769803776
envflags writemap nometasync
# Ourorg settings: we want uid,cn, and uniqueMember indexed
# Indexing options for database
index uid eq
index cn eq
index objectClass eq
index uniqueMember eq
index entryCSN,entryUUID eq
tool-threads 4
#########################################################################
# FST db specific ACLs
#########################################################################
include /usr/local/openldap/etc/openldap/acls/fst.acl
# Give unlimited access to search this database for syncrepl
limits dn.exact="uid=syncuser,cn=special,o=ourorg"
  size.hard=unlimited
  size.soft=unlimited
  time.hard=unlimited
  time.soft=unlimited
limits dn.exact="uid=slaveuser,cn=special,o=ourorg"
  size.hard=unlimited
  size.soft=unlimited
  time.hard=unlimited
  time.soft=unlimited
# Syncrepl Provider for ourorg db
overlay syncprov
# update the contextCSN in the database after either
# 100 successful write operations OR
# more than 10 minutes have elapsed
# since the last time the contextCSN was written to the database
syncprov-checkpoint 100 10
# Syncrepl provider maintains a record of last 100 successful write operations
# The current design of the session log store is memory based
syncprov-sessionlog 100
############################################################################
# Syncrepl consumer directives
############################################################################
syncrepl rid=12
  provider=ldaps://ldp-12.ourorg.org
  tls_reqcert=never
  bindmethod=simple
  binddn="uid=syncuser,cn=special,o=ourorg"
  credentials=YYY
  searchbase="o=ourorg"
  schemachecking=on
  type=refreshAndPersist
  retry="60 +"
#############################################################################
# MirrorMode setup
#############################################################################
mirrormode on
# The lastmod overlay dynamically generates an entry with RDN "cn=Lastmod",
# rooted at the underlying database suffix, that contains the relevant info
# about the last modification that occurred in the underlying database.
lastmod on
top - 09:25:26 up 14 days, 9:39, 1 user, load average: 0.63, 0.59, 0.57
Tasks: 155 total, 2 running, 99 sleeping, 0 stopped, 1 zombie
Cpu0 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.3%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32466708k total, 17732364k used, 14734344k free, 438012k buffers
Swap: 0k total, 0k used, 0k free, 15743896k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
21439 ldap      20   0 25.6g  12g  12g S 99.8 41.8  5606:40 slapd
24518 root      39  19  7732 5260  884 S  0.7  0.0  1:53.74 apps.plugin
 2325 zabbix    20   0 99.2m 3444 2496 R  0.3  0.0 39:01.31 zabbix_agentd
24294 netdata   39  19  154m  82m 2580 S  0.3  0.3  0:58.63 netdata
24512 netdata   39  19  152m  25m 7196 S  0.3  0.1  0:12.71 python
29208 spiccard  20   0 15368 2308 1956 R  0.3  0.0  0:00.02 top
    1 root      20   0 19696 2580 2256 S  0.0  0.0  0:01.61 init
    2 root      20   0     0    0    0 S  0.0  0.0  0:00.09 kthreadd
    4 root       0 -20     0    0    0 I  0.0  0.0  0:00.00 kworker/0:0H
    6 root       0 -20     0    0    0 I  0.0  0.0  0:00.00 mm_percpu_wq