Reduced performance after restart because of replication inconsistency

Bergmann, Clemens Thu, 14 Aug 2025 04:33:15 -0700

Hi,

For a few weeks now we have been operating an OpenLDAP deployment using N-Way 
Multi-Provider delta-syncrepl replication. There are around 130000 entries in 
the main database, and it is around 635 MB in size. Currently the replication 
setup consists of two VMs, each with 4 GB of RAM and 2 CPUs. The VMs are 
running slapd 2.6.10. I posted the configuration at [1]; I only removed the 
credentials and some user-specific ACLs.


This setup worked flawlessly for some time until both servers rebooted during a 
scheduled patch cycle. Since then, we have seen drastically increased response 
times and CPU utilization on both VMs. On one of the servers (ldap08) I see the 
following log entry every few seconds:
  do_syncrep2: rid=910 (4096) Content Sync Refresh Required

When I compare the contextCSN values of both servers, they differ slightly, but 
by less than 5 seconds at most. They do, however, change constantly because 
these servers are used for login and store last-login and intruder-detection 
information, which must be replicated. Most of the other data is static, but 
there are some changes (changed passwords, name changes) every few minutes or 
so. For example, a few minutes ago we had the following values:

20250717103207.135309Z#000000#000#000000
20250717115153.689217Z#000000#06a#000000
20250814105455.935611Z#000000#06b#000000
20250814105522.282937Z#000000#06c#000000

I understand that the 000 entry is from before enabling replication and the 06a 
value is from an old server no longer belonging to this replication (ldap05).
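
To make the comparison less error-prone, the values can be split into their 
fields with a few lines of Python. This is only a sketch: the helper names 
parse_csn and lag_per_sid are hypothetical, and it assumes the usual CSN layout 
of timestamp#count#sid#mod with the last three fields in hexadecimal. The list 
handed to lag_per_sid for the second server would have to be read from that 
server's contextCSN attribute; only the values quoted above are used here.

```python
from datetime import datetime, timezone


def parse_csn(csn: str) -> dict:
    """Split one contextCSN value into timestamp, count, SID and mod."""
    stamp, count, sid, mod = csn.strip().split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    return {
        "time": when.replace(tzinfo=timezone.utc),  # CSN timestamps are UTC ("Z")
        "count": int(count, 16),
        "sid": int(sid, 16),  # server ID, e.g. 0x06b
        "mod": int(mod, 16),
    }


def lag_per_sid(csns_a: list, csns_b: list) -> dict:
    """Absolute difference in seconds per SID, for SIDs present on both sides."""
    a = {c["sid"]: c["time"] for c in map(parse_csn, csns_a)}
    b = {c["sid"]: c["time"] for c in map(parse_csn, csns_b)}
    return {sid: abs((a[sid] - b[sid]).total_seconds()) for sid in a.keys() & b.keys()}


# The four values quoted above, as read from one server.
values = [
    "20250717103207.135309Z#000000#000#000000",
    "20250717115153.689217Z#000000#06a#000000",
    "20250814105455.935611Z#000000#06b#000000",
    "20250814105522.282937Z#000000#06c#000000",
]

for csn in map(parse_csn, values):
    print(f"sid={csn['sid']:03x} last change {csn['time'].isoformat()}")
```

Calling lag_per_sid(values_from_first_server, values_from_second_server) then 
gives one number per SID, which is easier to watch over time than comparing the 
raw strings by eye.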

When I last had a situation like this, the servers were not in production, so I 
shut both down, copied the database over and started them up again. This is not 
an option now, as they are in production and needed for login. Preventing 
updates for longer than a few minutes is also not an option, and even that has 
to be announced ahead of time. When I last tried adding a brand-new server, 
configured in a similar way, in testing, it took multiple hours to get the new 
server up to speed. I fear that removing one of the two servers for multiple 
hours would overwhelm the remaining server with requests. In the future the 
plan is to have at least 3 servers in this replication setup, but currently we 
only have two. It is, however, an option to prepare new server(s) and add them 
to the replication setup if that might help somehow.

One more piece of information: the accesslog database is currently around 2 GB 
in size.

What would be the best approach to remediate this situation?

[1] https://next.hessenbox.de/index.php/s/jFX9gAEWXoqoxNS


Kind regards
Clemens (Bergmann)

-- 
Clemens Bergmann
[er/ihm; he/him]
Gruppe Nutzermanagement und Entwicklung
Technische Universität Darmstadt
Hochschulrechenzentrum, Alexanderstraße 2, 64283 Darmstadt
Tel. +49 6151 16 71184
http://www.hrz.tu-darmstadt.de/
