RE: Reduced performance after restart because of replication inconsistency

Windl, Ulrich Mon, 18 Aug 2025 01:59:46 -0700

Hi!

I'm definitely not an expert (see also 
https://serverfault.com/q/1177576/407952), but the symptoms suggest to me that 
either the servers were not shut down cleanly (I guess one was shut down, 
patched, and started again, and then the other one), or there were local 
changes on each server before replication was working. Now it seems you need a 
manual content sync.
Your replication log isn't full, BTW? Also what is the cumulative size of your 
MDB databases compared to the RAM you have?
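
To put rough numbers on the size question, here is a small Python sketch using 
the figures from your mail below (635 MB main database, about 2 GB accesslog, 
4 GB RAM per VM). Treating 1 GB as 1024 MB is my assumption, and the helper 
name is made up for illustration:

```python
# Figures quoted in the original mail; 1 GB = 1024 MB is my assumption.
MAIN_DB_MB = 635
ACCESSLOG_MB = 2 * 1024
RAM_MB = 4 * 1024


def db_to_ram_ratio(db_sizes_mb, ram_mb):
    """Return the cumulative database size as a fraction of RAM."""
    return sum(db_sizes_mb) / ram_mb


if __name__ == "__main__":
    ratio = db_to_ram_ratio([MAIN_DB_MB, ACCESSLOG_MB], RAM_MB)
    print(f"databases occupy about {ratio:.0%} of RAM")
```

With these numbers the two databases add up to roughly two thirds of the RAM, 
which leaves little headroom once slapd itself and the OS are counted.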


I think (see disclaimer at start) shutting down one server would be enough for 
a manual content sync:
Shut down the server, then delete the database and the changelog database. 
Import (slapadd) a reasonably recent export from the other server (which 
options to use, BTW?). This will grow the changelog tremendously.
I'm unsure whether you should delete the changelog database again before 
restarting the node or not; maybe experts can tell.
Then the newly started node should pull the outstanding changes from the other 
node.
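
As a starting point for discussion, the steps above could be sketched as a 
dry run in Python that only prints the commands and executes nothing. The 
config directory, suffix and file name are hypothetical placeholders (they are 
not taken from the posted configuration), and the choice of the slapadd 
options -q and -w is my guess, not a confirmed recommendation:

```python
import shlex

# Hypothetical placeholders; adjust to the real deployment.
CONFIG_DIR = "/etc/openldap/slapd.d"
SUFFIX = "dc=example,dc=org"
EXPORT_FILE = "/var/tmp/main-export.ldif"


def reseed_plan(config_dir, suffix, export_file):
    """Return the command lines for the manual content sync, in order."""
    return [
        # On the healthy provider: export the main database to LDIF.
        ["slapcat", "-F", config_dir, "-b", suffix, "-l", export_file],
        # On the node being reseeded, with slapd stopped, the old main and
        # changelog database files moved away, and the LDIF copied over:
        # -q is quick mode (fewer consistency checks), -w writes the
        # syncrepl context information (contextCSN) after the import.
        ["slapadd", "-F", config_dir, "-b", suffix, "-q", "-w",
         "-l", export_file],
    ]


if __name__ == "__main__":
    for cmd in reseed_plan(CONFIG_DIR, SUFFIX, EXPORT_FILE):
        print(shlex.join(cmd))  # dry run: print only
```

Stopping and starting slapd and moving the database files are left out on 
purpose, since service names and paths differ between distributions.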

Kind regards,
Ulrich Windl

> -----Original Message-----
> From: Bergmann, Clemens <clemens.bergm...@tu-darmstadt.de>
> Sent: Thursday, August 14, 2025 1:33 PM
> To: openldap-technical@openldap.org
> Subject: [EXT] Reduced performance after restart because of replication
> inconsistency
> 
> Hi,
> 
> For some weeks now we have been operating an OpenLDAP deployment in N-Way
> Multi-Provider Delta-syncrepl Replication. There are around 130000 entries in
> the main database, and it is around 635 MB in size. Currently the replication
> consists of two VMs, each with 4 GB of RAM and 2 CPUs. The VMs are running
> slapd 2.6.10. I posted the configuration on [1]. I only removed the
> credentials and some user-specific ACLs.
> 
> This setup worked flawlessly for some time until both servers rebooted
> during a scheduled patch cycle. Since then, we have seen drastically increased
> response times and CPU utilization on both VMs. On one of the servers
> (ldap08) I see the following log entry every few seconds:
>   do_syncrep2: rid=910 (4096) Content Sync Refresh Required
> 
> When I compare the contextCSN values of both servers, they differ a little,
> but by less than 5 seconds at most. They do however change constantly because
> these servers are used for login and store the last login and intruder
> detection information, which must be replicated. Most of the other data is
> static, but there are some changes (changed passwords, name changes)
> every few minutes or so. For example, a few minutes ago we had the
> following values:
> 
> 20250717103207.135309Z#000000#000#000000
> 20250717115153.689217Z#000000#06a#000000
> 20250814105455.935611Z#000000#06b#000000
> 20250814105522.282937Z#000000#06c#000000
> 
> I understand that the 000 entry is from before enabling replication and the
> 06a value is from an old server no longer belonging to this replication
> (ldap05).
> 
> When I last had a situation like this, the servers were not in production and
> I shut both down, copied the database over and started them up again. This is
> not an option now as they are in production and needed for login. Also,
> preventing updates for longer than a few minutes is not an option, and even
> this has to be announced ahead of time. When I last tried adding a brand
> new server cluster configured in a similar way in testing, it took multiple
> hours to get the new server up to speed. I fear that removing one of the two
> servers for multiple hours would overwhelm the remaining server with
> requests. In the future the plan is to have at least 3 servers in this
> replication, but currently we only have two. It is however an option to
> prepare new server(s) and add them to the replication if that might help
> somehow.
> 
> One other piece of information is that the accesslog database is currently
> around 2 GB in size.
> 
> What would be the best approach to remediate this situation?
> 
> [1] https://next.hessenbox.de/index.php/s/jFX9gAEWXoqoxNS
> 
> 
> Kind regards
> Clemens (Bergmann)
> 
> --
> Clemens Bergmann
> [er/ihm; he/him]
> Gruppe Nutzermanagement und Entwicklung
> Technische Universität Darmstadt
> Hochschulrechenzentrum, Alexanderstraße 2, 64283 Darmstadt
> Tel. +49 6151 16 71184
> http://www.hrz.tu-darmstadt.de/
