On Thu, Aug 14, 2025 at 11:32:45AM +0000, Bergmann, Clemens wrote:
> Hi,
> 
> for some weeks we have been operating an OpenLDAP deployment in N-Way
> Multi-Provider Delta-syncrepl Replication. There are around 130000
> entries in the main database, and it is around 635 MB in size.
> Currently the replication consists of two VMs, each with 4 GB of RAM
> and 2 CPUs. The VMs are running slapd 2.6.10. I posted the
> configuration at [1]. I only removed the credentials and some
> user-specific ACLs.
> 
> This setup worked flawlessly for some time until both servers rebooted
> during a scheduled patch cycle. Since then, we have seen drastically
> increased response times and CPU utilization on both VMs. On one of
> the servers (ldap08) I see the following log entry every few seconds:
>   do_syncrep2: rid=910 (4096) Content Sync Refresh Required

Hi Clemens,
check your contextCSNs and make sure they are all in sync (the DBs *and*
their corresponding accesslogs as well). If your servers (through
misconfiguration or otherwise) missed a contextCSN update to accesslog
and that CSN never got updated afterwards, deltasync will have to keep
falling back to plain syncrepl (the client has a cookie indicating data
from the "future").
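
As a rough illustration of that comparison, here is a minimal Python
sketch. It assumes you have already collected the contextCSN values of a
database and of its accesslog (for example with ldapsearch) as plain
strings; the function names are made up for this example and are not
part of any OpenLDAP tooling.

```python
from typing import Dict, Iterable, List


def csns_by_sid(csns: Iterable[str]) -> Dict[str, str]:
    """Map each server ID (SID) to the newest contextCSN seen for it."""
    newest: Dict[str, str] = {}
    for raw in csns:
        csn = raw.strip()
        # A CSN looks like: 20250814113245.123456Z#000000#001#000000
        # Fields: timestamp # change count # SID # modification number
        parts = csn.split("#")
        if len(parts) != 4:
            raise ValueError(f"unexpected CSN format: {csn!r}")
        sid = parts[2]
        # The fields are fixed width, so string comparison orders CSNs.
        if sid not in newest or csn > newest[sid]:
            newest[sid] = csn
    return newest


def csn_mismatches(main_db: Iterable[str], accesslog: Iterable[str]) -> List[str]:
    """Return the SIDs whose contextCSN differs between a DB and its accesslog."""
    main, log = csns_by_sid(main_db), csns_by_sid(accesslog)
    return sorted(
        sid for sid in main.keys() | log.keys() if main.get(sid) != log.get(sid)
    )


if __name__ == "__main__":
    # Made-up values for illustration only.
    main = [
        "20250814113240.000001Z#000000#001#000000",
        "20250814113245.000002Z#000000#002#000000",
    ]
    log = [
        "20250814113240.000001Z#000000#001#000000",
        "20250814113100.000000Z#000000#002#000000",
    ]
    print(csn_mismatches(main, log))  # ['002']
```

On a busy server the two sets are read at slightly different moments, so
a mismatch only matters if it persists across repeated checks.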

> When I compare the contextCSNs of both servers they differ a little,
> but by less than 5 seconds at most. They do however change constantly
> because these servers are used for login and store the last login and
> intruder detection information, which must be replicated. Most of the
> other data is static but there are some changes (changed passwords,
> name changes) every few minutes or so. For example, a few minutes ago
> we had the following values:

First I would examine the I/O status of the server: is there congestion?
Is it on the read or the write side? In general, read-side congestion
suggests low RAM, while write-side congestion suggests configuration
issues or platform I/O limitations. Also, if you're limited on write
I/O, your cluster will keep struggling however many servers you spin up.
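
If you have no monitoring in place yet, something along these lines can
give a first impression on Linux. It is only a sketch: it assumes two
snapshots of /proc/diskstats taken some seconds apart, and the helper
names and the device name "sda" are placeholders for this example. It
computes the average time per completed read and per completed write,
which is roughly what iostat reports as r_await and w_await.

```python
from typing import Dict, Tuple

# (reads completed, ms spent reading, writes completed, ms spent writing)
Stats = Tuple[int, int, int, int]


def parse_diskstats(text: str) -> Dict[str, Stats]:
    """Parse the contents of /proc/diskstats into per-device counters."""
    stats: Dict[str, Stats] = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 11:
            continue
        # f[2] device name, f[3] reads completed, f[6] ms spent reading,
        # f[7] writes completed, f[10] ms spent writing
        stats[f[2]] = (int(f[3]), int(f[6]), int(f[7]), int(f[10]))
    return stats


def average_waits(before: Dict[str, Stats], after: Dict[str, Stats],
                  device: str) -> Tuple[float, float]:
    """Return (ms per read, ms per write) between two snapshots."""
    r0, rms0, w0, wms0 = before[device]
    r1, rms1, w1, wms1 = after[device]
    reads, writes = r1 - r0, w1 - w0
    read_wait = (rms1 - rms0) / reads if reads else 0.0
    write_wait = (wms1 - wms0) / writes if writes else 0.0
    return read_wait, write_wait


if __name__ == "__main__":
    import time

    with open("/proc/diskstats") as fh:
        first = parse_diskstats(fh.read())
    time.sleep(10)
    with open("/proc/diskstats") as fh:
        second = parse_diskstats(fh.read())
    # "sda" is a placeholder; use the device that holds your databases.
    print(average_waits(first, second, "sda"))
```

A write wait that is consistently much higher than the read wait points
to the write side; what counts as "high" depends on your storage.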

If it's write I/O, have a go at reading your accesslog and checking
whether there is a major source of writes. Attributes like
pwdLastSuccess can have their granularity limited, which often greatly
reduces write traffic.
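
To get a quick picture of what is being written, you could tally the
attributes named in the reqMod values of an LDIF dump of the accesslog.
The sketch below assumes such a dump is available as text (for example
from ldapsearch or slapcat); the function names are invented for this
example.

```python
import base64
from collections import Counter
from typing import Iterable, Iterator, Optional


def unfold(lines: Iterable[str]) -> Iterator[str]:
    """Join LDIF continuation lines (those starting with a single space)."""
    current: Optional[str] = None
    for line in lines:
        if line.startswith(" ") and current is not None:
            current += line[1:]
            continue
        if current is not None:
            yield current
        current = line
    if current is not None:
        yield current


def count_modified_attrs(ldif_text: str) -> Counter:
    """Count how often each attribute appears in reqMod values."""
    counts: Counter = Counter()
    for line in unfold(ldif_text.splitlines()):
        if line.startswith("reqMod:: "):
            # Base64-encoded value
            raw = base64.b64decode(line[len("reqMod:: "):])
            value = raw.decode("utf-8", "replace")
        elif line.startswith("reqMod: "):
            value = line[len("reqMod: "):]
        else:
            continue
        # reqMod values look like "attr:+ value", "attr:= value", "attr:-"
        counts[value.split(":", 1)[0].lower()] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py accesslog.ldif
    with open(sys.argv[1], encoding="utf-8") as fh:
        for attr, n in count_modified_attrs(fh.read()).most_common(10):
            print(f"{n:8d} {attr}")
```

Operational attributes such as entryCSN and modifyTimestamp show up on
practically every write, so look past those for the attribute that
dominates the rest.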

> When I last had a situation like this the servers were not in
> production and I shut both down, copied the database over and started
> them up again. This is not an option now as they are in production and
> needed for login. Also, preventing updates for longer than a few
> minutes is not an option, and even this has to be announced ahead of
> time. When I last tried adding a brand new server cluster configured
> in a similar way in testing, it took multiple hours to get the new
> server up to speed. I fear that removing one of the two servers for
> multiple hours would overwhelm the remaining server with requests. In
> the future the plan is to have at least 3 servers in this replication
> but currently we only have two. It is however an option to prepare new
> server(s) and add them to the replication if that might help somehow.
> 
> One other piece of information: the accesslog database is currently
> around 2 GB in size.

Good. Just as Ulrich mentioned, set up monitoring to make sure it
*never* becomes full. Missed accesslog writes cause hard-to-recover
synchronisation errors.

Regards,

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
