RE: Reduced performance after restart because of replication inconsistency

Bergmann, Clemens Mon, 18 Aug 2025 06:51:24 -0700

Hi Ondřej,

thanks for the tips.
Just to make sure I am not misunderstanding something fundamental:
The accesslog database has the syncprov overlay configured so that the other 
servers can access it, but it has no olcSyncrepl attribute. It is referenced as 
'logbase' in the olcSyncrepl attribute of the main database. I understood that 
the accesslog database is a local database that helps with synchronization and 
is therefore not itself synchronized. Is this correct?


You can see my "full" (minus credentials) config at [1] in case I left out 
some relevant information.

[1] https://next.hessenbox.de/index.php/s/jFX9gAEWXoqoxNS

Kind regards
Clemens (Bergmann)

-- 
Clemens Bergmann
[er/ihm; he/him]
Gruppe Nutzermanagement und Entwicklung
Technische Universität Darmstadt
Hochschulrechenzentrum, Alexanderstraße 2, 64283 Darmstadt
Tel. +49 6151 16 71184
http://www.hrz.tu-darmstadt.de/

> -----Original Message-----
> From: Ondřej Kuzník <on...@mistotebe.net>
> Sent: Monday, 18 August 2025 11:59
> To: Bergmann, Clemens <clemens.bergm...@tu-darmstadt.de>
> Cc: openldap-technical@openldap.org
> Subject: Re: Reduced performance after restart because of replication
> inconsistency
> 
> On Thu, Aug 14, 2025 at 11:32:45AM +0000, Bergmann, Clemens wrote:
> > Hi,
> >
> > for some weeks we have been operating an OpenLDAP deployment in N-Way
> > Multi-Provider delta-syncrepl replication. There are around 130000
> > entries in the main database, and it is around 635 MB in size.
> > Currently the replication setup consists of two VMs, each with 4 GB of
> > RAM and 2 CPUs. The VMs are running slapd 2.6.10. I posted the
> > configuration at [1]. I only removed the credentials and some
> > user-specific ACLs.
> >
> > This setup worked flawlessly for some time until both servers rebooted
> > during a scheduled patch cycle. Since then, we have seen drastically
> > increased response times and CPU utilization on both VMs. On one of
> > the servers (ldap08) I see the following log entry every few seconds:
> >   do_syncrep2: rid=910 (4096) Content Sync Refresh Required
> 
> Hi Clemens,
> check your contextCSNs and make sure they are all in sync (the DBs *and*
> their corresponding accesslog as well). If your servers (through
> misconfiguration or otherwise) missed a contextCSN update to accesslog
> and that CSN never got updated afterwards, deltasync will have to keep
> falling back to plain syncrepl (client has a cookie indicating "future"
> data).
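
A minimal sketch of the check described above, assuming the contextCSN values of the main database and of its accesslog database have already been read from each database's base entry (for example with ldapsearch) and are passed in as plain strings. The function names and the sample values are hypothetical and were not taken from the servers in this thread. The sketch only flags server IDs whose accesslog CSN is missing or older than the main database's CSN.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a CSN of the form time#count#sid#mod into comparable parts."""
    stamp, count, sid, mod = csn.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    # count, sid and mod are hexadecimal fields
    return when, int(count, 16), int(sid, 16), int(mod, 16)


def by_sid(csns):
    """Map server ID -> raw CSN string for one set of contextCSN values."""
    return {parse_csn(c)[2]: c for c in csns}


def lagging_sids(db_csns, log_csns):
    """Return SIDs whose accesslog CSN is missing or older than the main DB's."""
    db, log = by_sid(db_csns), by_sid(log_csns)
    behind = {}
    for sid, csn in db.items():
        other = log.get(sid)
        # compare (timestamp, change count) only
        if other is None or parse_csn(other)[:2] < parse_csn(csn)[:2]:
            behind[sid] = (csn, other)
    return behind


if __name__ == "__main__":
    # Made-up values for illustration only
    main_db = ["20250814113245.123456Z#000000#001#000000",
               "20250814113240.000001Z#000000#002#000000"]
    accesslog = ["20250814113245.123456Z#000000#001#000000",
                 "20250801090000.000000Z#000000#002#000000"]
    for sid, (db_csn, log_csn) in lagging_sids(main_db, accesslog).items():
        print(f"sid {sid:03x}: main={db_csn} accesslog={log_csn}")
```

Running the same comparison on every server, for every server ID, would show whether one accesslog is stuck on an old CSN while the main databases keep moving.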
> 
> > When I compare the contextCSNs of both servers, they differ a little,
> > but by less than 5 seconds at most. They do, however, change
> > constantly because these servers are used for login and store the last
> > login and intruder detection information, which must be replicated.
> > Most of the other data is static, but there are some changes (changed
> > passwords, name changes) every few minutes or so. For example, a few
> > minutes ago we had the following values:
> 
> First I would examine I/O status of the server: is there congestion? Is
> that on the read/write side? In general, read side congestion suggests
> low RAM, write side congestion suggests configuration issues or platform
> I/O limitations. Also if you're limited on write I/O, your cluster will
> keep struggling however many servers you spin up.
> 
> If it's write I/O, have a go at reading your accesslog and looking
> whether there is a major source of writes. Things like pwdLastSuccess
> can have their granularity limited which often greatly reduces write
> traffic, etc.
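
One rough way to look for a major source of writes, assuming the accesslog entries have been exported as LDIF text (for example with ldapsearch and -o ldif-wrap=no so that lines are not folded). The function name and the sample entries are hypothetical. The sketch counts how often each attribute appears in reqMod values, which have the form "attr:op value"; base64-encoded values ("reqMod:: ...") are skipped here for brevity.

```python
from collections import Counter


def tally_modified_attrs(ldif_text):
    """Count attribute names appearing in reqMod values of accesslog LDIF."""
    counts = Counter()
    prefix = "reqMod: "
    for line in ldif_text.splitlines():
        if line.startswith(prefix):
            # reqMod value looks like "pwdLastSuccess:= 2025...Z"
            attr = line[len(prefix):].split(":", 1)[0]
            counts[attr.lower()] += 1
    return counts


if __name__ == "__main__":
    # Made-up excerpt for illustration only
    sample = "\n".join([
        "dn: reqStart=20250814113245.000001Z,cn=accesslog",
        "reqType: modify",
        "reqMod: pwdLastSuccess:= 20250814113245Z",
        "",
        "dn: reqStart=20250814113246.000001Z,cn=accesslog",
        "reqType: modify",
        "reqMod: pwdLastSuccess:= 20250814113246Z",
        "reqMod: modifyTimestamp:= 20250814113246Z",
    ])
    for attr, n in tally_modified_attrs(sample).most_common():
        print(f"{n:6d}  {attr}")
```

If a single attribute such as pwdLastSuccess dominates the tally, limiting its update granularity as suggested above is the likely first thing to try.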
> 
> > When I last had a situation like this, the servers were not in
> > production, so I shut both down, copied the database over and started
> > them up again. This is not an option now as they are in production and
> > needed for login. Also, preventing updates for longer than a few
> > minutes is not an option, and even this has to be announced ahead of
> > time. When I last tried adding a brand-new server cluster configured
> > in a similar way in testing, it took multiple hours to get the new
> > server up to speed. I fear that removing one of the two servers for
> > multiple hours would overwhelm the remaining server with requests. In
> > the future the plan is to have at least 3 servers in this replication
> > setup, but currently we only have two. It is, however, an option to
> > prepare new server(s) and add them to the replication if that might
> > help somehow.
> >
> > One other piece of information: the accesslog database is currently
> > around 2 GB in size.
> 
> Good, just like Ulrich mentioned, set up monitoring to make sure it
> *never* becomes full. Missed accesslog writes cause hard to recover
> synchronisation errors.
> 
> Regards,
> 
> --
> Ondřej Kuzník
> Senior Software Engineer
> Symas Corporation                       http://www.symas.com
> Packaged, certified, and supported LDAP solutions powered by OpenLDAP
