On Thu, Aug 14, 2025 at 11:32:45AM +0000, Bergmann, Clemens wrote:
> Hi,
>
> since some weeks we operate an openldap deployment in N-Way
> Multi-Provider Delta-syncrepl Replication. There are around 130000
> entries in the main database, and it is around 635 MB in size.
> Currently the replication contains two VMs each with 4 GB of RAM and 2
> CPUs. The VMs are running slapd 2.6.10. I posted the configuration on
> [1]. I only removed the credentials and some user specific ACLs.
>
> This setup worked flawlessly for some time until both servers rebooted
> during a scheduled patch cycle. Since then, we see drastically
> increased response times and CPU utilization on both VMs. On one of
> the servers (ldap08) I see the following Log entry every few seconds:
> do_syncrep2: rid=910 (4096) Content Sync Refresh Required

Hi Clemens,
check your contextCSNs and make sure they are all in sync (the DBs *and*
their corresponding accesslog as well). If your servers (through
misconfiguration or otherwise) missed a contextCSN update to accesslog
and that CSN never got updated afterwards, deltasync will have to keep
falling back to plain syncrepl (the client has a cookie indicating
"future" data).

> When I try to compare the contextCSN of both servers they differ a
> little but only less than 5 seconds at max. They however change
> constantly because these servers are used for login and store the last
> login and Intruder detection information which must be replicated.
> Most of the other data is static but there are some changes (changed
> passwords, Name changes) every few minutes or so. For Example a few
> minutes ago we had the following values:

First I would examine the I/O status of the server: is there congestion?
Is it on the read side or the write side? In general, read side
congestion suggests low RAM, write side congestion suggests
configuration issues or platform I/O limitations. Also, if you're
limited on write I/O, your cluster will keep struggling however many
servers you spin up.

If it's write I/O, have a go at reading your accesslog and checking
whether there is a major source of writes. Things like pwdLastSuccess
can have their granularity limited, which often greatly reduces write
traffic, etc.

> When I last had a situation like this the servers were not in
> production and I shut both down, copied the database over and started
> them up again. This is not an option now as they are in production and
> needed for login. Also preventing updates for longer than a few
> minutes is not an option and even this has to be announced ahead of
> time. When I last tried adding a brand new Server cluster configured
> in a similar way in testing it took multiple hours to get the new
> server up to speed. I fear that removing one of the two servers for
> multiple hours would overwhelm the remaining server with requests. In
> the future the plan is to have at least 3 Servers in this replication
> but currently we only have two. It is however an option to prepare new
> server(s) and add them to the replication if that might help somehow.
>
> One other information is that currently the accesslog database is
> around 2 GB of size.

Good, just like Ulrich mentioned, set up monitoring to make sure it
*never* becomes full. Missed accesslog writes cause hard-to-recover
synchronisation errors.

Regards,

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
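
P.S. For the contextCSN check suggested above, a rough sketch of how the comparison could be scripted. It assumes you have already read the contextCSN values from the suffix entry of the main database and of its accesslog (for example with a base-scope ldapsearch), and that each value has the usual four '#'-separated fields with the serverID in the third one. The function names and the sample CSNs are made up for illustration.

```python
from typing import Dict, Iterable, List


def newest_by_sid(csns: Iterable[str]) -> Dict[str, str]:
    """Map each serverID (third '#'-separated field) to its newest CSN."""
    newest: Dict[str, str] = {}
    for raw in csns:
        csn = raw.strip()
        fields = csn.split("#")
        if len(fields) != 4:
            raise ValueError(f"not a CSN: {csn!r}")
        sid = fields[2]
        # Assumption: fields are fixed-width, so plain string comparison
        # orders two CSNs from the same serverID chronologically.
        if sid not in newest or csn > newest[sid]:
            newest[sid] = csn
    return newest


def compare(main: Iterable[str], accesslog: Iterable[str]) -> List[str]:
    """Return one message per serverID whose CSN differs between the two DBs."""
    a, b = newest_by_sid(main), newest_by_sid(accesslog)
    problems = []
    for sid in sorted(set(a) | set(b)):
        if a.get(sid) != b.get(sid):
            problems.append(f"sid {sid}: main={a.get(sid)} accesslog={b.get(sid)}")
    return problems


if __name__ == "__main__":
    # Hypothetical values, not taken from any real server.
    main_db = [
        "20250814113240.000001Z#000000#001#000000",
        "20250814113244.000002Z#000000#002#000000",
    ]
    accesslog_db = [
        "20250814113240.000001Z#000000#001#000000",
        "20250811090000.000000Z#000000#002#000000",
    ]
    for line in compare(main_db, accesslog_db):
        print(line)
```

On a busy system the two reads are never taken at exactly the same instant, so a small, moving difference is expected. What matters is a serverID whose accesslog CSN stays behind across repeated runs.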