Hi OpenLDAP folks,

I ran into an issue with OpenLDAP 2.4.44 and am having trouble finding its
root cause.

I run OpenLDAP in syncrepl mode. I have one machine which serves as the write
endpoint (let’s call it the master node), and many machines which sync from it
and serve as read-replicas.

To ensure that they are in sync with the master, each read-replica runs
ldapsearch against the master node every minute. It looks at the entryCSN
values for a set of objects on the master and compares them against the
entryCSNs of its own copies of these objects. The search covers many different
objects and takes about 3 seconds in total per read-replica (I have duration
logging on LDAP operations enabled by merging in this patch:
http://www.openldap.org/its/index.cgi/Software%20Enhancements?id=8054;page=9).
About 20 MB are transferred to each read-replica each time it runs this script.

NOTE: I prefer not to use the contextCSN for this sync because I only care 
about certain objects of the database being in-sync, and I need to know 
specifically which objects are in-sync vs out-of-sync.
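
The per-object check described above can be sketched roughly as follows. This
is a simplified illustration rather than the actual script: it assumes the
entryCSN values have already been fetched (e.g. via ldapsearch) into mappings
keyed by DN, and the function name and sample DNs are hypothetical.

```python
def out_of_sync(master_csns, replica_csns):
    """Return the DNs whose entryCSN on the replica differs from the master.

    Both arguments map DN -> entryCSN string (hypothetical inputs, for
    example parsed from ldapsearch output). A DN that is missing on the
    replica is reported as out of sync.
    """
    return sorted(
        dn for dn, csn in master_csns.items()
        if replica_csns.get(dn) != csn
    )


# Illustrative sample data; these DNs and CSNs are made up.
master = {
    "uid=alice,ou=people,dc=example,dc=com":
        "20190101120000.000000Z#000000#001#000000",
    "uid=bob,ou=people,dc=example,dc=com":
        "20190102090000.000000Z#000000#001#000000",
}
replica = {
    "uid=alice,ou=people,dc=example,dc=com":
        "20190101120000.000000Z#000000#001#000000",
    "uid=bob,ou=people,dc=example,dc=com":
        "20190101080000.000000Z#000000#001#000000",
}

print(out_of_sync(master, replica))
```

With the sample data above, only the second entry is reported, which is the
per-object detail that a single contextCSN comparison would not give.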

I then doubled the number of times this script runs per read-replica, so that
instead of running the script once per minute, each read-replica was running
it twice per minute.

Shortly thereafter, I started getting reports from someone who writes to the
LDAP master regularly that they were seeing a high number of write operations
failing with timeouts and “Connection Refused” errors. I reduced the frequency
of the script back to once per minute, and the writer reported that they were
no longer seeing these errors.

I assumed that this Connection Refused error was due to the fact that OpenLDAP
2.4 uses a single thread for incoming connections (sources:
https://lwn.net/Articles/755207/,
https://www.openldap.org/pub/slim/OpenLDAP_Conn_Mgmt.pdf (section 3)), and
that the pending connection backlog on the socket was too high, so that the
syscall was returning Connection Refused. This may be similar to the frontend
contention issue described in this post:
http://www.openldap.org/lists/openldap-devel/201308/msg00003.html.

I noticed that the values for cn=Backload,cn=Threads,cn=Monitor as well as
cn=Pending,cn=Threads,cn=Monitor got very high when the read-replicas were
running the script twice as often. For example, Pending usually sits around
5-6, but during the period of high read traffic I saw the Pending count
increase by over 1000 times (my graph looks very spiky, with Pending threads
shooting up to 1000x, then down to 10x or 100x the next minute, then back up,
etc.). I understand that cn=Backload is simply Active + Pending threads, and
interestingly Active threads stayed at normal levels. I am wondering what
Pending threads means exactly, and how Pending threads differ from Read/Write
Waiters. (Interestingly, Read/Write Waiters also stayed at normal levels.)

I attempted to reproduce this issue by running the script concurrently from a
few different clients; however, I was unable to get the Pending/Backload
threads up to similar levels (the value hovered around 16, which seems healthy,
and I did not see it spike to anything like the levels above). I observed that
the latency of the master from the read-replica’s perspective increased quite a
bit during this test, but I was unable to observe any Connection Refused
errors.

Is my assumption about the cause of this issue (single thread for incoming
connections) on the right track? Is this behavior (high Pending/Backload
threads, Connection Refused errors) a known occurrence? Are there any other
metrics I can observe which would indicate the cause of the Connection Refused
errors? Is there a reliable way to repro this issue (without doubling the
frequency of the read-replica script)?
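
One way such a repro attempt could be structured is sketched below. It is only
an assumption about what might be useful, not a known-good reproducer: it opens
many TCP connections concurrently and counts how each attempt ends, without
speaking LDAP at all. The function names, the attempt and worker counts, and
the timeout are all illustrative.

```python
import socket
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def try_connect(host, port, timeout):
    """Open one TCP connection, close it, and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "other"


def probe(host, port, attempts=200, workers=50, timeout=2.0):
    """Make `attempts` connection attempts spread over `workers` threads.

    Returns a Counter mapping outcome ("ok", "refused", "timeout",
    "other") to the number of attempts that ended that way.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda _: try_connect(host, port, timeout), range(attempts)
        )
        return Counter(results)
```

Calling something like probe("ldap-master.example.com", 389) (a placeholder
hostname) while the read-replica script is running would show whether refused
or timed-out connects appear at the TCP level, independent of LDAP operation
latency.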

NOTE: I have the following settings configured, which I suspect may be relevant:
olcConcurrency: 0
olcConnMaxPending: 100
olcConnMaxPendingAuth: 1000
olcGentleHUP: FALSE
olcIdleTimeout: 60
olcIndexSubstrIfMaxLen: 4
olcIndexSubstrIfMinLen: 2
olcIndexSubstrAnyLen: 4
olcIndexSubstrAnyStep: 2
olcIndexIntLen: 4
olcListenerThreads: 1
olcLocalSSF: 71
olcLogLevel: Stats
olcLogLevel: Sync
olcSizeLimit: unlimited
olcSockbufMaxIncoming: 262143
olcSockbufMaxIncomingAuth: 16777215
olcThreads: 16
olcToolThreads: 1
olcWriteTimeout: 0

Thanks,
