Hi,

I'm experiencing an issue between my 3 providers and multiple consumer setup 
and delta sync repl. We manage a primary, or active, provider and send all 
writes to the primary as long as it's healthy letting the two others replicate 
and be standby providers ready to take over in the event of a failure. All 
consumers replicate from all providers and all providers replicate from all 
providers. After the system was running healthily for over a week a standby 
provider was restarted. This caused my consumers to re-establish the persistent 
sync connection. Upon re-establishing the connection, some consumers began a 
sync refresh with the following message.

Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 
starting refresh (sending 
cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 
LDAP_RES_SEARCH_RESULT
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 
delta-sync lost sync, switching to REFRESH
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) 
Content Sync Refresh Required

This was re-establishing a connection with rid=003 which is 
"20210412175600.595586Z#000000#835#000000" (a standby system)
however we have only been sending writes to server #44d# (the primary 
provider). We see 44d CSN is over 7 days old, beyond our providers access log 
period. 
On  the consumer that did not trigger sync refresh we see 

Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 
starting refresh (sending 
cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Here we see 20210621214400.407867Z#000000#44d#000000 is much more recent and 
did not trigger a full resync, although it is close to the 7 day threshold at 
this point. We notice the rid=003 835 csn is the same as the consumer 
experiencing the problem which makes me believe the #44d# csn being old is what 
causes this sync refresh.

I am concerned why when the standby provider is restarted the connection is 
getting re-established with old provider CSNs, when I search the CSNs on the 
consumers they look newer than the ones used to reestablish the connection. If 
we restart slapd on the providers after running consumers for 7 days it seems 
like it will trigger a sync refresh. How can we make the consumers re-establish 
the connection with the most recent CSN? Replication is working as expected, 
just the CSNs seem to remain old in this connection message. The sync refresh 
behavior causes a large load on the consumers and providers spiking bind times 
and degrading service making this concerning for our production environment.

Configuration details are below.
Your insights are appreciated.

@(#) $OpenLDAP: slapd 2.4.56 (Dec 14 2020 17:31:23) $
        
@57af0b7ce0ba:/root/rpmbuild/BUILD/openldap-2.4.56/openldap-2.4.56/servers/slapd
Provider config
Primary DB

# {2}mdb, config
dn: olcDatabase={2}mdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: {2}mdb
olcDbDirectory: /var/lib/ldap
...
olcSizeLimit: unlimited
olcSyncrepl: {0}rid=1   provider=ldap://10.20.0.4   bindmethod=simple   binddn
 ="cn=replicant,dc=example,dc=com"   credentials=""   
searchbase="dc=example,dc=com"   schemachecking=on   type=refresh
 AndPersist   retry="5 10 60 +"   sizelimit=unlimited   timelimit=unlimited   
 starttls=critical   tls_reqcert=demand   logbase="cn=accesslog"   logfilter="
 (&(objectClass=auditWriteObject)(reqResult=0))"   syncdata=accesslog
olcSyncrepl: {1}rid=2   provider=ldap://10.20.1.151   bindmethod=simple   bind
 dn="cn=replicant,dc=example,dc=com"   credentials=""   
searchbase="dc=example,dc=com"   schemachecking=on   type=refre
 shAndPersist   retry="5 10 60 +"   sizelimit=unlimited   timelimit=unlimited 
   starttls=critical   tls_reqcert=demand   logbase="cn=accesslog"   logfilter
 ="(&(objectClass=auditWriteObject)(reqResult=0))"   syncdata=accesslog
olcSyncrepl: {2}rid=3   provider=ldap://10.20.2.240   bindmethod=simple   bind
 dn="cn=replicant,dc=example,dc=com"   credentials=""   
searchbase="dc=example,dc=com"   schemachecking=on   type=refre
 shAndPersist   retry="5 10 60 +"   sizelimit=unlimited   timelimit=unlimited 
   starttls=critical   tls_reqcert=demand   logbase="cn=accesslog"   logfilter
 ="(&(objectClass=auditWriteObject)(reqResult=0))"   syncdata=accesslog
olcMirrorMode: TRUE
...
olcDbMaxSize: 15000000000
Sync Prov
# {1}syncprov, {2}mdb, config
dn: olcOverlay={1}syncprov,olcDatabase={2}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {1}syncprov
olcSpCheckpoint: 1 1
Access Log
# {3}accesslog, {2}mdb, config
dn: olcOverlay={3}accesslog,olcDatabase={2}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: {3}accesslog
olcAccessLogDB: cn=accesslog
olcAccessLogOps: writes
olcAccessLogPurge: 07+00:00 01+00:00
olcAccessLogSuccess: TRUE
# {3}mdb, config
dn: olcDatabase={3}mdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: {3}mdb
olcDbDirectory: /var/lib/ldap/accesslog
olcSuffix: cn=accesslog
olcRootDN: cn=admin,dc=example,dc=com
olcDbIndex: default eq
olcDbIndex: entryCSN,objectClass,reqEnd,reqResult,reqStart
olcDbMaxSize: 15000000000
# {0}syncprov, {3}mdb, config
dn: olcOverlay={0}syncprov,olcDatabase={3}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {0}syncprov
olcSpNoPresent: TRUE
olcSpReloadHint: TRUE

Consumer Config:
# {2}hdb, config
dn: olcDatabase={2}hdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcHdbConfig
olcDatabase: {2}hdb
olcDbDirectory: /var/lib/ldap
olcSuffix: dc=example,dc=com
olcRootDN: cn=admin,dc=example,dc=com
olcRootPW: {SSHA}
olcSyncrepl: {0}rid=1   provider=ldap://10.20.0.4   bindmethod=simple   binddn
 ="cn=replicant,dc=example,dc=com"   credentials=""   
searchbase="dc=example,dc=com"   schemachecking=on   type=refresh
 AndPersist   retry="60 +"   sizelimit=unlimited   timelimit=unlimited   start
 tls=critical   tls_reqcert=demand   logbase="cn=accesslog"   logfilter="(&(ob
 jectClass=auditWriteObject)(reqResult=0))"   syncdata=accesslog
olcSyncrepl: {1}rid=2   provider=ldap://10.20.1.151   bindmethod=simple   bind
 dn="cn=replicant,dc=example,dc=com"   credentials=""   
searchbase="dc=example,dc=com"   schemachecking=on   type=refre
 shAndPersist   retry="60 +"   sizelimit=unlimited   timelimit=unlimited   sta
 rttls=critical   tls_reqcert=demand   logbase="cn=accesslog"   logfilter="(&(
 objectClass=auditWriteObject)(reqResult=0))"   syncdata=accesslog
olcSyncrepl: {2}rid=3   provider=ldap://10.20.2.240   bindmethod=simple   bind
 dn="cn=replicant,dc=example,dc=com"   credentials=""   
searchbase="dc=example,dc=com"   schemachecking=on   type=refre
 shAndPersist   retry="60 +"   sizelimit=unlimited   timelimit=unlimited   sta
 rttls=critical   tls_reqcert=demand   logbase="cn=accesslog"   logfilter="(&(
 objectClass=auditWriteObject)(reqResult=0))"   syncdata=accesslog
olcDbCheckpoint: 102400 10

Reply via email to