Re: [Freeipa-users] Replication has stopped and server errors

sipazzo Fri, 13 Jan 2017 05:08:11 -0800

I am happy to report this appears to be resolved. I found this post:
https://www.redhat.com/archives/freeipa-users/2014-February/msg00007.html
It pointed me to the CSN skew issue, which was causing all my replication
failures. I performed the steps in the post and things look much better so far.
Thank you.

      From: sipazzo <[email protected]>
 To: Martin Basti <[email protected]>; Freeipa-users <[email protected]> 
 Sent: Friday, January 6, 2017 1:03 PM
 Subject: Re: [Freeipa-users] Replication has stopped and server errors

I have changed the number of db locks to 40000. After restart, each server
reports many errors of this type:
DSRetroclPlugin - delete_changerecord: could not delete change record 6038434

Each server also immediately logs these errors (even after re-initializing):

[06/Jan/2017:12:10:12 -0800] NSMMReplicationPlugin - changelog program - 
agmt="cn=meToipa1-dev.example.local" (ipa1-corp:389): CSN 586d8aab000400110000 
not found, we aren't as up to date, or we purged
[06/Jan/2017:12:10:12 -0800] NSMMReplicationPlugin - 
agmt="cn=meToipa1-corp.example.local" (ipa1-dev:389): Data required to update 
replica has been purged. The replica must be reinitialized.
[06/Jan/2017:12:10:12 -0800] NSMMReplicationPlugin - 
agmt="cn=meToipa1-prod.example.local" (ipa1-xo:389): Incremental update failed 
and requires administrator action
[06/Jan/2017:12:10:12 -0800] NSMMReplicationPlugin - 
agmt="cn=meToipa1-dev.example.local" (ipa1-corp:389): Incremental update failed 
and requires administrator action
[06/Jan/2017:12:15:47 -0800] agmt="cn=meToipa1-dr.example.local" (ipa1-io:389) - 
Can't locate CSN 586ffaf5000300100000 in the changelog (DB rc=-30988). If 
replication stops, the consumer may need to be reinitialized.
[06/Jan/2017:12:15:49 -0800] agmt="cn=meToipa1-dr.example.local" (ipa1-io:389) 
- Can't locate CSN 586ffaf7000000100000 in the changelog (DB rc=-30988). If 
replication stops, the consumer may need to be reinitialized.
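
Each CSN in these messages is a 20-character hex string. In 389 Directory Server it is commonly described as an 8-digit timestamp (seconds since the epoch), a 4-digit sequence number, a 4-digit replica ID, and a 4-digit sub-sequence number. A minimal Python sketch, assuming that layout (the `decode_csn` helper is hypothetical and not part of any IPA tool), shows how the CSNs quoted above can be read:

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime  # when the change was generated (UTC)
    seq: int             # sequence number within that second
    replica_id: int      # ID of the replica that originated the change
    subseq: int          # sub-sequence number


def decode_csn(csn: str) -> CSN:
    """Split a 20-hex-digit CSN into its four fields (assumed layout)."""
    if len(csn) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(csn)}")
    seconds = int(csn[0:8], 16)
    return CSN(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seq=int(csn[8:12], 16),
        replica_id=int(csn[12:16], 16),
        subseq=int(csn[16:20], 16),
    )


if __name__ == "__main__":
    # CSNs copied from the log lines above.
    for raw in ("586d8aab000400110000", "586ffaf5000300100000"):
        print(raw, decode_csn(raw))
```

Comparing the decoded timestamps with the wall-clock times in the log lines may hint at whether a replica's CSN generator has drifted, though this is only a rough check.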

Replication topology is:
3 geographic locations, each with 2 ipa servers (dr, prod, dev)
ipa1-dev replicates with all servers (ipa2-dev, ipa1-dr, ipa2-dr, ipa1-prod,
ipa2-prod)
ipa1-dr also replicates with ipa2-dr
ipa1-prod also replicates with ipa2-prod
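
The topology described above can be written down as a small graph to see how much depends on ipa1-dev. The following Python sketch transcribes the agreements from the description (the `EDGES` list and the `components` helper are illustrative names, not IPA tooling) and checks which servers stay mutually reachable when one server is removed:

```python
from collections import deque

# Replication agreements as described in the message (treated as undirected).
EDGES = [
    ("ipa1-dev", "ipa2-dev"),
    ("ipa1-dev", "ipa1-dr"),
    ("ipa1-dev", "ipa2-dr"),
    ("ipa1-dev", "ipa1-prod"),
    ("ipa1-dev", "ipa2-prod"),
    ("ipa1-dr", "ipa2-dr"),
    ("ipa1-prod", "ipa2-prod"),
]


def components(edges, removed=None):
    """Return the connected groups of servers after dropping `removed`."""
    graph = {}
    for a, b in edges:
        if removed in (a, b):
            # Keep the surviving endpoint as a node, but drop the link.
            for name in (a, b):
                if name != removed:
                    graph.setdefault(name, set())
            continue
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Breadth-first search from each unvisited server.
        queue, group = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.append(node)
            for nxt in sorted(graph[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        result.append(sorted(group))
    return result


if __name__ == "__main__":
    print(components(EDGES))
    print(components(EDGES, removed="ipa1-dev"))
```

With this edge list, removing ipa1-dev leaves three disconnected groups, while removing any other single server leaves the remaining five connected through ipa1-dev.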

As a test I deleted one host on each of the servers. I have waited 30 minutes
and the results are:
ipa1-dev - deletion replicated to all servers
ipa2-dr - deletion replicated to all servers
ipa1-dr, ipa1-prod, ipa2-dev, ipa2-prod - deletions not replicated

      From: Martin Basti <[email protected]>
 To: sipazzo <[email protected]>; Freeipa-users <[email protected]> 
 Sent: Friday, January 6, 2017 8:58 AM
 Subject: Re: [Freeipa-users] Replication has stopped and server errors

 On 06.01.2017 00:29, sipazzo wrote:

  I have 6 ipa servers in 3 locations running 4.2.0-15.0.1 on RHEL 7. ipa1-dev
is the CA Renewal and CRL Master server and is where most of our updates (host
enrollment, password changes) end up taking place. Servers had been running
fine. Over the holidays we started having some replication issues, and looking
at /var/log/dirsrv/slapd-REALM-COM/errors showed the following:
  All servers currently have these errors for each replica the respective IPA
servers are connected to:
NSMMReplicationPlugin - agmt="cn=meToipa2-dr.example.local" (ipa2-dr:389):
Incremental update failed and requires administrator action
[04/Jan/2017:15:39:48 -0800] agmt="cn=meToipa1-dr.example.local" (ipa1-dr:389)
- Can't locate CSN 583c8e74000600110000 in the changelog (DB rc=-30988). If
replication stops, the consumer may need to be reinitialized
NSMMReplicationPlugin - agmt="cn=meToipa1-prod.example.local" (ipa1-prod:389):
Data required to update replica has been purged. The replica must be
reinitialized.
[04/Jan/2017:13:33:26 -0800] NSMMReplicationPlugin -
agmt="cn=meToipa2-dev.example.local" (ipa2-dev:389): Incremental update failed
and requires administrator action
[04/Jan/2017:13:33:26 -0800] NSMMReplicationPlugin -
agmt="cn=meToipa1-prod.example.local" (ipa1-prod:389): Incremental update
failed and requires administrator action
[04/Jan/2017:13:33:27 -0800] agmt="cn=meToipa2-prod.example.local"
(ipa2-prod:389) - Can't locate CSN 586d69f0000400120000 in the changelog (DB
rc=-30988). If replication stops, the consumer may need to be reinitialized.

  And all servers have these types of errors, which are worrisome, but they go
back quite a way:
NSACLPlugin - The ACL target cn=dns,dc=example,dc=local does not exist
NSACLPlugin - The ACL target cn=dns,dc=example,dc=local does not exist
NSACLPlugin - The ACL target cn=groups,cn=compat,dc=example,dc=local does not
exist
NSACLPlugin - The ACL target cn=computers,cn=compat,dc=example,dc=local does
not exist
NSACLPlugin - The ACL target cn=casigningcert
cert-pki-ca,cn=ca_renewal,cn=ipa,cn=etc,dc=example,dc=local does not exist
NSACLPlugin - The ACL target cn=casigningcert
cert-pki-ca,cn=ca_renewal,cn=ipa,cn=etc,dc=example,dc=local does not exist
NSACLPlugin - The ACL target ou=sudoers,dc=networkfleet,dc=local does not exist

 ^^^ just INFO messages, you can ignore them

  All servers except one have a lot of these:
DSRetroclPlugin - delete_changerecord: could not delete change record

  ipa1-dev only has this:
[04/Jan/2017:18:36:52 -0800] NSMMReplicationPlugin -
agmt="cn=masterAgreement1-ipa1-prod.example.local-pki-tomcat" (ipa1-prod:389):
Replication bind with SIMPLE auth resumed
[04/Jan/2017:18:36:52 -0800] NSMMReplicationPlugin -
agmt="cn=masterAgreement1-ipa2-dr.example.local-pki-tomcat" (ipa2-dr:389):
Replication bind with SIMPLE auth resumed
[04/Jan/2017:18:36:52 -0800] NSMMReplicationPlugin -
agmt="cn=masterAgreement1-ipa1-dr.example.local-pki-tomcat" (ipa1-dr:389):
Replication bind with SIMPLE auth resumed
[04/Jan/2017:18:36:53 -0800] NSMMReplicationPlugin -
agmt="cn=masterAgreement1-ipa2-prod.example.local-pki-tomcat" (ipa2-prod:389):
Replication bind with SIMPLE auth resumed

  3 servers (ipa1-dr, ipa2-dr, ipa2-prod) have these errors:
[01/Jan/2017:14:43:06 -0800] - libdb: BDB2055 Lock table is out of available
lock entries
[01/Jan/2017:14:43:06 -0800] - compactdb: failed to compact changelog; db
error - 12 Cannot allocate memory

 You probably need https://access.redhat.com/solutions/1241063 to increase the
number of locks (or see this thread:
https://lists.fedoraproject.org/pipermail/389-users/2011-June/013299.html).

 I would first increase the number of locks, and then look if something 
improved.
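
In 389 Directory Server the lock count is normally controlled by the `nsslapd-db-locks` attribute on the `cn=config,cn=ldbm database,cn=plugins,cn=config` entry, and a change takes effect after the instance is restarted. As a sketch, assuming that entry and attribute apply to the version in use, the following Python helper (the name `db_locks_ldif` is hypothetical) renders LDIF that could be fed to `ldapmodify`; 40000 is the value reported elsewhere in this thread:

```python
def db_locks_ldif(locks: int) -> str:
    """Render an LDIF modify operation that sets nsslapd-db-locks."""
    if locks <= 0:
        raise ValueError("lock count must be positive")
    return "\n".join([
        # Assumed location of the backend database settings.
        "dn: cn=config,cn=ldbm database,cn=plugins,cn=config",
        "changetype: modify",
        "replace: nsslapd-db-locks",
        f"nsslapd-db-locks: {locks}",
        "",
    ])


if __name__ == "__main__":
    print(db_locks_ldif(40000))
```

The linked solution and thread remain the authoritative description of the procedure; this only illustrates the shape of the change.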
 We also don't know what your topology looks like, i.e. which servers are
connected together.

 Martin

  4 servers (ipa1-dev, ipa2-dev, ipa1-dr and ipa2-dr) have these errors:
[04/Jan/2017:15:37:21 -0800] slapd_ldap_sasl_interactive_bind - Error: could
not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't
contact LDAP server) ((null)) errno 107 (Transport endpoint is not connected)
[04/Jan/2017:15:37:24 -0800] slapd_ldap_sasl_interactive_bind - Error: could
not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't
contact LDAP server) ((null)) errno 107 (Transport endpoint is not connected)

  I have tried various combinations of restarting, re-initializing,
disconnecting and reconnecting replicas, but am down to only two servers
replicating with each other currently (ipa1-dev and ipa2-dev). We did have a
power outage at the dev location, but it does not seem to correspond to when
the errors started. I am not sure how to recover from this. Any help is
appreciated.
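
With this many messages spread over six servers, a tally per replication agreement can help show which links are failing. The Python sketch below assumes the errors-log line shapes quoted in this thread (an optional bracketed timestamp, an `agmt="..."` name, the peer in parentheses, then the message). The `summarize` function and the regular expression are illustrative, not part of any IPA tooling:

```python
import re
from collections import Counter

# Matches: agmt="cn=<name>" (<host>:<port>) then ':' or '-' and the message.
AGMT_RE = re.compile(
    r'agmt="cn=(?P<agmt>[^"]+)"\s+\((?P<peer>[^)]+)\)\s*[:-]\s*(?P<msg>.*)'
)


def summarize(lines):
    """Count (agreement, message kind) pairs found in errors-log lines."""
    counts = Counter()
    for line in lines:
        match = AGMT_RE.search(line)
        if not match:
            continue  # not a replication-agreement message
        msg = match.group("msg")
        if "Incremental update failed" in msg:
            kind = "incremental update failed"
        elif "Can't locate CSN" in msg:
            kind = "CSN not in changelog"
        elif "has been purged" in msg:
            kind = "data purged"
        else:
            kind = "other"
        counts[(match.group("agmt"), kind)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < /path/to/errors
    for (agmt, kind), n in sorted(summarize(sys.stdin).items()):
        print(f"{n:6d}  {agmt}  {kind}")
```

Log entries must each be on a single line for this to work, as they are in the actual errors file (the wrapping in this email is an artifact of the mail client).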

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project
