On Fri, Jun 19, 2015 at 09:08:15PM -0700, Janelle wrote:
> On 6/19/15 11:22 AM, Andrew E. Bruno wrote:
> >Hello,
> >
> >First time troubleshooting an IPA server failure, and looking for some
> >guidance on how best to proceed.
> >
> >First some background on our setup:
> >
> >Servers are running freeipa v4.1.0 on CentOS 7.1.1503:
> >
> >- ipa-server-4.1.0-18.el7.centos.3.x86_64
> >- 389-ds-base-1.3.3.1-16.el7_1.x86_64
> >
> >3 IPA servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
> >replicas were set up as CAs (i.e. ipa-replica-install --setup-ca ...).
> >
> >We have ~3000 user accounts (~1000 active, the rest disabled). We have
> >~700 hosts enrolled (all installed using ipa-client-install and running
> >sssd). Host clients are a mix of CentOS 7 and CentOS 6.5.
> >
> >
> >We recently discovered one of our replica servers (rep2) was not
> >responding. A quick check of the dirsrv logs in
> >/var/log/dirsrv/slapd-XXXX/errors (sanitized) showed:
> >
> >     PR_Accept() failed, Netscape Portable Runtime error (Process open
> >     FD table is full.)
> >     ...
> >
> >The server was rebooted and after coming back up had these errors in the 
> >logs:
> >
> >         389-Directory/1.3.3.1 B2015.118.1941
> >         replica2:636 (/etc/dirsrv/slapd-XXXX)
> >
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
> >detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 
> >(BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
> >detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
> >(aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run 
> >database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
> >detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
> >(aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run 
> >database recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
> >detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint 
> >database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
> >recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
> >detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint 
> >database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
> >recovery)
> >[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
> >detected; run recovery
> >[16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - 
> >BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)
> >....
> >[16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting 
> >up
> >[16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time 
> >Directory Server was running, recovering database.
> >...
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
> >replica_check_for_data_reload: Warning: disordely shutdown for replica 
> >dc=XXX. Check if DB RUV needs to be updated
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
> >database RUV (from CL RUV) ->  55770068000300030000
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
> >database RUV (from CL RUV) ->  556f4632001400040000
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
> >database RUV (from CL RUV) ->  556f4631004d00050000
> >[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send 
> >startTLS request: error -1 (Can't contact LDAP server) errno 111 (Connection 
> >refused)
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
> >agmt="cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth 
> >failed: LDAP error -1 (Can't contact LDAP server) ()
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
> >replica_check_for_data_reload: Warning: disordely shutdown for replica 
> >o=ipaca. Check if DB RUV needs to be updated
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
> >database RUV (from CL RUV) ->  556f46290005005b0000
> >[16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial 
> >credentials for principal [ldap/rep2] in keytab 
> >[FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for 
> >requested realm)
> >[16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
> >not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't 
> >contact LDAP server) ((null)) errno 111 (Connection refused)
> >[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform 
> >interactive bind for id [] authentication mechanism [GSSAPI]: error -1 
> >(Can't contact LDAP server)
> >[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" 
> >(rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't 
> >contact LDAP server) ()
> >[16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password 
> >Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be added 
> >before the CoS Definition.
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301438 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301439 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301440 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301441 (rc: 32)
> >....
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301443 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301444 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] - slapd started.  Listening on All Interfaces 
> >port 389 for LDAP requests
> >[16/Jun/2015:16:24:15 -0400] - Listening on All Interfaces port 636 for 
> >LDAPS requests
> >[16/Jun/2015:16:24:15 -0400] - Listening on 
> >/var/run/slapd-CCR-BUFFALO-EDU.socket for LDAPI requests
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301445 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301446 (rc: 32)
> >[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 301447 (rc: 32)
> >...
> >[16/Jun/2015:16:24:24 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 336362 (rc: 32)
> >[16/Jun/2015:16:24:24 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
> >not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't 
> >contact LDAP server) ((null)) errno 107 (Transport endpoint is not connected)
> >[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not perform 
> >interactive bind for id [] authentication mechanism [GSSAPI]: error -1 
> >(Can't contact LDAP server)
> >[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not send 
> >startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport 
> >endpoint is not connected)
> >[16/Jun/2015:16:24:24 -0400] NSMMReplicationPlugin - 
> >agmt="cn=cloneAgreement1-rep2-pki-tomcat" (rep1:389): Replication bind with 
> >SIMPLE auth resumed
> >[16/Jun/2015:16:24:25 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" 
> >(rep1:389): Replication bind with GSSAPI auth resumed
> >[16/Jun/2015:16:27:28 -0400] - Operation error fetching Null DN 
> >(0ce19ce4-146611e5-8135a170-bd40e05c), error -30993.
> >[16/Jun/2015:16:27:28 -0400] - dn2entry_ext: Failed to get id for 
> >changenumber=336746,cn=changelog from entryrdn index (-30993)
> >[16/Jun/2015:16:27:28 -0400] - Operation error fetching 
> >changenumber=336746,cn=changelog (null), error -30993.
> >[16/Jun/2015:16:27:28 -0400] DSRetroclPlugin - replog: an error occured 
> >while adding change number 336746, dn = changenumber=336746,cn=changelog: 
> >Operations error.
> >[16/Jun/2015:16:27:28 -0400] retrocl-plugin - retrocl_postob: operation 
> >failure [1]
> >..
> >[16/Jun/2015:16:27:36 -0400] - Operation error fetching 
> >changenumber=336762,cn=changelog (null), error -30993.
> >[16/Jun/2015:16:27:36 -0400] DSRetroclPlugin - replog: an error occured 
> >while adding change number 336762, dn = changenumber=336762,cn=changelog: 
> >Operations error.
> >[16/Jun/2015:16:27:36 -0400] retrocl-plugin - retrocl_postob: operation 
> >failure [1]
> >[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 
> >BDB0062 Successful return: 0
> >[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 
> >BDB0062 Successful return: 0
> >[17/Jun/2015:13:41:23 -0400] - slapd shutting down - signaling operation 
> >threads - op stack size 0 max work q size 3025 max work q stack size 5
> >[17/Jun/2015:13:41:23 -0400] - slapd shutting down - waiting for 30 threads 
> >to terminate
> >
> >
> >Assuming we had a corrupted database, we first attempted to remove the
> >replica by logging into the first master and running:
> >
> >     # ipa-replica-manage del rep2
> >
> >This process hung indefinitely, so we proceeded to shut down all ipa
> >services on rep2 (systemctl stop ipa), then re-ran the command on the
> >first master:
> >
> >     # ipa-replica-manage -v --force del rep2
> >
> >This appeared to work, and rep2 appears to have been deleted:
> >
> >     # ipa-replica-manage list
> >     rep3: master
> >     rep1: master
> >
> >However, when querying LDAP for nsDS5ReplicationAgreement entries, we
> >still see rep2 with replica id 97 for the ipaca suffix:
> >
> ># ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" 
> >objectClass=nsDS5ReplicationAgreement -LL
> >
> >dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping 
> >tree,cn=config
> >objectClass: top
> >objectClass: nsds5replicationagreement
> >cn: masterAgreement1-rep3-pki-tomcat
> >nsDS5ReplicaRoot: o=ipaca
> >nsDS5ReplicaHost: rep3
> >nsDS5ReplicaPort: 389
> >nsDS5ReplicaBindDN: cn=Replication Manager 
> >cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
> >nsDS5ReplicaBindMethod: Simple
> >nsDS5ReplicaTransportInfo: TLS
> >description: masterAgreement1-rep3-pki-tomcat
> >..
> >nsds50ruv: {replicageneration} 5527f74b000000600000
> >nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b
> >  0000 5582c7e40004005b0000
> >nsds50ruv: {replica 96 ldap://rep1:389} 5527f75400000060
> >  0000 5582cd19000000600000
> >nsds50ruv: {replica 97 ldap://rep2:389} 5527f76000000061
> >  0000 556f462b000400610000
> >nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
> >  0000000
> >nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
> >  0000000
> >nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
> >  0000000
> >nsds5replicaLastUpdateStart: 20150619173149Z
> >nsds5replicaLastUpdateEnd: 20150619173149Z
> >nsds5replicaChangesSentSinceStartup:: OTY6MTI0LzAg
> >nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental 
> >upd
> >  ate succeeded
> >nsds5replicaUpdateInProgress: FALSE
> >nsds5replicaLastInitStart: 0
> >nsds5replicaLastInitEnd: 0
> >
> >
> >Questions:
> >
> >0. Is it likely that after running out of file descriptors the dirsrv
> >slapd database on rep2 was corrupted?
> I have increased FDs to 32768, although I have a somewhat larger environment
> and the bulk of logins occur via LDAP, which uses up more FDs. From
> looking at your numbers, I don't think 16K would be unreasonable. Remember
> to set /etc/sysconfig/dirsrv.systemd as well as the ldapmodify settings and
> finally /etc/security/limits.conf.

Excellent, thanks for the tip. 
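
For the archives, here's what I'm planning to change based on your list.
This is just a sketch: the 16384 value follows your 16K suggestion, and
nsslapd-maxdescriptors is my reading of which cn=config attribute to bump,
so please correct me if that's the wrong one.

    # 1. systemd limit in /etc/sysconfig/dirsrv.systemd:
    [Service]
    LimitNOFILE=16384

    # 2. maxfd.ldif -- the cn=config setting, applied with:
    #    ldapmodify -D "cn=Directory Manager" -W -f maxfd.ldif
    dn: cn=config
    changetype: modify
    replace: nsslapd-maxdescriptors
    nsslapd-maxdescriptors: 16384

    # 3. PAM limits in /etc/security/limits.conf:
    *    soft    nofile    16384
    *    hard    nofile    16384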

Regarding the ldapmodify settings, is it safe to run "systemctl restart
dirsrv@DOMAIN", or should "systemctl restart ipa" be run? That is, can you
restart dirsrv without restarting the other ipa services?

Also, curious if you updated nsslapd-reservedescriptors as well?
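
(For my own notes, this is how I'm reading the current values -- binding as
Directory Manager since cn=config isn't world-readable; adjust to taste:)

    ldapsearch -D "cn=Directory Manager" -W -b cn=config -s base \
        "(objectclass=*)" nsslapd-maxdescriptors nsslapd-reservedescriptors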

> >
> >1. Do we have to run ipa-replica-manage del rep2 on *each* of the
> >remaining replica servers (rep1 and rep3)? Or should it just be run on
> >the first master? Do we need to run ipa-csreplica-manage del as well?
> You mentioned it was a CA -- use ipa-csreplica-manage and run the same
> "delete" options to remove it.

I ran ipa-csreplica-manage del but got an error:

'rep1' has no replication agreement for 'rep2'

It almost seems like the ipaca replication agreement between rep3 and rep2
is orphaned. It still exists in LDAP, but the ipa-csreplica-manage list
command doesn't show it, and I can't seem to delete it using the ipa
replica management commands.

I'm anticipating issues when trying to add my failed replica back in
without resolving this.
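
Unless someone has a better idea, my tentative plan for the orphaned o=ipaca
pieces is below. This is only a sketch based on the 389-ds CLEANALLRUV task
docs -- the replica id 97 and base DN come from the ldapsearch output above,
but I haven't run it yet, so treat it as untested:

    # clean-ipaca-97.ldif -- ask 389-ds to purge replica id 97 from the
    # o=ipaca RUV; load with:
    #   ldapmodify -D "cn=Directory Manager" -W -f clean-ipaca-97.ldif
    dn: cn=clean 97,cn=cleanallruv,cn=tasks,cn=config
    changetype: add
    objectclass: extensibleObject
    cn: clean 97
    replica-base-dn: o=ipaca
    replica-id: 97

    # then poll the task entry until it reports completion:
    ldapsearch -D "cn=Directory Manager" -W \
        -b "cn=clean 97,cn=cleanallruv,cn=tasks,cn=config" nsTaskStatus nsTaskLog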


> >
> >2. Why does the rep2 server still appear when querying the
> >nsDS5ReplicationAgreement in ldap? Is this benign or will this pose problems
> >when we go to add rep2 back in?
> See above
> >3. What steps/commands can we take to verify rep2 was successfully removed 
> >and
> >replication is behaving normally?
> After above - it should be gone

No luck. ipa-csreplica-manage list shows only 2 masters (rep3 and
rep1), but ldapsearch still shows references to the deleted replica
rep2.
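
For the record, these are the searches I'm using to look for leftovers (the
RUV tombstone query is something I picked up from the 389-ds docs, so apply
salt as needed):

    # any agreements still naming rep2 in their RUVs:
    ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
        "(objectClass=nsDS5ReplicationAgreement)" nsDS5ReplicaHost nsds50ruv \
        | grep -i rep2

    # the database RUV tombstone for the CA suffix:
    ldapsearch -D "cn=Directory Manager" -W -b "o=ipaca" \
        "(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))" \
        nsds50ruv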


> >
> >We had tuned our servers according to the RHEL Performance Tuning
> >Guide:
> >
> >    # cat /proc/sys/fs/file-max
> >    6534908
> >
> >    # cat /etc/security/limits.d/40-freeipa.conf
> >    *         soft      nofile      8192
> >    *         hard      nofile      8192
> >
> >    # cat /etc/sysctl.d/40-freeipa.conf
> >    net.ipv4.ip_local_port_range = 1024 65000
> >    net.ipv4.tcp_keepalive_time = 300
> >
> >    # cat /etc/sysconfig/dirsrv.systemd
> >    [Service]
> >    LimitNOFILE=8192
> 
> Do it. Also, have you run lsof against the PID of ns-slapd to see what the
> connections are?

$ lsof -a -p 2868 | wc -l
586

$ cat /proc/sys/fs/file-nr
3232    0   6534908

$ ls -alh /proc/2868/fd | wc -l
466

$ ldapsearch -b "cn=monitor" "(objectclass=*)"
currentconnections: 339
totalconnections: 51585
currentconnectionsatmaxthreads: 0
maxthreadsperconnhits: 485
dtablesize: 8192
readwaiters: 0
opsinitiated: 2309677
opscompleted: 2309676
entriessent: 1526462
bytessent: 24606941647
currenttime: 20150620044615Z
starttime: 20150618141112Z
nbackends: 3
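
(The dtablesize above matches our current 8192 cap, so after raising the
limits I'll keep an eye on FD growth with something like this -- assumes a
single ns-slapd instance on the box:)

    $ watch -n 60 'ls /proc/$(pidof ns-slapd)/fd | wc -l'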


> This is a very simple thing to increase.
> 
> You mention all your logins are via ipa-client-install, which implies
> Kerberos. However, I wonder if this is the case. Check the lsof output, but
> I would also consider running logconv.pl against your access logs in
> /var/log/dirsrv/slapd-INSTANCE to really see what is going on behind the
> scenes.

Thanks again for the tip. I'll check out logconv.pl.
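
For anyone else following along, the basic invocation appears to be just
pointing it at the access logs (INSTANCE per the slapd directory name):

    # logconv.pl /var/log/dirsrv/slapd-INSTANCE/access*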

> 
> ~J
> 

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project
