On Tue, Sep 01, 2015 at 05:03:10PM +0200, Ludwig Krispenz wrote:
> On 09/01/2015 04:39 PM, Andrew E. Bruno wrote:
> >A few months ago we had a replica failure where the system ran out of file
> >descriptors and the slapd database was corrupted:
> >
> >https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html
> >
> >We now monitor file descriptor counts on our replicas and last night we
> >had 2 of our 3 replicas fail and become completely unresponsive. Trying
> >to kinit on the replica resulted in:
> >
> >[user@ipa-master]$ kinit
> >kinit: Generic error (see e-text) while getting initial credentials
> >
> >Snippet from the /var/log/dirsrv/slapd-[domain]/errors:
> >
> >[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
> >[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental protocol: event backoff_timer_expired should not occur in state start_backoff
> >[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >
> >The access logs were filling up with:
> >
> >[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 10.106.14.29 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 10.104.9.137 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 10.104.16.19 to 10.113.14.30
> >[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 10.111.11.30 to 10.113.14.30
> >[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 10.113.27.115 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 10.111.8.116 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 10.113.25.40 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 10.106.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 10.111.10.141 to 10.113.14.30
> >[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 10.104.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 10.106.13.132 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 10.113.25.11 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 10.104.15.11 to 10.113.14.30
> >[31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 10.104.7.136 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 10.113.24.23 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 10.106.12.105 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 10.104.16.41 to 10.113.14.30
> >[31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 10.104.16.4 to 10.113.14.30
> >[31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 10.111.8.12 to 10.113.14.30
> >[31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 10.111.8.17 to 10.113.14.30
> >....
> >
> >Seems like clients were connecting to the replicas but file descriptors were
> >not getting released. Our monitoring showed increasing file descriptor counts
> >on both replicas (the FD counts are normally ~600):
> >
> >DateTime                  | Host       | File Descriptor Count
> >--------------------------------------------------------------------
> >Mon, 31 Aug 2015 17:28:28 | srv-m14-32 | 1394
> >Mon, 31 Aug 2015 17:28:40 | srv-m14-30 | 1192
> >Mon, 31 Aug 2015 18:28:28 | srv-m14-32 | 2478
> >Mon, 31 Aug 2015 18:28:40 | srv-m14-30 | 2212
> >Mon, 31 Aug 2015 19:28:28 | srv-m14-32 | 3305
> >Mon, 31 Aug 2015 19:28:40 | srv-m14-30 | 3058
> >....
> >
> >We can confirm this via logconv.pl:
> >
> >Start of Logs:    31/Aug/2015:14:55:01
> >End of Logs:      31/Aug/2015:16:42:37
> >...
> >FDs Taken:        3140
> >FDs Returned:     3160
> >Highest FD Taken: 603
> >
> >Start of Logs:    31/Aug/2015:16:42:37
> >End of Logs:      31/Aug/2015:20:18:41
> >..
> >FDs Taken:        4562
> >FDs Returned:     1336
> >Highest FD Taken: 3755
> >
> >We suspect something happened around 31/Aug/2015:17:13:00 that caused both
> >replicas to become unresponsive and leak file descriptors. Luckily we caught
> >this before the system ran out of file descriptors. We logged onto each
> >replica and restarted ipa via:
> >
> > # systemctl restart ipa
> >
> >This produced some errors in the logs:
> >
> >[31/Aug/2015:20:11:02 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3623 max work q stack size 496
> >[31/Aug/2015:20:11:02 -0400] - slapd shutting down - waiting for 30 threads to terminate
> >...
> >[31/Aug/2015:20:12:34 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
> >[31/Aug/2015:20:12:34 -0400] - WARNING -- Minimum cache size is 512000 -- rounding up
> >[31/Aug/2015:20:12:34 -0400] - WARNING: changelog: entry cache size 512000B is less than db size 400465920B; We recommend to increase the entry cache size nsslapd-cachememsize.
> >[31/Aug/2015:20:12:34 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> >[31/Aug/2015:20:12:35 -0400] schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> >....
> >[31/Aug/2015:20:12:39 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, which should be added before the CoS Definition.
> >[31/Aug/2015:20:12:39 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
> >[31/Aug/2015:20:13:08 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/0cccfa05-2cb911e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
> >[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
> >[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> >[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/[email protected]] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> >[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/[email protected]] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> >....
> >
> >Followed by almost 1M lines of...
> >
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511742 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511743 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511744 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511745 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511746 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511747 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511748 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511749 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511750 (rc: 32)
> >...
> >
> >The delete_changerecord id went from 511742 to 1471562 in increasing order.
> >Are these normal or a sign of something more serious?
> >
> >After we restarted ipa, both replicas eventually came back up and appear to
> >be operating as normal.
> >
> >Any ideas what could have caused this and where to look for more info? We'd
> >obviously like to prevent this from happening again, as we suspect this is
> >what caused our replica failure back in June as well.
>
> If the server is not responsive and the incoming connections pile up
> consuming all the file descriptors, this could be caused by a deadlock in
> the server or by some clients not reading responses and blocking all other
> threads from progressing.
> If you run into this situation again, could you try to get a pstack from
> the process?
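For reference, capturing that stack dump on a hung replica could look roughly like this (a sketch, not from the thread: pstack is shipped with gdb on RHEL/CentOS, the 389-ds-base debuginfo packages are needed for readable symbols, and the output path is a placeholder):

```shell
# dump_stacks: print a backtrace of every thread in the given PID.
# Prefers pstack and falls back to driving gdb directly.
dump_stacks() {
    if command -v pstack >/dev/null 2>&1; then
        pstack "$1"
    else
        gdb -p "$1" -batch -ex "thread apply all bt" 2>/dev/null
    fi
}

# On the wedged replica (as root), while the hang is still in progress:
#   dump_stacks "$(pidof ns-slapd)" > "/var/tmp/ns-slapd.stacks.$(date +%s)"
```

Taking two or three dumps a minute apart helps distinguish a hard deadlock (identical stacks every time) from threads that are merely slow.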
Yes, we'll be sure to grab a pstack next time. Could this possibly be related
to replication? These errors happened right around the time the servers
crashed:

- Warning: Attempting to release replica, but unable to receive
  endReplication extended operation response from the replica
- Incremental protocol: event backoff_timer_expired should not occur in
  state start_backoff

> About the changelog messages, it looks like after a crash the information
> about where changelog trimming should start is lost: trimming starts at the
> wrong number, can no longer find the records in the retro changelog, and
> logs an error for each changenumber attempted.
> We should investigate this and try to prevent it, or at least not pollute
> the log.
>
> About the "Check if DB RUV ..." message, it is misleading; the server is
> performing the check itself, so the message would better read "Checking
> if ...".

Great, thanks for the info.

> >Should we be concerned about these errors:
> >
> > replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
> > replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> >
> >How do we check if DB RUV needs to be updated?
> >
> >running: ipa-server-4.1.0-18, 389-ds-base-1.3.3.1-16, CentOS 7.1.1503
> >
> >Thanks in advance for any help!
> >
> >Best,
> >
> >--Andrew
>
> --
> Manage your subscription for the Freeipa-users mailing list:
> https://www.redhat.com/mailman/listinfo/freeipa-users
> Go to http://freeipa.org for more info on the project
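On the open "how do we check the DB RUV" question: one common way to inspect the database RUV in 389-ds is to read it from the per-suffix tombstone entry with the well-known all-f nsUniqueId (a sketch; the suffix is this deployment's, and the bind DN and host are placeholders - run it against each replica and compare):

```shell
# The database RUV is stored in a tombstone entry whose nsUniqueId is the
# fixed all-f value; the nsds50ruv attribute lists one line per master
# with its min/max CSNs.
suffix="dc=cbls,dc=ccr,dc=buffalo,dc=edu"
filter='(&(objectclass=nstombstone)(nsUniqueId=ffffffff-ffffffff-ffffffff-ffffffff))'

# Against each replica (prompts for the Directory Manager password):
#   ldapsearch -x -D "cn=directory manager" -W -h srv-m14-30 \
#       -b "$suffix" "$filter" nsds50ruv
```

If the per-master maxcsn values diverge and stay diverged while changes are still flowing, that replica's RUV likely needs attention (for example, re-initializing the replica from a good master).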
