On Tue, Sep 01, 2015 at 05:03:10PM +0200, Ludwig Krispenz wrote:
> On 09/01/2015 04:39 PM, Andrew E. Bruno wrote:
> >A few months ago we had a replica failure where the system ran out of file
> >descriptors and the slapd database was corrupted:
> >
> >https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html
> >
> >We now monitor file descriptor counts on our replicas and last night we
> >had 2 of our 3 replicas fail and become completely unresponsive. Trying
> >to kinit on the replica resulted in:
> >
> >[user@ipa-master]$ kinit
> >kinit: Generic error (see e-text) while getting initial credentials
> >
> >Snippet from the /var/log/dirsrv/slapd-[domain]/errors:
> >
> >[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
> >[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental protocol: event backoff_timer_expired should not occur in state start_backoff
> >[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
> >
> >The access logs were filling up with:
> >
> >[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 10.106.14.29 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 10.104.9.137 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 10.104.16.19 to 10.113.14.30
> >[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 10.111.11.30 to 10.113.14.30
> >[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 10.113.27.115 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 10.111.8.116 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 10.113.25.40 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 10.106.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 10.111.10.141 to 10.113.14.30
> >[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 10.104.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 10.106.13.132 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 10.113.25.11 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 10.104.15.11 to 10.113.14.30
> >[31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 10.104.7.136 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 10.113.24.23 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 10.106.12.105 to 10.113.14.30
> >[31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 10.104.16.41 to 10.113.14.30
> >[31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 10.104.16.4 to 10.113.14.30
> >[31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 10.111.8.12 to 10.113.14.30
> >[31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 10.111.8.17 to 10.113.14.30
> >....
> >
> >Seems like clients were connecting to the replicas but file descriptors were
> >not getting released. Our monitoring showed increasing file descriptor counts
> >on both replicas (the FD counts are normally ~600):
> >
> >DateTime                  | Host       | File Descriptor Count
> >--------------------------------------------------------------------
> >Mon, 31 Aug 2015 17:28:28 | srv-m14-32 | 1394
> >Mon, 31 Aug 2015 17:28:40 | srv-m14-30 | 1192
> >Mon, 31 Aug 2015 18:28:28 | srv-m14-32 | 2478
> >Mon, 31 Aug 2015 18:28:40 | srv-m14-30 | 2212
> >Mon, 31 Aug 2015 19:28:28 | srv-m14-32 | 3305
> >Mon, 31 Aug 2015 19:28:40 | srv-m14-30 | 3058
> >....
> >
> >We can confirm this via logconv.pl:
> >
> >Start of Logs:    31/Aug/2015:14:55:01
> >End of Logs:      31/Aug/2015:16:42:37
> >...
> >FDs Taken:        3140
> >FDs Returned:     3160
> >Highest FD Taken: 603
> >
> >Start of Logs:    31/Aug/2015:16:42:37
> >End of Logs:      31/Aug/2015:20:18:41
> >..
> >FDs Taken:        4562
> >FDs Returned:     1336
> >Highest FD Taken: 3755
> >
> >We suspect something happened around 31/Aug/2015:17:13:00 that caused both
> >replicas to become unresponsive and leak file descriptors. Luckily we caught
> >this before the system ran out of file descriptors. We logged onto each
> >replica and restarted ipa via:
> >
> > # systemctl restart ipa
> >
> >This produced some errors in the logs:
> >
> >[31/Aug/2015:20:11:02 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3623 max work q stack size 496
> >[31/Aug/2015:20:11:02 -0400] - slapd shutting down - waiting for 30 threads to terminate
> >...
> >[31/Aug/2015:20:12:34 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
> >[31/Aug/2015:20:12:34 -0400] - WARNING -- Minimum cache size is 512000 -- rounding up
> >[31/Aug/2015:20:12:34 -0400] - WARNING: changelog: entry cache size 512000B is less than db size 400465920B; We recommend to increase the entry cache size nsslapd-cachememsize.
> >[31/Aug/2015:20:12:34 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> >[31/Aug/2015:20:12:35 -0400] schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> >....
> >[31/Aug/2015:20:12:39 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, which should be added before the CoS Definition.
> >[31/Aug/2015:20:12:39 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
> >[31/Aug/2015:20:13:08 -0400] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/0cccfa05-2cb911e5-ac0b8f7e-e0b1a377.sema; NSPR error - -5943
> >[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
> >[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> >[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/[email protected]] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> >[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/[email protected]] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
> >....
> >
> >Followed by almost 1M lines of...
> >
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511742 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511743 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511744 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511745 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511746 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511747 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511748 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511749 (rc: 32)
> >[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 511750 (rc: 32)
> >...
> >
> >The delete_changerecord id went from 511742 to 1471562 in increasing order.
> >Are these normal or a sign of something more serious?
> >
> >After we restarted ipa, both replicas eventually came back up and appear to
> >be operating as normal.
> >
> >Any ideas what could have caused this and where to look for more info? We'd
> >obviously like to prevent this from happening again, as we suspect this is
> >what caused our replica failure back in June as well.
>
> If the server is not responsive and the incoming connections pile up
> consuming all the file descriptors, this could be caused by a deadlock in
> the server or by some clients not reading responses and blocking all other
> threads from progressing.
> If you run into this situation again, could you try to get a pstack from
> the process?
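For reference, capturing that stack dump on a hung replica could look roughly like this (a sketch, not from the thread: pstack is shipped with gdb on RHEL/CentOS, the 389-ds-base debuginfo packages are needed for readable symbols, and the output path is a placeholder):

```shell
# dump_stacks: print a backtrace of every thread in the given PID.
# Prefers pstack and falls back to driving gdb directly.
dump_stacks() {
    if command -v pstack >/dev/null 2>&1; then
        pstack "$1"
    else
        gdb -p "$1" -batch -ex "thread apply all bt" 2>/dev/null
    fi
}

# On the wedged replica (as root), while the hang is still in progress:
#   dump_stacks "$(pidof ns-slapd)" > "/var/tmp/ns-slapd.stacks.$(date +%s)"
```

Taking two or three dumps a minute apart helps distinguish a hard deadlock (identical stacks every time) from threads that are merely slow.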
Yes, we'll be sure to grab a pstack next time. Could this possibly be related
to replication? These errors happened right around the time the servers
crashed:

- Warning: Attempting to release replica, but unable to receive
  endReplication extended operation response from the replica
- Incremental protocol: event backoff_timer_expired should not occur in
  state start_backoff

> About the changelog messages, it looks like after a crash the information
> about where changelog trimming should start is lost: trimming starts at the
> wrong number, can no longer find the records in the retro changelog, and
> logs an error for each changenumber attempted.
> We should investigate this and try to prevent it, or at least not pollute
> the log.
>
> About the "Check if DB RUV ..." message, it is misleading; the server is
> performing the check itself, so the message would better read "Checking
> if ...".

Great, thanks for the info.

> >Should we be concerned about these errors:
> >
> > replica_check_for_data_reload: Warning: disordely shutdown for replica dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
> > replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
> >
> >How do we check if DB RUV needs to be updated?
> >
> >running: ipa-server-4.1.0-18, 389-ds-base-1.3.3.1-16, CentOS 7.1.1503
> >
> >Thanks in advance for any help!
> >
> >Best,
> >
> >--Andrew
>
> --
> Manage your subscription for the Freeipa-users mailing list:
> https://www.redhat.com/mailman/listinfo/freeipa-users
> Go to http://freeipa.org for more info on the project
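On the open "how do we check the DB RUV" question: one common way to inspect the database RUV in 389-ds is to read it from the per-suffix tombstone entry with the well-known all-f nsUniqueId (a sketch; the suffix is this deployment's, and the bind DN and host are placeholders - run it against each replica and compare):

```shell
# The database RUV is stored in a tombstone entry whose nsUniqueId is the
# fixed all-f value; the nsds50ruv attribute lists one line per master
# with its min/max CSNs.
suffix="dc=cbls,dc=ccr,dc=buffalo,dc=edu"
filter='(&(objectclass=nstombstone)(nsUniqueId=ffffffff-ffffffff-ffffffff-ffffffff))'

# Against each replica (prompts for the Directory Manager password):
#   ldapsearch -x -D "cn=directory manager" -W -h srv-m14-30 \
#       -b "$suffix" "$filter" nsds50ruv
```

If the per-master maxcsn values diverge and stay diverged while changes are still flowing, that replica's RUV likely needs attention (for example, re-initializing the replica from a good master).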
