On 09/01/2015 04:39 PM, Andrew E. Bruno wrote:
A few months ago we had a replica failure where the system ran out of file
descriptors and the slapd database was corrupted:

https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html

We now monitor file descriptor counts on our replicas and last night we
had 2 of our 3 replicas fail and become completely unresponsive. Trying
to kinit on the replica resulted in:

[user@ipa-master]$ kinit
kinit: Generic error (see e-text) while getting initial credentials


Snippet from the /var/log/dirsrv/slapd-[domain]/errors:

[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: 
Attempting to release replica, but unable to receive endReplication extended operation 
response from the replica. Error -5 (Timed out)
[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer (Timed out). 
Will retry later.
[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer (Timed out). 
Will retry later.
[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer (Timed out). 
Will retry later.
[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer (Timed out). 
Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer (Timed out). 
Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental 
protocol: event backoff_timer_expired should not occur in state start_backoff
[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer (Timed out). 
Will retry later.
[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer (Timed out). 
Will retry later.

The access logs were filling up with:

[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 
10.106.14.29 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 
10.104.9.137 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 
10.104.16.19 to 10.113.14.30
[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 
10.111.11.30 to 10.113.14.30
[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 
10.113.27.115 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 
10.111.8.116 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 
10.113.25.40 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 
10.106.14.27 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 
10.111.10.141 to 10.113.14.30
[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 
10.104.14.27 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 
10.106.13.132 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 
10.113.25.11 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 
10.104.15.11 to 10.113.14.30
[31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 
10.104.7.136 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 
10.113.24.23 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 
10.106.12.105 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 
10.104.16.41 to 10.113.14.30
[31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 
10.104.16.4 to 10.113.14.30
[31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 
10.111.8.12 to 10.113.14.30
[31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 
10.111.8.17 to 10.113.14.30
....


Seems like clients were connecting to the replicas but file descriptors were
not getting released. Our monitoring showed increasing file descriptor counts
on both replicas (the FD counts are normally ~600):

DateTime                  |   Host        |   File Descriptor Count
--------------------------------------------------------------------
Mon, 31 Aug 2015 17:28:28 |   srv-m14-32  |   1394
Mon, 31 Aug 2015 17:28:40 |   srv-m14-30  |   1192
Mon, 31 Aug 2015 18:28:28 |   srv-m14-32  |   2478
Mon, 31 Aug 2015 18:28:40 |   srv-m14-30  |   2212
Mon, 31 Aug 2015 19:28:28 |   srv-m14-32  |   3305
Mon, 31 Aug 2015 19:28:40 |   srv-m14-30  |   3058
....
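(For context, the counts above come from polling /proc; a minimal sketch of that kind of check, Linux only, is below. For the directory server you would use pid=$(pidof ns-slapd) — the current shell is substituted here just so the snippet runs anywhere.)

```shell
# Count open file descriptors for a process by listing /proc/<pid>/fd.
# pid=$(pidof ns-slapd) would be the real target; $$ is a stand-in.
pid=$$
fd_count=$(ls "/proc/$pid/fd" | wc -l)
echo "$fd_count"
```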

We can confirm this via logconv.pl:


Start of Logs:    31/Aug/2015:14:55:01
End of Logs:      31/Aug/2015:16:42:37
...
FDs Taken:                    3140
FDs Returned:                 3160
Highest FD Taken:             603


Start of Logs:    31/Aug/2015:16:42:37
End of Logs:      31/Aug/2015:20:18:41
..
FDs Taken:                    4562
FDs Returned:                 1336
Highest FD Taken:             3755
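(The summaries above were produced by running logconv.pl over the access logs, roughly like the following; the instance name is ours, and the -S/-E time-window flags are optional:)

```shell
# logconv.pl ships with 389-ds-base; -S/-E restrict analysis to a window.
logconv.pl -S "[31/Aug/2015:16:42:37]" -E "[31/Aug/2015:20:18:41]" \
    /var/log/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/access*
```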


We suspect something happened around 31/Aug/2015:17:13:00 that caused both
replicas to become unresponsive and leak file descriptors. Luckily we caught
this before the system ran out of file descriptors. We logged onto each replica
and restarted ipa via:

     # systemctl restart ipa

This produced some errors in the logs:

[31/Aug/2015:20:11:02 -0400] - slapd shutting down - signaling operation 
threads - op stack size 0 max work q size 3623 max work q stack size 496
[31/Aug/2015:20:11:02 -0400] - slapd shutting down - waiting for 30 threads to 
terminate
...
[31/Aug/2015:20:12:34 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
[31/Aug/2015:20:12:34 -0400] - WARNING -- Minimum cache size is 512000 -- 
rounding up
[31/Aug/2015:20:12:34 -0400] - WARNING: changelog: entry cache size 512000B is 
less than db size 400465920B; We recommend to increase the entry cache size 
nsslapd-cachememsize.
[31/Aug/2015:20:12:34 -0400] - Detected Disorderly Shutdown last time Directory 
Server was running, recovering database.
[31/Aug/2015:20:12:35 -0400] schema-compat-plugin - warning: no entries set up 
under cn=computers, cn=compat,dc=cbls,dc=ccr,dc=buffalo,dc=edu
....
[31/Aug/2015:20:12:39 -0400] - Skipping CoS Definition cn=Password 
Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, 
which should be added before the CoS Definition.
[31/Aug/2015:20:12:39 -0400] NSMMReplicationPlugin - changelog program - 
_cl5NewDBFile: PR_DeleteSemaphore: 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema;
 NSPR error - -5943
[31/Aug/2015:20:13:08 -0400] NSMMReplicationPlugin - changelog program - 
_cl5NewDBFile: PR_DeleteSemaphore: 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/0cccfa05-2cb911e5-ac0b8f7e-e0b1a377.sema;
 NSPR error - -5943
[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - 
replica_check_for_data_reload: Warning: disordely shutdown for replica 
dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
[31/Aug/2015:20:13:11 -0400] NSMMReplicationPlugin - 
replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. 
Check if DB RUV needs to be updated
[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials 
for principal [ldap/srv-m14-32.cbls.ccr.buffalo....@cbls.ccr.buffalo.edu] in 
keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for 
requested realm)
[31/Aug/2015:20:13:11 -0400] set_krb5_creds - Could not get initial credentials 
for principal [ldap/srv-m14-32.cbls.ccr.buffalo....@cbls.ccr.buffalo.edu] in 
keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for 
requested realm)
....

Followed by almost 1M lines of...

[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511742 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511743 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511744 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511745 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511746 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511747 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511748 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511749 (rc: 32)
[31/Aug/2015:20:13:11 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 511750 (rc: 32)
...


The delete_changerecord ids ran from 511742 to 1471562 in increasing order. Is
this normal, or a sign of something more serious?

After we restarted ipa, both replicas eventually came back up and appear to be
operating as normal.

Any ideas what could have caused this and where to look for more info? We'd
obviously like to prevent this from happening again as we suspect this is what
caused our replica failure back in June as well.
If the server is not responsive and incoming connections pile up, consuming all the file descriptors, this could be caused by a deadlock in the server, or by some clients not reading responses and thereby blocking all other threads from progressing. If you run into this situation again, could you try to get a pstack of the process?
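(A capture along those lines might look like this; pstack ships with gdb on CentOS 7, and the pidof lookup assumes a single ns-slapd instance:)

```shell
# Dump all thread stacks of the hung directory server while it hangs.
# Take a few captures several seconds apart: a deadlock shows up as
# identical stacks across captures.
pstack $(pidof ns-slapd) > /var/tmp/ns-slapd.pstack.$(date +%s)
```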

About the changelog messages: it looks like, after a crash, the information about where changelog trimming should start is lost, so trimming starts at the wrong number, can no longer find the records in the retro changelog, and logs an error for each changenumber attempted. We should investigate this and try to prevent it, or at least not pollute the log.

About the "Check if DB RUV ..." message: it is misleading. The server performs the check itself; the message would better read "Checking if ...".

Should we be concerned about these errors:

  replica_check_for_data_reload: Warning: disordely shutdown for replica 
dc=cbls,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
  replica_check_for_data_reload: Warning: disordely shutdown for replica 
o=ipaca. Check if DB RUV needs to be updated

How do we check if DB RUV needs to be updated?
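(For comparison purposes, the database RUV can be read from the replica's tombstone entry with a standard 389-ds query along these lines; the bind DN and suffix below are placeholders for your own values:)

```shell
# Read the database RUV stored in the replica's tombstone entry.
ldapsearch -xLLL -D "cn=Directory Manager" -W \
    -b "dc=cbls,dc=ccr,dc=buffalo,dc=edu" \
    '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' \
    nsds50ruv
```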

running: ipa-server-4.1.0-18, 389-ds-base-1.3.3.1-16, CentOS 7.1.1503

Thanks in advance for any help!

Best,

--Andrew


--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project
