Re: [Freeipa-users] replicas unresponsive with increasing file descriptors

On 09/01/2015 09:20 AM, Andrew E. Bruno wrote:
On Tue, Sep 01, 2015 at 05:03:10PM +0200, Ludwig Krispenz wrote:
On 09/01/2015 04:39 PM, Andrew E. Bruno wrote:
[...]
Re: [Freeipa-users] replicas unresponsive with increasing file descriptors

On Tue, Sep 01, 2015 at 05:03:10PM +0200, Ludwig Krispenz wrote:
> On 09/01/2015 04:39 PM, Andrew E. Bruno wrote:
> [...]
Re: [Freeipa-users] replicas unresponsive with increasing file descriptors
On 09/01/2015 04:39 PM, Andrew E. Bruno wrote: A few months ago we had a replica failure where the system ran out of file descriptors and the slapd database was corrupted: https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html We now monitor file descriptor counts on our replicas and last night we had 2 of our 3 replicas fail and become completely unresponsive. Trying to kinit on the replica resulted in: [user@ipa-master]$ kinit kinit: Generic error (see e-text) while getting initial credentials Snippet from the /var/log/dirsrv/slapd-[domain]/errors: [31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out) [31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. 
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental protocol: event backoff_timer_expired should not occur in state start_backoff [31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. The access logs were filling up with: [31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 10.106.14.29 to 10.113.14.30 [31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 10.104.9.137 to 10.113.14.30 [31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 10.104.16.19 to 10.113.14.30 [31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 10.111.11.30 to 10.113.14.30 [31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 10.113.27.115 to 10.113.14.30 [31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 10.111.8.116 to 10.113.14.30 [31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 10.113.25.40 to 10.113.14.30 [31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 10.106.14.27 to 10.113.14.30 [31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 10.111.10.141 to 10.113.14.30 [31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 10.104.14.27 to 10.113.14.30 [31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 10.106.13.132 to 10.113.14.30 [31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 10.113.25.11 to 10.113.14.30 [31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 
slot=531 connection from 10.104.15.11 to 10.113.14.30 [31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 10.104.7.136 to 10.113.14.30 [31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 10.113.24.23 to 10.113.14.30 [31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 10.106.12.105 to 10.113.14.30 [31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 10.104.16.41 to 10.113.14.30 [31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 10.104.16.4 to 10.113.14.30 [31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 10.111.8.12 to 10.113.14.30 [31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 10.111.8.17 to 10.113.14.30 Seems like clients were connecting to the replicas but file descriptors were not getting released. Our monitoring showed increasing file descriptor counts on both re
[Freeipa-users] replicas unresponsive with increasing file descriptors
A few months ago we had a replica failure where the system ran out of file
descriptors and the slapd database was corrupted:

https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html

We now monitor file descriptor counts on our replicas and last night we had
2 of our 3 replicas fail and become completely unresponsive. Trying to kinit
on the replica resulted in:

[user@ipa-master]$ kinit
kinit: Generic error (see e-text) while getting initial credentials

Snippet from the /var/log/dirsrv/slapd-[domain]/errors:

[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental protocol: event backoff_timer_expired should not occur in state start_backoff
[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

The access logs were filling up with:

[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 10.106.14.29 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 10.104.9.137 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 10.104.16.19 to 10.113.14.30
[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 10.111.11.30 to 10.113.14.30
[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 10.113.27.115 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 10.111.8.116 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 10.113.25.40 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 10.106.14.27 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 10.111.10.141 to 10.113.14.30
[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 10.104.14.27 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 10.106.13.132 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 10.113.25.11 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 10.104.15.11 to 10.113.14.30
[31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 10.104.7.136 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 10.113.24.23 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 10.106.12.105 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 10.104.16.41 to 10.113.14.30
[31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 10.104.16.4 to 10.113.14.30
[31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 10.111.8.12 to 10.113.14.30
[31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 10.111.8.17 to 10.113.14.30

Seems like clients were connecting to the replicas but file descriptors were
not getting released. Our monitoring showed increasing file descriptor counts
on both replicas (the FD counts are normally ~600): DateTim
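For readers wanting to replicate the kind of file-descriptor monitoring described above, here is a minimal sketch of one way to do it on a Linux host, by counting entries under /proc/<pid>/fd for the ns-slapd process. This is not the poster's actual monitoring setup; the process name lookup and the warning threshold of 900 are illustrative assumptions (the real ceiling depends on nsslapd-maxdescriptors and the process ulimit).

```python
import os

def count_open_fds(pid):
    """Count open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def find_pids_by_name(name):
    """Return PIDs whose command name (/proc/<pid>/comm) matches `name`."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited between listdir() and open()
    return pids

if __name__ == "__main__":
    # "ns-slapd" is the 389-DS server process; the 900 threshold below is
    # an assumed example value, not taken from the thread.
    for pid in find_pids_by_name("ns-slapd"):
        n = count_open_fds(pid)
        print(f"ns-slapd pid={pid} open fds={n}")
        if n > 900:
            print(f"WARNING: pid {pid} fd count approaching limit")
```

Run from cron or a monitoring agent, a steadily climbing count like the one described in this thread (normal baseline ~600, then growing without release) would show up well before the server exhausts its descriptors.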