Hey I'm comming into this conversation a little late but could there be a max open file handle issue on server B. That's the first thing that poped into my mind when I read the description of the issue. On Nov 13, 2012 3:57 PM, "Reinhard Nappert" <[email protected]> wrote:
> There are a lot of RST for the time frame, especially on Server A. **** > > ** ** > > I also see a lot TCP Retransmission for this particular replication > session. I also see a couple of TCP Dup ACKs. Is this enough to cause the > issue?**** > > ** ** > > Of course, I don’t have an idea, what is the cause of that!**** > > ** ** > > *From:* Rich Megginson [mailto:[email protected]] > *Sent:* Tuesday, November 13, 2012 1:57 PM > *To:* Reinhard Nappert > *Cc:* General discussion list for the 389 Directory server project. > *Subject:* Re: [389-users] MMR issue ...**** > > ** ** > > On 11/13/2012 11:53 AM, Reinhard Nappert wrote: **** > > How would you proceed to figure out what is going on there?**** > > > Since it doesn't appear that the replication logs are giving enough > information, and you don't see any disconnects or TCP resets happening in > the packet capture, then I guess you have no choice but to familiarize > yourself with the source code and use gdb. > > > **** > > **** > > You see that I ran out of ideas!**** > > **** > > Thanks**** > > **** > > *From:* Rich Megginson [mailto:[email protected] <[email protected]>] > *Sent:* Tuesday, November 13, 2012 1:32 PM > *To:* Reinhard Nappert > *Cc:* General discussion list for the 389 Directory server project. > *Subject:* Re: [389-users] MMR issue ...**** > > **** > > On 11/13/2012 11:21 AM, Reinhard Nappert wrote: **** > > The 3 servers do not crash.**** > > **** > > I am not sure about the network, though. My first assumption was that the > firewall (between A and B) might cause the issue. The latest occurrence > (the one, I described) had the firewall removed. I see quite some TCP > Retransmissions in the packet captures. Could that be the issue?**** > > > That could be, although that would mean there are so many tcp > retransmissions that take such a long time to process that it causes the > application to think the network connection has timed out. > > > > **** > > **** > > -Reinhard**** > > **** > > *From:* Rich Megginson [mailto:[email protected] <[email protected]>] > *Sent:* Tuesday, November 13, 2012 1:15 PM > *To:* General discussion list for the 389 Directory server project. > *Cc:* Reinhard Nappert > *Subject:* Re: [389-users] MMR issue ...**** > > **** > > On 11/13/2012 11:02 AM, Reinhard Nappert wrote: **** > > Rich,**** > > **** > > Do you know what the cause of this issue is?**** > > > No, I don't know. > > > > > **** > > You would expect that you saw this issue in different deployments, but I > only saw it in one instance. **** > > **** > > If it turns out that the issue I see is identical the issue, you > mentioned, I’d like to know, when it was fixed.**** > > > Upon further investigation, this does not appear to be the same as > https://fedorahosted.org/389/ticket/374 > > I'm not sure what the problem is. I've seen timeouts when servers crash > or there are network issues. > > > > > **** > > **** > > Thanks,**** > > -Reinhard**** > > **** > > *From:* [email protected] [ > mailto:[email protected]<[email protected]>] > *On Behalf Of *Reinhard Nappert > *Sent:* Tuesday, November 13, 2012 12:22 PM > *To:* Rich Megginson; General discussion list for the 389 Directory > server project. > *Subject:* Re: [389-users] MMR issue ...**** > > **** > > I use 1.2.8.2**** > > **** > > *From:* Rich Megginson [mailto:[email protected] <[email protected]>] > *Sent:* Tuesday, November 13, 2012 12:18 PM > *To:* General discussion list for the 389 Directory server project. > *Cc:* Reinhard Nappert > *Subject:* Re: [389-users] MMR issue ...**** > > **** > > On 11/13/2012 09:24 AM, Reinhard Nappert wrote: **** > > Hi,**** > > **** > > I’ve encountered issues with a MMR setup, which looks like the following:* > *** > > **** > > A ------- B**** > > \ /**** > > \ /**** > > \ /**** > > C**** > > **** > > The replication works for approximately 24 hours. There are not many > changes to the content anyway. After about 1 day, the attribute value of > the type “nsds5replicaLastUpdateStatus” changes to “1 Can't acquire busy > replica “ of the replication agreement object from type > “nsDS5ReplicationAgreement”. I see this message on C for the agreement > “C-to-B”. The start-time of the last update is 01:08:33. When I check the > status on B, it looks fine for “B-to-C” and “B-to-A”, however, the > start-time of the last update is stuck at 01:08:36 for “B-to-C”, whereas A > gets updated afterwards as well. I don’t have the values for A!**** > > **** > > When, I check errors and access on the boxes, I see the following:**** > > **** > > Errors on A:**** > > [10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Warning: unable to receive endReplication extended operation > response (Timed out)**** > > [10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Can't contact LDAP server). Will retry later.**** > > [10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth resumed**** > > [10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Can't contact LDAP server). Will retry later.**** > > [10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth resumed**** > > [10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Can't contact LDAP server). Will retry later.**** > > [10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't > connect to the LDAP server) ((null))**** > > [10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth resumed**** > > [10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't > connect to the LDAP server) ((null))**** > > [10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after [0.0] > seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE] > **** > > [10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed > out) ((null))**** > > [10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't > connect to the LDAP server) ((null))**** > > [10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth resumed**** > > [10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Can't contact LDAP server). Will retry later.**** > > [10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth resumed**** > > [10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Timed out). Will retry later.**** > > [10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Unable to receive the response for a startReplication extended > operation to consumer (Can't contact LDAP server). Will retry later.**** > > [10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth resumed**** > > [10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't > connect to the LDAP server) ((null))**** > > [10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after [0.0] > seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE] > **** > > [10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed > out) ((null))**** > > [10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't > connect to the LDAP server) ((null))**** > > [10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send bind > request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't > connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 > (Operation now in progress)**** > > [10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after [0.0] > seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE] > **** > > [10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed > out) ((null))**** > > [10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" > (B:389): Replication bind with SIMPLE auth resumed**** > > **** > > Errors on B:**** > > [10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A" > (A:389): Replication bind with SIMPLE auth resumed**** > > **** > > Errors on C:**** > > No entry for the time**** > > **** > > I also analyzed the access files and see the following:**** > > From access on A:**** > > I see replication sessions from B and C, established at 01:00:10 and > closed at 01:01:14.**** > > Next replication session from C, established at 01:05:12 and closed at > 01:06:12. There was NO session from B.**** > > B and C establish the next replication session at 01:07:52. During the > session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. > The session from B stays open, and I see a ABANDOM request from A at > 01:17:53. Eventually, the session gets closed (no unbind) at 01:23:25.**** > > [10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND > msgid=3**** > > Replication sessions from C continue to work fine.**** > > There is one more replication session from B, established at 01:27:56 and > closed at 01:28:38. After this session, B does not establish any > replication session anymore.**** > > **** > > From access on B:**** > > I see replication sessions from A and C, established at 01:00:09 and > closed at 01:01:14.**** > > Next replication session from C, established at 01:05:12 and closed at > 01:06:12. There was NO session from A.**** > > A and C establish the next replication session at 01:07:52. During the > session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. > The session from A stays open forever!!!**** > > Replication sessions from C continue to work fine.**** > > **** > > From access on C:**** > > I see replication sessions from A and B, established at 01:00:09 and > closed at 01:01:14.**** > > I don’t see any replication session from the 01:05 time-slot**** > > A and B establish the next replication session at 01:07:53. During the > session, a MOD is performed at 01:08:33. A closes this session at 01:09:33 > and B closes session at 01:09:38**** > > From now on, I only see replication sessions from A!**** > > **** > > Has anybody seen this kind of behavior? Any feedback is highly appreciated. > **** > > This looks like https://fedorahosted.org/389/ticket/374 > > What version of 389-ds-base?**** > > **** > > Thanks,**** > > -Reinhard**** > > > > > > > **** > > --**** > > 389 users mailing list**** > > [email protected]**** > > https://admin.fedoraproject.org/mailman/listinfo/389-users**** > > **** > > > > > > > **** > > --**** > > 389 users mailing list**** > > [email protected]**** > > https://admin.fedoraproject.org/mailman/listinfo/389-users**** > > **** > > **** > > ** ** > > -- > 389 users mailing list > [email protected] > https://admin.fedoraproject.org/mailman/listinfo/389-users >
-- 389 users mailing list [email protected] https://admin.fedoraproject.org/mailman/listinfo/389-users
