Has anybody else seen this problem?

-Reinhard

From: [email protected] 
[mailto:[email protected]] On Behalf Of Reinhard Nappert
Sent: Friday, August 03, 2012 2:51 PM
To: [email protected]
Subject: [389-users] MMR issue

Hi,

I have the following 389 DS version deployed: 389-Directory/1.2.8.2 B2011.130.190

I have a three-box multi-master replication (MMR) setup in a ring:


              \     /        \     /           \     /       \     /      \     /
           ...   C   -----   A    -----    B   -----  C   ----- A ...
              /      \       /      \          /      \     /      \     /      \

The replication agreements between "A" and "C" and between "B" and "C" work
fine, but the agreements between "A" and "B" have a problem.

I see the following in the errors file:

Server A:
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=base": Begin incremental protocol
[19/Jul/2012:07:28:50 -0300] - csngen_adjust_time: gen state before 5007e1610000:1342693727:0:2
[19/Jul/2012:07:28:50 -0300] - _csngen_adjust_local_time: gen state before 5007e1610000:1342693727:0:2
[19/Jul/2012:07:28:50 -0300] - _csngen_adjust_local_time: gen state after 5007e1640000:1342693730:0:2
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=BASE": Replica in use locking_purl=conn=7831 id=3
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 replica="o=BASE": Unable to acquire replica: error: replica busy locked by conn=7831 id=3 for incremental update
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=base": StartNSDS90ReplicationRequest: response=1 rc=0

This kind of error is logged about once per second; only the gen state
values (for example 5007e1610000:1342693727:0:2) differ between entries.
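
A minimal sketch for reading the "gen state" values quoted above. The field
layout is an assumption inferred from how the numbers in this log line up
(8 hex digits of CSN time, 4 hex digits of sequence number, then sampled
time, local offset and remote offset in decimal); it is not taken from the
389 DS documentation, and decode_gen_state is a hypothetical helper name.

```python
from datetime import datetime, timezone


def decode_gen_state(state):
    """Split a csngen 'gen state' string into named integer fields.

    Assumed layout (inferred from the log, not from documentation):
    <8 hex CSN time><4 hex seq>:<sampled time>:<local offset>:<remote offset>
    """
    csn_part, sampled, local_off, remote_off = state.split(":")
    return {
        "csn_time": int(csn_part[:8], 16),   # seconds since the epoch
        "seq": int(csn_part[8:12], 16),      # sequence number
        "sampled_time": int(sampled),
        "local_offset": int(local_off),
        "remote_offset": int(remote_off),
    }


before = decode_gen_state("5007e1610000:1342693727:0:2")
after = decode_gen_state("5007e1640000:1342693730:0:2")

# Show the CSN time as a UTC timestamp and how far it runs ahead of the
# sampled clock value.
print(datetime.fromtimestamp(before["csn_time"], tz=timezone.utc).isoformat())
print(before["csn_time"] - before["sampled_time"])
print(after["csn_time"] - before["csn_time"])
```

Under that assumed layout, the CSN time in both samples equals the sampled
time plus the two offsets, so the changing value looks like an ordinary
advancing clock rather than a symptom in itself.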


Server B:
[19/Jul/2012:13:28:48 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Jul/2012:13:34:17 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[19/Jul/2012:13:44:25 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]
[19/Jul/2012:13:44:25 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[19/Jul/2012:13:44:25 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth resumed

Sometimes I also see the following error:
[20/Jul/2012:11:28:39 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn= replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[20/Jul/2012:11:28:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[20/Jul/2012:11:30:30 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth resumed
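
To see how often each failure occurs per agreement, the errors file can be
tallied with a short script. This is only a sketch: it assumes each log
entry sits on one line (unlike the wrapped excerpts in this mail), and
tally_bind_failures is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the "Replication bind ... failed: LDAP error NN (text)" entries
# shown above and captures the agreement name, error code and error text.
BIND_FAILURE = re.compile(
    r'agmt="(?P<agmt>[^"]+)".*?failed: LDAP error (?P<code>\d+) \((?P<text>[^)]*)\)'
)


def tally_bind_failures(lines):
    """Count bind failures per (agreement, LDAP error code, error text)."""
    counts = Counter()
    for line in lines:
        match = BIND_FAILURE.search(line)
        if match:
            key = (match.group("agmt"), int(match.group("code")), match.group("text"))
            counts[key] += 1
    return counts


# Sample entries copied from the log excerpts in this mail, unwrapped.
sample = [
    '[19/Jul/2012:13:44:25 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): '
    'Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))',
    '[20/Jul/2012:11:28:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): '
    "Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the "
    'LDAP server) ((null))',
    '[20/Jul/2012:11:30:30 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): '
    'Replication bind with SIMPLE auth resumed',
]

counts = tally_bind_failures(sample)
for (agmt, code, text), n in sorted(counts.items()):
    print(f"{agmt}: LDAP error {code} ({text}) x{n}")
```

For a real run, pass an open file object for the errors file instead of the
sample list.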

I don't see any indication that Server B was down at that time.

I did see Bug 571677 (https://bugzilla.redhat.com/show_bug.cgi?id=571677),
but in my case no replicaconflict object was deleted.

Has anybody encountered this kind of issue? The next question would be how to
recover the MMR environment.

Thanks,
-Reinhard



--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users
