On 04 May 2014, at 9:35 PM, David Boreham <[email protected]> wrote:

> ottomh I can't think what's up with your servers, but for sure there's 
> something very odd going on (beyond just "not configured correctly" I 
> suspect). The errors where the replica initialization reports that the 
> database index files have been deleted underneath it are particularly wacky.
> 
> Anyway, I'd recommend that you turn up the logging level on the servers. This 
> will hopefully reveal more about what's going wrong. Also look carefully in 
> the log for the _first_ sign of trouble. That will likely be the root cause. 
> I suspect you have a lot of fog of war showing up that's generated by some 
> underlying prime error, that aught to have appeared first in the timeline.
> 
> It should be possible to add an N+1th replica to an N-node deployment. 
> Replication agreements are peer-to-peer, so you just add a new replication 
> agreement from each of the servers you want to feed changes to the N+1th 
> (typically all of them).
> 
> In the log messages, where you were wondering which consumer server is 
> throwing the error, the name of the replication agreement is typically 
> printed. So the server it is trying to talk to is the one that's the target 
> for the replication agreement mentioned.

On the serverc machine, when serverb has been instructed to initialise serverc, 
we see the following with replication debugging enabled:

[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 Acquired 
consumer connection extension
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 
repl="o=Foo,c=za": Begin total protocol
[04/May/2014:23:03:36 +0200] - csngen_adjust_time: gen state before 
5366aac20001:1399237314:0:0
[04/May/2014:23:03:36 +0200] - _csngen_adjust_local_time: gen state before 
5366aac20001:1399237314:0:0
[04/May/2014:23:03:36 +0200] - _csngen_adjust_local_time: gen state after 
5366ab280000:1399237416:0:0
[04/May/2014:23:03:36 +0200] - csngen_adjust_time: gen state after 
536708fc0001:1399237416:0:24020
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 
repl="o=Foo,c=ZA": Acquired replica
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - 
multimaster_be_state_change: replica o=Foo,c=za is going offline; disabling 
replication
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=0 op=0 
repl="o=Foo,c=ZA": Replica in use locking_purl=conn=1 id=3
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - 
replica_disable_replication: replica o=Foo,c=za is acquired
[04/May/2014:23:03:36 +0200] - WARNING: Import is running with 
nsslapd-db-private-import-mem on; No other process is allowed to access the 
database
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 
repl="o=Foo,c=za": StartNSDS90ReplicationRequest: response=0 rc=0
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 Relinquishing 
consumer connection extension

Nothing in the above seems to indicate an error that I can see, but we now see 
this two seconds later:

[04/May/2014:23:03:38 +0200] - ERROR bulk import abandoned
[04/May/2014:23:03:38 +0200] - import userRoot: Aborting all Import threads...
[04/May/2014:23:03:43 +0200] - import userRoot: Import threads aborted.
[04/May/2014:23:03:43 +0200] - import userRoot: Closing files...
[04/May/2014:23:03:43 +0200] - libdb: userRoot/uid.db4: unable to flush: No 
such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/mailAlternateAddress.db4: unable 
to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/cn.db4: unable to flush: No such 
file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/mail.db4: unable to flush: No 
such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/givenName.db4: unable to flush: 
No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/aci.db4: unable to flush: No 
such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/entryrdn.db4: unable to flush: 
No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/id2entry.db4: unable to flush: 
No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/sn.db4: unable to flush: No such 
file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/parentid.db4: unable to flush: 
No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/objectclass.db4: unable to 
flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/nsuniqueid.db4: unable to flush: 
No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/mailHost.db4: unable to flush: 
No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/uniquemember.db4: unable to 
flush: No such file or directory
[04/May/2014:23:03:43 +0200] - import userRoot: Import failed.
[04/May/2014:23:03:43 +0200] NSMMReplicationPlugin - Aborting total update in 
progress for replicated area o=Foo,c=ZA connid=1
[04/May/2014:23:03:43 +0200] - process_bulk_import_op: NULL target sdn
[04/May/2014:23:03:43 +0200] NSMMReplicationPlugin - conn=1 op=-1 
repl="o=Foo,c=ZA": Released replica held by locking_purl=conn=1 id=3

It appears that despite the initial sync as having failed, we ignore the above 
error and pretend all is ok, I suspect this is why we're getting the weird 
messages below.

[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 Acquired 
consumer connection extension
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 
repl="o=Foo,c=za": Begin incremental protocol
[04/May/2014:23:03:45 +0200] - csngen_adjust_time: gen state before 
536708fc0001:1399237416:0:24020
[04/May/2014:23:03:45 +0200] - _csngen_adjust_local_time: gen state before 
536708fc0001:1399237416:0:24020
[04/May/2014:23:03:45 +0200] - _csngen_adjust_local_time: gen state after 
536709050000:1399237425:0:24020
[04/May/2014:23:03:45 +0200] - csngen_adjust_time: gen state after 
536709050000:1399237425:0:24020
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 
repl="o=Foo,c=ZA": Acquired replica
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 
repl="o=Foo,c=za": StartNSDS90ReplicationRequest: response=0 rc=0
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 Relinquishing 
consumer connection extension
[04/May/2014:23:03:46 +0200] NSMMReplicationPlugin - conn=2 op=4 Acquired 
consumer connection extension
[04/May/2014:23:03:46 +0200] NSMMReplicationPlugin - 
replica_replace_ruv_tombstone: failed to update replication update vector for 
replica o=Foo,c=ZA: LDAP error - 1

Turning on connection debugging shows a whole lot of data transfer, then an 
anomaly just before the ERROR bulk import abandoned:

[04/May/2014:23:14:33 +0200] - listener got signaled
[04/May/2014:23:14:33 +0200] - activity on 64r
[04/May/2014:23:14:33 +0200] - read activity on 64
[04/May/2014:23:14:33 +0200] - listener got signaled
[04/May/2014:23:14:33 +0200] - activity on 64r
[04/May/2014:23:14:33 +0200] - read activity on 64
[04/May/2014:23:14:33 +0200] - ber_get_next failed for connection 2
[04/May/2014:23:14:33 +0200] - conn 2 leaving turbo mode due to 3
[04/May/2014:23:14:33 +0200] - listener got signaled
[04/May/2014:23:14:33 +0200] - ERROR bulk import abandoned

We got "ber_get_next failed for connection 2" without being told what exactly 
failed. Will dig on the other side to see if there are any clues.

Regards,
Graham
--

--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users

Reply via email to