On 04 May 2014, at 9:35 PM, David Boreham <[email protected]> wrote:
> ottomh I can't think what's up with your servers, but for sure there's > something very odd going on (beyond just "not configured correctly" I > suspect). The errors where the replica initialization reports that the > database index files have been deleted underneath it are particularly wacky. > > Anyway, I'd recommend that you turn up the logging level on the servers. This > will hopefully reveal more about what's going wrong. Also look carefully in > the log for the _first_ sign of trouble. That will likely be the root cause. > I suspect you have a lot of fog of war showing up that's generated by some > underlying prime error, that aught to have appeared first in the timeline. > > It should be possible to add an N+1th replica to an N-node deployment. > Replication agreements are peer-to-peer, so you just add a new replication > agreement from each of the servers you want to feed changes to the N+1th > (typically all of them). > > In the log messages, where you were wondering which consumer server is > throwing the error, the name of the replication agreement is typically > printed. So the server it is trying to talk to is the one that's the target > for the replication agreement mentioned. On the serverc machine, when serverb has been instructed to initialise serverc, we see the following with replication debugging enabled: [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 Acquired consumer connection extension [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 repl="o=Foo,c=za": Begin total protocol [04/May/2014:23:03:36 +0200] - csngen_adjust_time: gen state before 5366aac20001:1399237314:0:0 [04/May/2014:23:03:36 +0200] - _csngen_adjust_local_time: gen state before 5366aac20001:1399237314:0:0 [04/May/2014:23:03:36 +0200] - _csngen_adjust_local_time: gen state after 5366ab280000:1399237416:0:0 [04/May/2014:23:03:36 +0200] - csngen_adjust_time: gen state after 536708fc0001:1399237416:0:24020 [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 repl="o=Foo,c=ZA": Acquired replica [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - multimaster_be_state_change: replica o=Foo,c=za is going offline; disabling replication [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=0 op=0 repl="o=Foo,c=ZA": Replica in use locking_purl=conn=1 id=3 [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - replica_disable_replication: replica o=Foo,c=za is acquired [04/May/2014:23:03:36 +0200] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 repl="o=Foo,c=za": StartNSDS90ReplicationRequest: response=0 rc=0 [04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 Relinquishing consumer connection extension Nothing in the above seems to indicate an error that I can see, but we now see this two seconds later: [04/May/2014:23:03:38 +0200] - ERROR bulk import abandoned [04/May/2014:23:03:38 +0200] - import userRoot: Aborting all Import threads... [04/May/2014:23:03:43 +0200] - import userRoot: Import threads aborted. [04/May/2014:23:03:43 +0200] - import userRoot: Closing files... [04/May/2014:23:03:43 +0200] - libdb: userRoot/uid.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/mailAlternateAddress.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/cn.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/mail.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/givenName.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/aci.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/entryrdn.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/id2entry.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/sn.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/parentid.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/objectclass.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/nsuniqueid.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/mailHost.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - libdb: userRoot/uniquemember.db4: unable to flush: No such file or directory [04/May/2014:23:03:43 +0200] - import userRoot: Import failed. [04/May/2014:23:03:43 +0200] NSMMReplicationPlugin - Aborting total update in progress for replicated area o=Foo,c=ZA connid=1 [04/May/2014:23:03:43 +0200] - process_bulk_import_op: NULL target sdn [04/May/2014:23:03:43 +0200] NSMMReplicationPlugin - conn=1 op=-1 repl="o=Foo,c=ZA": Released replica held by locking_purl=conn=1 id=3 It appears that despite the initial sync as having failed, we ignore the above error and pretend all is ok, I suspect this is why we're getting the weird messages below. [04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 Acquired consumer connection extension [04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 repl="o=Foo,c=za": Begin incremental protocol [04/May/2014:23:03:45 +0200] - csngen_adjust_time: gen state before 536708fc0001:1399237416:0:24020 [04/May/2014:23:03:45 +0200] - _csngen_adjust_local_time: gen state before 536708fc0001:1399237416:0:24020 [04/May/2014:23:03:45 +0200] - _csngen_adjust_local_time: gen state after 536709050000:1399237425:0:24020 [04/May/2014:23:03:45 +0200] - csngen_adjust_time: gen state after 536709050000:1399237425:0:24020 [04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 repl="o=Foo,c=ZA": Acquired replica [04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 repl="o=Foo,c=za": StartNSDS90ReplicationRequest: response=0 rc=0 [04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 Relinquishing consumer connection extension [04/May/2014:23:03:46 +0200] NSMMReplicationPlugin - conn=2 op=4 Acquired consumer connection extension [04/May/2014:23:03:46 +0200] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica o=Foo,c=ZA: LDAP error - 1 Turning on connection debugging shows a whole lot of data transfer, then an anomaly just before the ERROR bulk import abandoned: [04/May/2014:23:14:33 +0200] - listener got signaled [04/May/2014:23:14:33 +0200] - activity on 64r [04/May/2014:23:14:33 +0200] - read activity on 64 [04/May/2014:23:14:33 +0200] - listener got signaled [04/May/2014:23:14:33 +0200] - activity on 64r [04/May/2014:23:14:33 +0200] - read activity on 64 [04/May/2014:23:14:33 +0200] - ber_get_next failed for connection 2 [04/May/2014:23:14:33 +0200] - conn 2 leaving turbo mode due to 3 [04/May/2014:23:14:33 +0200] - listener got signaled [04/May/2014:23:14:33 +0200] - ERROR bulk import abandoned We got "ber_get_next failed for connection 2" without being told what exactly failed. Will dig on the other side to see if there are any clues. Regards, Graham -- -- 389 users mailing list [email protected] https://admin.fedoraproject.org/mailman/listinfo/389-users
