Re: [389-users] replication from 1.2.8.3 to 1.2.10.4

Robert Viduya Thu, 12 Jul 2012 07:50:57 -0700

On Jul 11, 2012, at 7:17 PM, Rich Megginson wrote:

> On 07/11/2012 11:12 AM, Robert Viduya wrote:
>> Is replication from a 1.2.8.3 server to a 1.2.10.4 server known to work or 
>> not work?  We're having changelog issues.
>> 
>> Background:
>> 
>> We have an ldap service consisting of 3 masters, 2 hubs and 16 slaves.  All 
>> were running 1.2.8.3 since last summer with no issues.  This summer, we 
>> decided to bring them all up to the latest stable release, 1.2.10.4.  We 
>> can't afford a lot of downtime for the service as a whole, but with the 
>> redundancy level we have, we can take down a machine or two at a time 
>> without user impact.
>> 
>> We started with one slave, did a clean install of 1.2.10.4 on it, set up 
>> replication agreements from our 1.2.8.3 hubs to it and watched it for a week 
>> or so.  Everything looked fine, so we started rolling through the rest of 
>> the slave servers, got them all running 1.2.10.4 and so far haven't seen any 
>> problems.
>> 
>> A couple of days ago, I did one of our two hubs.  The first time I bring up 
>> the daemon after doing the initial import of our ldap data everything seems 
>> fine.  However, we start seeing errors the first time we restart:
>> 
>> [11/Jul/2012:10:43:58 -0400] - slapd shutting down - signaling operation 
>> threads
>> [11/Jul/2012:10:43:58 -0400] - slapd shutting down - waiting for 2 threads 
>> to terminate
>> [11/Jul/2012:10:44:01 -0400] - slapd shutting down - closing down internal 
>> subsystems and plugins
>> [11/Jul/2012:10:44:02 -0400] - Waiting for 4 database threads to stop
>> [11/Jul/2012:10:44:04 -0400] - All database threads now stopped
>> [11/Jul/2012:10:44:04 -0400] - slapd stopped.
>> [11/Jul/2012:10:45:00 -0400] - 389-Directory/1.2.10.4 B2012.101.2023 
>> starting up
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the 
>> max CSN [4ffdca7e000000330000] from RUV [changelog max RUV] is larger than 
>> the max CSN [4ffb605d000000330000] from RUV [database RUV] for element 
>> [{replica 51} 4ffb602b000300330000 4ffdca7e000000330000]
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - 
>> replica_check_for_data_reload: Warning: data for replica 
>> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not 
>> match the data in the changelog. Recreating the changelog file. This could 
>> affect replication with replica's consumers in which case the consumers 
>> should be reinitialized.
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the 
>> max CSN [4ffdca70000000340000] from RUV [changelog max RUV] is larger than 
>> the max CSN [4ffb7098000100340000] from RUV [database RUV] for element 
>> [{replica 52} 4ffb6ea2000000340000 4ffdca70000000340000]
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - 
>> replica_check_for_data_reload: Warning: data for replica 
>> ou=people,dc=gted,dc=gatech,dc=edu does not match the data in the changelog. 
>> Recreating the changelog file. This could affect replication with replica's 
>> consumers in which case the consumers should be reinitialized.
>> [11/Jul/2012:10:45:08 -0400] - slapd started.  Listening on All Interfaces 
>> port 389 for LDAP requests
>> [11/Jul/2012:10:45:08 -0400] - Listening on All Interfaces port 636 for 
>> LDAPS requests
> 
> The problem is that hubs have changelogs but dedicated consumers do not.
> 
> Were either of the replicas with ID 51 or 52 removed/deleted at some point in 
> the past?


No, 51 and 52 belong to an active, functional master.

> 
>> 
>> The _second_ restart is even worse, we get more error messages (see below) 
>> and then the daemon dies
> 
> Dies?  Exits?  Crashes?  Core files?  Do you see any ns-slapd segfault 
> messages in /var/log/messages?  When you restart the directory server after 
> it dies, do you see "Disorderly Shutdown" messages in the directory server 
> errors log?

Found these in the kernel log file:

Jul 11 10:46:26 bellar kernel: ns-slapd[4041]: segfault at 0000000000000011 rip 
00002b5fe0801857 rsp 0000000076e65970 error 4
Jul 11 10:47:23 bellar kernel: ns-slapd[4714]: segfault at 0000000000000011 rip 
00002b980c6ce857 rsp 00000000681f5970 error 4

And yes, we get "Disorderly Shutdown" messages in the errors log.

> 
> 
>> after it says it's listening on it's ports:
>> 
>> [11/Jul/2012:10:45:32 -0400] - slapd shutting down - signaling operation 
>> threads
>> [11/Jul/2012:10:45:32 -0400] - slapd shutting down - waiting for 29 threads 
>> to terminate
>> [11/Jul/2012:10:45:34 -0400] - slapd shutting down - closing down internal 
>> subsystems and plugins
>> [11/Jul/2012:10:45:35 -0400] - Waiting for 4 database threads to stop
>> [11/Jul/2012:10:45:36 -0400] - All database threads now stopped
>> [11/Jul/2012:10:45:36 -0400] - slapd stopped.
>> [11/Jul/2012:10:46:11 -0400] - 389-Directory/1.2.10.4 B2012.101.2023 
>> starting up
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV 
>> [changelog max RUV] does not contain element [{replica 68 
>> ldap://gtedm3.iam.gatech.edu:389} 4be339e6000000440000 4ffdc9a1000000440000] 
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV 
>> [changelog max RUV] does not contain element [{replica 71 
>> ldap://gtedm4.iam.gatech.edu:389} 4be6031e000000470000 4ffdc9a8000000470000] 
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the 
>> max CSN [4ffb62a2000100330000] from RUV [changelog max RUV] is larger than 
>> the max CSN [4ffb605d000000330000] from RUV [database RUV] for element 
>> [{replica 51} 4ffb605d000000330000 4ffb62a2000100330000]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - 
>> replica_check_for_data_reload: Warning: data for replica 
>> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not 
>> match the data in the changelog. Recreating the changelog file. This could 
>> affect replication with replica's consumers in which case the consumers 
>> should be reinitialized.
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV 
>> [changelog max RUV] does not contain element [{replica 69 
>> ldap://gtedm3.iam.gatech.edu:389} 4be339e4000000450000 4ffdc9a2000000450000] 
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV 
>> [changelog max RUV] does not contain element [{replica 72 
>> ldap://gtedm4.iam.gatech.edu:389} 4be6031d000000480000 4ffdc9a9000300480000] 
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the 
>> max CSN [4ffb78bc000000340000] from RUV [changelog max RUV] is larger than 
>> the max CSN [4ffb7098000100340000] from RUV [database RUV] for element 
>> [{replica 52} 4ffb7098000100340000 4ffb78bc000000340000]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - 
>> replica_check_for_data_reload: Warning: data for replica 
>> ou=people,dc=gted,dc=gatech,dc=edu does not match the data in the changelog. 
>> Recreating the changelog file. This could affect replication with replica's 
>> consumers in which case the consumers should be reinitialized.
>> [11/Jul/2012:10:46:11 -0400] - slapd started.  Listening on All Interfaces 
>> port 389 for LDAP requests
>> [11/Jul/2012:10:46:11 -0400] - Listening on All Interfaces port 636 for 
>> LDAPS requests
>> 
>> At this point, the only way I've found to get it back is to clean out the 
>> changelog and db directories and re-import the ldap data from scratch.  
>> Essentially we can't restart without having to re-import.  I've done this a 
>> couple of times already and it's entirely reproducible.
> So every time you shutdown the server, and attempt to restart it, it doesn't 
> start until you re-import?

No, the first restart works, but we get changelog errors in the log file.  
Subsequent restarts don't work at all without rebuilding everything.

>> 
>> I've checked and ensured that there's no obsolete masters that need to be 
>> CLEANRUVed.  I've also noticed that the errors _seem_ to be only affecting 
>> our second and third suffix.  We have three suffixes defined, but I haven't 
>> seen any error messages for the first one.
>> 
>> Has anyone seen anything like this?  We're not sure if this is a general 
>> 1.2.10.4 issue or if it only occurs if when replicating from 1.2.8.3 to 
>> 1.2.10.4.  If it's the former, we cannot proceed with getting the rest of 
>> the servers up to 1.2.10.4.  If it's the latter, then we need to expedite 
>> getting everything up to 1.2.10.4.
> 
> These do not seem like issues related to replicating from 1.2.8 to 1.2.10.  
> Have you tried a simple test of setting up 2 1.2.10 masters and attempting to 
> replicate your data between them?

Not yet, I may try this next, but it will take some time to set up.

> 
>> --
>> 389 users mailing list
>> [email protected]
>> https://admin.fedoraproject.org/mailman/listinfo/389-users
> 

--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] replication from 1.2.8.3 to 1.2.10.4

Reply via email to