On Jul 11, 2012, at 7:17 PM, Rich Megginson wrote:
> On 07/11/2012 11:12 AM, Robert Viduya wrote:
>> Is replication from a 1.2.8.3 server to a 1.2.10.4 server known to work or
>> not work? We're having changelog issues.
>>
>> Background:
>>
>> We have an ldap service consisting of 3 masters, 2 hubs and 16 slaves. All
>> had been running 1.2.8.3 since last summer with no issues. This summer, we
>> decided to bring them all up to the latest stable release, 1.2.10.4. We
>> can't afford a lot of downtime for the service as a whole, but with the
>> redundancy level we have, we can take down a machine or two at a time
>> without user impact.
>>
>> We started with one slave, did a clean install of 1.2.10.4 on it, set up
>> replication agreements from our 1.2.8.3 hubs to it and watched it for a week
>> or so. Everything looked fine, so we started rolling through the rest of
>> the slave servers, got them all running 1.2.10.4 and so far haven't seen any
>> problems.
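>> For reference, the agreement entries we add on the hubs look roughly like
>> the sketch below (the agreement cn, host name, bind DN and credentials are
>> placeholders here, not our real values); there is one such entry per
>> suffix:
>>
>> dn: cn=to-new-slave,cn=replica,cn="ou=people,dc=gted,dc=gatech,dc=edu",
>>  cn=mapping tree,cn=config
>> objectclass: top
>> objectclass: nsds5replicationagreement
>> cn: to-new-slave
>> nsds5ReplicaHost: new-slave.example.edu
>> nsds5ReplicaPort: 389
>> nsds5ReplicaRoot: ou=people,dc=gted,dc=gatech,dc=edu
>> nsds5ReplicaBindDN: cn=replication manager,cn=config
>> nsds5ReplicaBindMethod: SIMPLE
>> nsds5ReplicaTransportInfo: LDAP
>> nsds5ReplicaCredentials: *******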
>>
>> A couple of days ago, I upgraded one of our two hubs. The first time I bring
>> up the daemon after doing the initial import of our ldap data, everything
>> seems fine. However, we start seeing errors the first time we restart:
>>
>> [11/Jul/2012:10:43:58 -0400] - slapd shutting down - signaling operation
>> threads
>> [11/Jul/2012:10:43:58 -0400] - slapd shutting down - waiting for 2 threads
>> to terminate
>> [11/Jul/2012:10:44:01 -0400] - slapd shutting down - closing down internal
>> subsystems and plugins
>> [11/Jul/2012:10:44:02 -0400] - Waiting for 4 database threads to stop
>> [11/Jul/2012:10:44:04 -0400] - All database threads now stopped
>> [11/Jul/2012:10:44:04 -0400] - slapd stopped.
>> [11/Jul/2012:10:45:00 -0400] - 389-Directory/1.2.10.4 B2012.101.2023
>> starting up
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the
>> max CSN [4ffdca7e000000330000] from RUV [changelog max RUV] is larger than
>> the max CSN [4ffb605d000000330000] from RUV [database RUV] for element
>> [{replica 51} 4ffb602b000300330000 4ffdca7e000000330000]
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin -
>> replica_check_for_data_reload: Warning: data for replica
>> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not
>> match the data in the changelog. Recreating the changelog file. This could
>> affect replication with replica's consumers in which case the consumers
>> should be reinitialized.
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the
>> max CSN [4ffdca70000000340000] from RUV [changelog max RUV] is larger than
>> the max CSN [4ffb7098000100340000] from RUV [database RUV] for element
>> [{replica 52} 4ffb6ea2000000340000 4ffdca70000000340000]
>> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin -
>> replica_check_for_data_reload: Warning: data for replica
>> ou=people,dc=gted,dc=gatech,dc=edu does not match the data in the changelog.
>> Recreating the changelog file. This could affect replication with replica's
>> consumers in which case the consumers should be reinitialized.
>> [11/Jul/2012:10:45:08 -0400] - slapd started. Listening on All Interfaces
>> port 389 for LDAP requests
>> [11/Jul/2012:10:45:08 -0400] - Listening on All Interfaces port 636 for
>> LDAPS requests
>
> The problem is that hubs have changelogs but dedicated consumers do not.
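> A hub gets its changelog from the cn=changelog5 entry, roughly the
> following in cn=config (directory path adjusted per instance), while a
> dedicated consumer has no such entry:
>
> dn: cn=changelog5,cn=config
> objectclass: top
> objectclass: extensibleObject
> cn: changelog5
> nsslapd-changelogdir: /var/lib/dirsrv/slapd-INSTANCE/changelogdb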
>
> Were either of the replicas with ID 51 or 52 removed/deleted at some point in
> the past?
No, 51 and 52 belong to an active, functional master.
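For what it's worth, I verified that with a plain search of the replica
configuration entries on the masters, along the lines of the following (with
the real master host substituted):

ldapsearch -x -H ldap://MASTERHOST:389 -D "cn=directory manager" -W \
    -b cn=config "(objectclass=nsDS5Replica)" nsDS5ReplicaId nsDS5ReplicaRoot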
>
>>
>> The _second_ restart is even worse: we get more error messages (see below)
>> and then the daemon dies
>
> Dies? Exits? Crashes? Core files? Do you see any ns-slapd segfault
> messages in /var/log/messages? When you restart the directory server after
> it dies, do you see "Disorderly Shutdown" messages in the directory server
> errors log?
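>
> If it is crashing, a full stack trace would help. Roughly the following,
> assuming the usual Fedora/RHEL package names and paths (adjust as needed):
>
> # make sure ns-slapd is allowed to dump core (e.g. set ulimit -c unlimited
> # in /etc/sysconfig/dirsrv), then reproduce the crash
> debuginfo-install -y 389-ds-base
> gdb -batch -ex "thread apply all bt full" /usr/sbin/ns-slapd /path/to/core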
Found these in the kernel log file:
Jul 11 10:46:26 bellar kernel: ns-slapd[4041]: segfault at 0000000000000011 rip
00002b5fe0801857 rsp 0000000076e65970 error 4
Jul 11 10:47:23 bellar kernel: ns-slapd[4714]: segfault at 0000000000000011 rip
00002b980c6ce857 rsp 00000000681f5970 error 4
And yes, we get "Disorderly Shutdown" messages in the errors log.
>
>
>> after it says it's listening on its ports:
>>
>> [11/Jul/2012:10:45:32 -0400] - slapd shutting down - signaling operation
>> threads
>> [11/Jul/2012:10:45:32 -0400] - slapd shutting down - waiting for 29 threads
>> to terminate
>> [11/Jul/2012:10:45:34 -0400] - slapd shutting down - closing down internal
>> subsystems and plugins
>> [11/Jul/2012:10:45:35 -0400] - Waiting for 4 database threads to stop
>> [11/Jul/2012:10:45:36 -0400] - All database threads now stopped
>> [11/Jul/2012:10:45:36 -0400] - slapd stopped.
>> [11/Jul/2012:10:46:11 -0400] - 389-Directory/1.2.10.4 B2012.101.2023
>> starting up
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV
>> [changelog max RUV] does not contain element [{replica 68
>> ldap://gtedm3.iam.gatech.edu:389} 4be339e6000000440000 4ffdc9a1000000440000]
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV
>> [changelog max RUV] does not contain element [{replica 71
>> ldap://gtedm4.iam.gatech.edu:389} 4be6031e000000470000 4ffdc9a8000000470000]
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the
>> max CSN [4ffb62a2000100330000] from RUV [changelog max RUV] is larger than
>> the max CSN [4ffb605d000000330000] from RUV [database RUV] for element
>> [{replica 51} 4ffb605d000000330000 4ffb62a2000100330000]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin -
>> replica_check_for_data_reload: Warning: data for replica
>> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not
>> match the data in the changelog. Recreating the changelog file. This could
>> affect replication with replica's consumers in which case the consumers
>> should be reinitialized.
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV
>> [changelog max RUV] does not contain element [{replica 69
>> ldap://gtedm3.iam.gatech.edu:389} 4be339e4000000450000 4ffdc9a2000000450000]
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV
>> [changelog max RUV] does not contain element [{replica 72
>> ldap://gtedm4.iam.gatech.edu:389} 4be6031d000000480000 4ffdc9a9000300480000]
>> which is present in RUV [database RUV]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the
>> max CSN [4ffb78bc000000340000] from RUV [changelog max RUV] is larger than
>> the max CSN [4ffb7098000100340000] from RUV [database RUV] for element
>> [{replica 52} 4ffb7098000100340000 4ffb78bc000000340000]
>> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin -
>> replica_check_for_data_reload: Warning: data for replica
>> ou=people,dc=gted,dc=gatech,dc=edu does not match the data in the changelog.
>> Recreating the changelog file. This could affect replication with replica's
>> consumers in which case the consumers should be reinitialized.
>> [11/Jul/2012:10:46:11 -0400] - slapd started. Listening on All Interfaces
>> port 389 for LDAP requests
>> [11/Jul/2012:10:46:11 -0400] - Listening on All Interfaces port 636 for
>> LDAPS requests
>>
>> At this point, the only way I've found to get it back is to clean out the
>> changelog and db directories and re-import the ldap data from scratch.
>> Essentially we can't restart without having to re-import. I've done this a
>> couple of times already and it's entirely reproducible.
> So every time you shut down the server and attempt to restart it, it doesn't
> start until you re-import?
No, the first restart works, but we get changelog errors in the log file.
Subsequent restarts don't work at all without rebuilding everything.
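The rebuild itself is roughly the following (instance and backend names and
the export path are placeholders; the import is repeated once per backend):

service dirsrv stop INSTANCE
rm -rf /var/lib/dirsrv/slapd-INSTANCE/changelogdb/*
rm -rf /var/lib/dirsrv/slapd-INSTANCE/db/*
/usr/lib64/dirsrv/slapd-INSTANCE/ldif2db -n BACKEND -i /path/to/export.ldif
service dirsrv start INSTANCE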
>>
>> I've checked and ensured that there are no obsolete masters that need to be
>> CLEANRUVed. I've also noticed that the errors _seem_ to affect only our
>> second and third suffixes. We have three suffixes defined, but I haven't
>> seen any error messages for the first one.
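>> The check for stale replica IDs was just a dump of the RUV tombstone in
>> each suffix, along these lines:
>>
>> ldapsearch -x -D "cn=directory manager" -W \
>>     -b "ou=people,dc=gted,dc=gatech,dc=edu" \
>>     "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))" \
>>     nsds50ruv
>>
>> Had anything obsolete shown up, the plan was to remove it by writing
>> "nsds5task: CLEANRUV<replica id>" to that suffix's cn=replica entry.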
>>
>> Has anyone seen anything like this? We're not sure if this is a general
>> 1.2.10.4 issue or if it only occurs when replicating from 1.2.8.3 to
>> 1.2.10.4. If it's the former, we cannot proceed with getting the rest of
>> the servers up to 1.2.10.4. If it's the latter, then we need to expedite
>> getting everything up to 1.2.10.4.
>
> These do not seem like issues related to replicating from 1.2.8 to 1.2.10.
> Have you tried a simple test of setting up two 1.2.10 masters and attempting
> to replicate your data between them?
Not yet, I may try this next, but it will take some time to set up.
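The gist of the test would be two throwaway 1.2.10 instances, each with the
changelog enabled and a supplier replica entry along the lines of the sketch
below (suffix, replica ID and bind DN are placeholders), plus a pair of
agreements pointing at each other:

dn: cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
objectclass: top
objectclass: nsDS5Replica
cn: replica
nsDS5ReplicaRoot: dc=example,dc=com
nsDS5ReplicaType: 3
nsDS5Flags: 1
nsDS5ReplicaId: 101
nsDS5ReplicaBindDN: cn=replication manager,cn=config

If that reproduces the changelog mismatch on restart, at least we'll know it
isn't a mixed-version problem.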
--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users