Re: [389-users] replication from 1.2.8.3 to 1.2.10.4

Rich Megginson Thu, 12 Jul 2012 13:53:23 -0700

On 07/12/2012 02:47 PM, Robert Viduya wrote:

On Jul 12, 2012, at 11:36 AM, Rich Megginson wrote:

On 07/12/2012 08:50 AM, Robert Viduya wrote:

On Jul 11, 2012, at 7:17 PM, Rich Megginson wrote:

On 07/11/2012 11:12 AM, Robert Viduya wrote:

So is it possible that the hub was

This question seems incomplete?

Sorry, I didn't mean to send that.

ok - please follow the directions at 
http://port389.org/wiki/FAQ#Debugging_Crashes to enable core files and get a 
stack trace

Also, 1.2.10.12 is available in the testing repos.  Please give this a try.  
There were a couple of fixes made since 1.2.10.4 that may be applicable:

Ticket 336 [abrt] 389-ds-base-1.2.10.4-2.fc16: index_range_read_ext: Process 
/usr/sbin/ns-slapd was killed by signal 11 (SIGSEGV)
Ticket #347 - IPA dirsvr seg-fault during system longevity test
Ticket #348 - crash in ldap_initialize with multiple threads
Ticket #361: Bad DNs in ACIs can segfault ns-slapd
Trac Ticket #359 - Database RUV could mismatch the one in changelog under the 
stress
Ticket #382 - DS Shuts down intermittently
Ticket #390 - [abrt] 389-ds-base-1.2.10.6-1.fc16: slapi_attr_value_cmp: Process 
/usr/sbin/ns-slapd was killed by signal 11 (SIGSEGV

I've enabled the core dump stuff, but now I can't seem to get it to crash.  But 
I'm still getting the changelog messages in the error logs whenever I restart.  
In addition, the hub server keeps running out of disk space.  I tracked it down 
to the access log filling up with MOD messages from replication.  It looks like 
changes are coming down from our 1.2.8 servers and being applied over and over 
again.  As an example, one of our entries was modified three times today, and 
on all our other machines I see the following in the access log file:

# egrep 78b8cc871a3cda9f352580e797b270bc access
[12/Jul/2012:11:00:59 -0400] conn=383671 op=3145 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:11:01:24 -0400] conn=383671 op=3153 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:11:01:38 -0400] conn=383671 op=3157 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"

But on the problematic hub server, I see:

# egrep 78b8cc871a3cda9f352580e797b270bc access
[12/Jul/2012:15:17:29 -0400] conn=2 op=58 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:17:29 -0400] conn=2 op=60 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:17:29 -0400] conn=2 op=61 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:42 -0400] conn=6 op=169 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:42 -0400] conn=6 op=171 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:42 -0400] conn=6 op=172 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:45 -0400] conn=3 op=170 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:45 -0400] conn=3 op=172 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:45 -0400] conn=3 op=173 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:51 -0400] conn=2 op=2234 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:51 -0400] conn=2 op=2236 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:51 -0400] conn=2 op=2237 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:55 -0400] conn=6 op=2233 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:55 -0400] conn=6 op=2235 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:55 -0400] conn=6 op=2236 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
[12/Jul/2012:15:24:57 -0400] conn=3 op=2234 MOD 
dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
...

I truncated the output for brevity, but there's over 250 MODs to that one 
object.  It's as if the server isn't able to do the replication bookkeeping and 
is accepting changes over and over again.  Eventually the disk fills up.

Do you see error messages from the supplier suggesting that it isattempting to send the operation but failing and retrying?

Do all of these operations have the same CSN? The csn will be loggedwith the RESULT line for the operation. Also, what is the err=? for theMOD operations? err=0? Some other code?


I just upgraded it to 1.2.10.12 as suggested and just to be safe, I'm doing a 
clean import.  We'll see how it goes.

--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users


--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] replication from 1.2.8.3 to 1.2.10.4

Reply via email to