[389-users] Re: Determining max CSN of running server

Thierry Bordaz Thu, 29 Feb 2024 01:48:35 -0800


On 2/29/24 05:12, William Faulk wrote:

Might be worth re-reading

Well, I still don't really know the details of the replication process.


I have deduced that changes originated on a replica seem to prompt that replica 
to start a replication process with its peers, but I don't really know what 
happens then.

Replication is done by a replication agreement (RA), which is woken up when a new update gets into the changelog. The new update can be received directly from an LDAP client or from replication itself.

There's a comparison of the RUVs of the two replicas, but does the initiating 
system send its RUV to the receiver, or does it go the other way, or do both 
happen?

IIRC only the remote replica sends its RUV. The RA receiving the RUV then compares it with its own RUV to determine the oldest update that the remote replica is missing.

Does the comparison prompt the comparing system to send the changes it thinks 
the other system needs, or does it cause the comparing system to request new 
changes from the other?

Yes, the RUV contains the latest received updates for all the replicas.

Maybe none of this really makes much difference, but the lack of technical 
detail around this makes me just question everything.

It makes perfect sense, and it shows you already know the replication process well.

It doesn't send a single CSN, the replication compares the RUVs and determines
the range of CSNs that are missing from the consumer.
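
To illustrate the comparison described above, here is a minimal Python sketch. It is not the server's code. It assumes each RUV has already been reduced to a dict mapping replica id to max CSN, and that CSNs are fixed-width 20-character hex strings, so that plain string comparison orders them. The name missing_ranges is made up for this example.

```python
def missing_ranges(supplier_ruv, consumer_ruv):
    """Return {replica_id: (consumer_maxcsn, supplier_maxcsn)} for every
    replica id where the consumer is behind the supplier.

    Both arguments map replica id -> max CSN string. A consumer_maxcsn of
    None means the consumer has seen nothing from that replica id yet.
    """
    ranges = {}
    for rid, supplier_max in supplier_ruv.items():
        if supplier_max is None:
            continue  # supplier has nothing to offer for this replica id
        supplier_max = supplier_max.lower()
        consumer_max = consumer_ruv.get(rid)
        if consumer_max is not None:
            consumer_max = consumer_max.lower()
        if consumer_max is None or consumer_max < supplier_max:
            ranges[rid] = (consumer_max, supplier_max)
    return ranges


# Hypothetical RUVs: replica 1 is behind on the consumer, replica 2 is in sync.
supplier = {1: "65dfa1b3000000010000", 2: "65dfa0a0000000020000"}
consumer = {1: "65a0c2f1000000010000", 2: "65dfa0a0000000020000"}
print(missing_ranges(supplier, consumer))
```

An empty result means the supplier sees nothing to send in this direction.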

Sure, but notionally any changes that originated on that replica would be 
reflected in the max CSN for itself in the RUV that is used to compare. And at 
least one side is sending its RUV to the other during the replication process.

Yes, the remote replica (named the consumer, IIRC) sends back its RUV in response to the request sent by the RA.

It's also not immediate. When the server accepts a change (add, mod, etc.), the
change is associated with a CSN. But then there may be a delay before the two
nodes actually communicate and exchange data.

Sure, but the changes originated on this replica haven't made it to other 
replicas in weeks. This isn't a mere delay in replication.

Usually replication occurs within a few seconds. If a change has not been replicated for weeks, then replication is broken, and you need to identify the reason for that breakage in the replication debug logs from both sides (supplier/consumer).

Generally you'd need replication logging (errorloglevel 8192). But it's very
noisy and can be hard to read. What you need to see is the ranges that they
agree to send.

Okay. I've done that and haven't had a chance to pore through them yet.

They are quite difficult to read, especially if there are multiple RAs at work. You may want to read the code in parallel to understand the purpose of those messages.

Also remember CSNs are a monotonic Lamport clock. This means they only ever
advance and can never step backwards. So they have some different properties
to what you may expect. If they ever go backwards I think the replication
handler throws a pretty nasty error.

I don't think it's going backwards. What I'm trying to rule out is that the 
replica is failing to advance its max CSN in the RUV being used to compare.

Compare the RUVs. You need to dump the RUV on both servers (consumer/supplier), then compare the maxcsn PER replica. Replication will start from the CSN that is the smallest of the maxcsn values. So a maxcsn may not move until all the others are in sync.
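
As a rough aid for that per-replica comparison, here is a small Python sketch. It assumes you have already dumped the nsds50ruv values from each server (they are usually read from the RUV tombstone entry of the suffix, nsUniqueId=ffffffff-ffff-ffff-ffff-ffffffffffff) and that each value looks like "{replica <id> <url>} <mincsn> <maxcsn>". The helper names and the sample values are invented for the example.

```python
import re

# Matches values such as "{replica 1 ldap://host:389} <mincsn> <maxcsn>".
REPLICA_RE = re.compile(r"^\{replica\s+(\d+)(?:\s+([^}]*))?\}\s*(.*)$")


def parse_ruv(values):
    """Map replica id -> maxcsn (None if the element carries no CSNs)."""
    ruv = {}
    for value in values:
        m = REPLICA_RE.match(value.strip())
        if not m:
            continue  # skips the {replicageneration} element
        csns = m.group(3).split()
        ruv[int(m.group(1))] = csns[1].lower() if len(csns) >= 2 else None
    return ruv


def compare_maxcsn(supplier_values, consumer_values):
    """Per replica id, report both maxcsn values and the smaller of the two."""
    sup, con = parse_ruv(supplier_values), parse_ruv(consumer_values)
    report = {}
    for rid in sorted(set(sup) | set(con)):
        a, b = sup.get(rid), con.get(rid)
        smallest = min(a, b) if a is not None and b is not None else None
        report[rid] = {"supplier": a, "consumer": b,
                       "smallest": smallest, "in_sync": a == b}
    return report


# Invented sample dumps from a supplier and a consumer.
supplier_values = [
    "{replicageneration} 65000000000000010000",
    "{replica 1 ldap://a.example.com:389} 65000001000000010000 65dfa1b3000000010000",
    "{replica 2 ldap://b.example.com:389} 65000002000000020000 65dfa0a0000000020000",
]
consumer_values = [
    "{replicageneration} 65000000000000010000",
    "{replica 1 ldap://a.example.com:389} 65000001000000010000 65a0c2f1000000010000",
    "{replica 2 ldap://b.example.com:389} 65000002000000020000 65dfa0a0000000020000",
]
for rid, row in compare_maxcsn(supplier_values, consumer_values).items():
    print(rid, row)
```

The string comparison relies on CSNs being fixed-width hex, which is why the sketch lowercases them first.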

I *think* so. It's been a while since I had to look. The nsds50ruv shows the
ruv of the server, and I think the other replica entries are "what the peers
ruv was last time".

Well, it's at least nice to hear that my guess at least isn't asinine. :)

replication monitoring code in newer versions does this for you, so I'd probably
advise you attempt to upgrade your environment. 1.3 is really old at this point

I've been trying to get the current environment stable enough that I feel 
comfortable going through the relatively lengthy upgrade process. I think I'm 
going to have to adjust my comfort level.

(I'm not sure if even RH or SUSE still support that version anymore).

RedHat does, as it's what's in RHEL7.9, which is supported for another, uh, 4 
months. They're working on this with me. I'm still just trying to understand 
the system better so that I can try to be productive while I'm waiting on them 
to come up with ideas.

The problem here is that to read the RUVs and then compare them, you need to
read each RUV from each server and then check if they are advancing (not that
they are equal).

The problem is that the changes in my environment are few enough that all the 
replicas' RUVs _are_ equal the majority of the time. I'm not in front of that 
system as I respond right now, so my details might be wrong, but I'm asking 
about all of this because every RUV I see in all of the replicas is the same, 
and it shows a max CSN for this one replica that's much older than the CSNs I 
see it reference in the logs about changes originating on the replica. The CSNs 
I see in the logs when a new change is made are referencing the current time in 
them, while the max CSN I see in the RUVs is from 4 months ago.
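
For comparing the time embedded in a CSN with the one in the RUV, a short Python sketch can decode it. It assumes the usual 20-hex-digit layout: 8 digits of timestamp (seconds since the epoch), then 4 digits each of sequence number, replica id and sub-sequence number. The names CSN and parse_csn are made up for this example.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text):
    """Split a 20-hex-digit CSN string into its four fields."""
    text = text.strip()
    if len(text) != 20:
        raise ValueError("expected 20 hex digits, got %r" % text)
    seconds = int(text[0:8], 16)
    return CSN(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


# Hypothetical CSN; substitute the maxcsn from the RUV or a CSN from the logs.
csn = parse_csn("65e00000000000010000")
print(csn.timestamp.isoformat(), csn.replica_id)
```

Decoding the maxcsn from the RUV and a CSN from a recent log line side by side makes a gap of months obvious.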

Maybe it *did* go backwards somehow and that's why it's not working. Not that 
that would really help me understand what actually went wrong any better than I 
do now.

Something important with the RUV is the 'replicageneration': it should be identical on both sides.

For the problematic server, does the RUV evolve or not?
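
Both checks can be scripted. The Python sketch below assumes the raw nsds50ruv values are available as strings, with the generation stored as "{replicageneration} <value>", and that two snapshots of one server's RUV have been reduced to dicts of replica id to maxcsn. The function names are invented for the example.

```python
GENERATION_PREFIX = "{replicageneration}"


def replica_generation(values):
    """Return the replicageneration value from nsds50ruv values, or None."""
    for value in values:
        value = value.strip()
        if value.startswith(GENERATION_PREFIX):
            return value[len(GENERATION_PREFIX):].strip()
    return None


def same_generation(values_a, values_b):
    """True only if both sides carry a generation and the two are identical."""
    a, b = replica_generation(values_a), replica_generation(values_b)
    return a is not None and a == b


def advanced_replicas(earlier, later):
    """Replica ids whose maxcsn moved forward between two snapshots.

    earlier and later map replica id -> maxcsn (20 hex digits), taken from
    the same server at two different times.
    """
    moved = []
    for rid, csn in later.items():
        if csn is None:
            continue
        before = earlier.get(rid)
        if before is None or csn.lower() > before.lower():
            moved.append(rid)
    return sorted(moved)
```

If a change is made on the problematic server between the two snapshots and its own replica id is not in the result, its maxcsn is not evolving.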

If you want to assert that "Some change I made at CSN X is on all servers" then
you would need to read and parse the ruv and ensure that all of them are at or
past that CSN for that replica id.
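
A minimal Python sketch of that assertion follows. It assumes each server's RUV has been parsed into a dict of replica id to maxcsn, and that the replica id is the third 4-hex-digit field of the CSN. The server names and the function change_is_everywhere are hypothetical.

```python
def csn_replica_id(csn):
    """Extract the replica id (hex digits 12..15) from a 20-digit CSN."""
    return int(csn[12:16], 16)


def change_is_everywhere(csn, ruvs_by_server):
    """Return (ok, lagging): ok is True when every server's maxcsn for the
    originating replica id is at or past csn; lagging lists the servers
    that are behind or have no entry for that replica id."""
    csn = csn.lower()
    rid = csn_replica_id(csn)
    lagging = []
    for server, ruv in ruvs_by_server.items():
        maxcsn = ruv.get(rid)
        if maxcsn is None or maxcsn.lower() < csn:
            lagging.append(server)
    return (not lagging, sorted(lagging))


# Hypothetical parsed RUVs from three servers.
ruvs = {
    "ds1.example.com": {1: "65dfa1b3000000010000"},
    "ds2.example.com": {1: "65dfa1b3000000010000"},
    "ds3.example.com": {1: "65a0c2f1000000010000"},
}
print(change_is_everywhere("65dfa000000000010000", ruvs))
```

This only says the maxcsn has passed X on every server; it cannot show that every individual change before X was actually applied.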

Well, you'd think so. I've got that problem, too, where some CSNs just seem to 
get missed, but the max CSN in the RUV is well past that. But that's a 
different problem and not the one I'm working on now.

Thanks for the input.
