> On 22 Dec 2019, at 08:22, Christophe Trefois <tre...@gmail.com> wrote:
> 
> First off, apologies for double posting to here and ipa mailing list, but we 
> are getting a bit uneasy, and also the issue seems to come from the code in 
> 389-ds directly, so this seems more appropriate.

Hi there, thanks for contacting us. Happy for you to post here.

> 
> We are using ipa-server ipa-server-4.6.5-11.el7.centos.3.x86_64 with 
> 389-ds-base-1.3.9.1-10.el7.x86_64 on CentOS 7.7.
> For a couple of days, some of our replicas have been logging "csngen_new_csn - 
> Sequence rollover; local offset updated." messages in the slapd error logs. 

This isn't a problem in itself, but you should investigate the possible causes. 
The short answer is that we are pushing the Lamport clock ahead due to either a 
high write rate or the system clock being stepped backwards.

To see the code look at:

https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/csngen.c#_195

As a sanity check, you should probably investigate:

* Whether you have an unexpectedly high write load in your environment
* Whether you have issues with ntp consistency on your machines (the clock 
continually being advanced or reversed)
* Whether a virtualised time sync service (e.g. vmware/libvirt) conflicts with 
ntp, causing time jumps
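
If you want to catch a clock step in the act, one rough approach is to compare 
the wall clock against a monotonic clock over an interval. This is only a 
diagnostic sketch and is not part of 389-ds; the function names and the 0.5 
second threshold are made up for illustration.

```python
import time


def clock_drift(wall_before, mono_before, wall_after, mono_after):
    """How far the wall clock moved relative to the monotonic clock, in seconds.

    A clearly negative value suggests the wall clock was stepped backwards,
    a clearly positive one that it was stepped forwards.
    """
    return (wall_after - wall_before) - (mono_after - mono_before)


def watch(interval=1.0, threshold=0.5, samples=60):
    """Sample both clocks and report steps larger than the (assumed) threshold."""
    wall, mono = time.time(), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        new_wall, new_mono = time.time(), time.monotonic()
        drift = clock_drift(wall, mono, new_wall, new_mono)
        if abs(drift) > threshold:
            print(f"wall clock stepped by {drift:+.3f}s")
        wall, mono = new_wall, new_mono
```

Running watch() on a replica for a while should stay silent if the clock is 
only being slewed gently; any output points at something stepping the time.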


For a slightly longer explanation: the CSN is a Lamport clock, i.e. it can only 
advance and never step back. It's based on the current unix time in seconds, 
with a 16-bit sub-counter, i.e. we can have 65535 writes "per second".
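
To make the ordering concrete, here is a simplified model of that pair. It is a 
sketch for illustration only: the names Csn and sort_key are mine, and it 
models just the seconds and sub-counter parts described above.

```python
from typing import NamedTuple

SEQ_BITS = 16
SEQ_MAX = (1 << SEQ_BITS) - 1  # largest value the 16-bit sub-counter can hold


class Csn(NamedTuple):
    """Simplified CSN: unix time in seconds plus a 16-bit sub-counter."""
    seconds: int
    seq: int


def sort_key(csn):
    """Pack both parts into one integer; seconds dominate, seq breaks ties."""
    return (csn.seconds << SEQ_BITS) | csn.seq
```

Tuples already compare field by field, so Csn(1000, 65535) < Csn(1001, 0) 
holds: a later second always wins, and the sub-counter only orders writes 
inside the same second.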

This matters because if you have, say:

Write object A
Ntp syncs clock backwards
Write object B

We need the CSNs of these to still reflect the true order of operations, that A 
occurs before B, as we use time as the sync source between replicas rather than 
locking/consensus. If the CSN weren't a Lamport clock, the changelog would show 
B before A, which is incorrect for reasons that are extremely complex and subtle.

So with the CSN being a Lamport clock, if ntp sets your time backwards, the CSN 
stays at the "highest" time, and the sub-counter keeps incrementing. If this 
continues for a long time, we overflow the 16-bit sub-counter. We can't have 
duplicate CSNs, so the local offset (aka seconds) is increased to keep pushing 
the CSNs forward.
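
As a rough sketch of that behaviour (this is not the actual csngen.c logic, 
just a minimal model of it; the class and attribute names are my own, and the 
injectable clock is only there so you can simulate ntp stepping the time):

```python
import time

SEQ_MAX = 0xFFFF  # 16-bit sub-counter


class CsnGenerator:
    """Toy Lamport-style CSN generator returning (seconds, seq) tuples."""

    def __init__(self, clock=time.time):
        self._clock = clock      # assumption: injectable so tests can step it
        self.local_offset = 0    # seconds added on top of the system clock
        self.last_time = 0       # highest time handed out so far
        self.seq = 0             # sub-counter within last_time

    def new_csn(self):
        now = int(self._clock()) + self.local_offset
        if now > self.last_time:
            # The clock moved forward: adopt the new second, reset the counter.
            self.last_time = now
            self.seq = 0
        elif self.seq < SEQ_MAX:
            # Same second, or the clock went backwards: stay at the highest
            # time seen and just bump the sub-counter.
            self.seq += 1
        else:
            # Sequence rollover: the counter is exhausted, so push the local
            # offset (and therefore the CSN time) forward by one second.
            self.local_offset += 1
            self.last_time += 1
            self.seq = 0
        return (self.last_time, self.seq)
```

With a clock that jumps from 1000 back to 990, the first two CSNs come out as 
(1000, 0) and (1000, 1), so A still sorts before B. The last branch is the 
situation your log message is reporting.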

That's why I recommend you check your write load and ntp/system time.


Hope that helps, 

> 
> We use the python "ipa_check_consistency" and replication seems to be fine. 
> 
> We checked all replicas, and they are all in time sync with ntp (updated) 
> with no visible offset. 
> 
> Is this anything to worry about, and how can we make those messages stop 
> appearing?
> _______________________________________________
> 389-users mailing list -- 389-users@lists.fedoraproject.org
> To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org
> Fedora Code of Conduct: 
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: 
> https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org

—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs