Yes, I'll try to explain my needs more clearly. As often happens, I recently 
inherited a FreeIPA installation and am now responsible for managing the 
service. As someone who was not previously familiar with FreeIPA, I am in the 
process of building my expertise in managing it.
When I started, the monitoring setup consisted of node_exporter and 
process_exporter for the host, plus 389ds_exporter 
(https://github.com/terrycain/389ds_exporter) for the LDAP data. However, as 
the FreeIPA installation grew in size, we started encountering issues and 
realized that we lacked critical information to pinpoint their root causes. To 
address this, I have taken steps to improve the monitoring setup: I have 
started monitoring FreeIPA's BIND service with a separate exporter and 
exporting DNS queries to OpenSearch. Additionally, I have rewritten the 
389ds_exporter to include cn=monitor metrics to provide more visibility into 
the 389 Directory Server.
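
For context, the cn=monitor part of that rewrite boils down to roughly the 
following (a simplified Python/ldap3 sketch rather than the actual exporter 
code; the host, bind credentials and metric prefix are placeholders):

    # Rough sketch: read cn=monitor and print its numeric attributes in a
    # Prometheus-style text format. Connection details are placeholders.
    from ldap3 import Server, Connection, ALL

    server = Server("ldaps://ipa-replica.example.test", get_info=ALL)
    conn = Connection(server, user="cn=Directory Manager",
                      password="***", auto_bind=True)

    # cn=monitor exposes server-wide counters: current/total connections,
    # operations initiated/completed, threads, entries sent, and so on.
    conn.search("cn=monitor", "(objectClass=*)", attributes=["*"])
    for entry in conn.entries:
        for attr in entry.entry_attributes:
            value = entry[attr].value
            # keep only numeric counters, skip cn/objectClass/version/etc.
            if isinstance(value, int) or (isinstance(value, str) and value.isdigit()):
                print(f'ds389_monitor_{attr.lower()} {value}')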

I recently realized that I could also include 'cn=ldbm database' metrics, 
which are low-level but could be useful for troubleshooting the issues we are 
facing. The problems we are encountering are related to disk I/O, and having 
these metrics could provide valuable insight into the following:

1) Excessive paging out and increased swap usage without spikes in load. For 
example, after restarting a replica, swap usage grows to 30% (of 3 GB of swap 
space) over 1-2 days while there is at least 4 GB of available RAM on the 
host, and the main swap consumer is the ns-slapd service. So far I have only 
tried setting the swappiness parameter to zero, which did not help, so I guess 
there are other factors involved.

2) Spikes in I/O latency observed during modify and add operations, which 
were not present when the cluster was smaller (up to 10 replicas). I need to 
determine whether the issue lies with service tuning or with the cloud 
provider and its SAN, as we recently migrated to SSD disks without 
improvement. Regarding the "replication lag" I mentioned, these problems 
started appearing more often as new replicas were added, but for now we mostly 
observe them as outages of the services that rely on LDAP. The "waves" refers 
to the way the problem appears: different clients' VDCs run into trouble one 
after the other, which looks like replication propagation.

3) Master-master replication just looks to me like a big "black cloud" over 
which I have no control or insight. When you have a couple of hosts it may be 
fine to rely on the documented way of looking up the replicationStatus 
attribute, but when you have a couple of dozen I suspect things get rather 
less straightforward, or at least intuition suggests so. When I talk about 
replication observability, what I mean and what I'd like to see is the 
following:

Graph representation...

- ...of the time it took to replay a change (or, I guess, the time of a full 
replication session)
- ...of the number of simultaneous connections that suppliers try to establish 
with a consumer
- ...of the time spent waiting to acquire replica access

Those are just a few off the top of my head. I don't know for sure (and my 
first post was about this) whether it's really worth trying to get these kinds 
of metrics, or whether I just don't know what I'm talking about and it would 
be a waste of time and hard to implement. I mentioned bpf because I see it as 
the only way I could get them; the other option is to parse logs in DEBUG 
mode, which is not an option. (A rough sketch of what I can already pull from 
the agreement entries is below.)
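
To make 3) a bit more concrete, this is roughly what I can already pull from 
the agreement entries today without bpf or DEBUG logs (again a rough 
Python/ldap3 sketch with placeholder connection details; I'm not sure how far 
these attributes get me towards the graphs above):

    # Rough sketch: dump per-agreement replication status attributes from the
    # agreement entries so they can be scraped and graphed over time.
    # Connection details are placeholders.
    from ldap3 import Server, Connection, SUBTREE

    server = Server("ldaps://ipa-replica.example.test")
    conn = Connection(server, user="cn=Directory Manager",
                      password="***", auto_bind=True)

    conn.search("cn=mapping tree,cn=config",
                "(objectClass=nsds5replicationAgreement)",
                search_scope=SUBTREE,
                attributes=["nsDS5ReplicaHost",
                            "nsds5replicaLastUpdateStart",
                            "nsds5replicaLastUpdateEnd",
                            "nsds5replicaLastUpdateStatus",
                            "nsds5replicaUpdateInProgress",
                            "nsds5replicaChangesSentSinceStartup"])

    for agmt in conn.entries:
        # LastUpdateEnd - LastUpdateStart roughly brackets the last
        # replication session towards that consumer; Status carries the
        # last result.
        print(agmt.entry_dn)
        for attr in agmt.entry_attributes:
            print(f"  {attr}: {agmt[attr].value}")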

Besides seeing their impact on the problems above, with replication metrics 
I'm also trying to solve a more administrative task: I need to convince the 
architecture department to change the model for adding new replicas. Right now 
we basically add two replicas for every new client.


      +------------------------------+
      |           client#1           |
      |              VDC             |
      |                              |
      |       +--------------+       |       +---------------------+                 +---------------------+
      |       |              +-------------->+                     +---------------->+                     |   ...
      |       |  replica-01  |       |       |  common-replica-01  |                 |  common-replica-02  |
      |       |              +<--------------+                     +<----------------+                     |
      |       +--------------+       |       +---------------------+                 +---------------------+
      |            |    ^            |                | ^                                     | ^
      |            v    |            |                | |                                     | |
      |       +--------------+       |                | |                                     | |
      |       |              |       |                | |                                     | |
      |       |  replica-02  |       |                | |                                     | |
      |       |              |       |                | |                                     | |
      |       +--------------+       |                | |                                     | |
      |                              |                | |                                     | |
      |                              |                v |                                     v |
      +------------------------------+       +---------------------+                 +---------------------+
                                             |                     +---------------->+                     |
                                             |  common-replica-03  |                 |  common-replica-04  |
                                             |                     +<----------------+                     |
                                             +---------------------+                 +---------------------+

Which is not ideal at all (and, as I said, we have started to face problems). 
Their answer is that they are following the documented restrictions of no more 
than 4 replication agreements per replica and no more than 60 replicas in the 
master-master topology. For now these limits are indeed being respected, so I 
need to come up with a deeper analysis or show that the problem lies in 
fine-tuning the service.
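
To illustrate the kind of deeper analysis I have in mind, even something as 
simple as dumping the agreement topology from every replica and counting 
agreements per supplier would help the argument (another rough Python/ldap3 
sketch; the host list and credentials are placeholders):

    # Rough sketch: collect supplier -> consumer agreement edges from a list
    # of replicas, count agreements per supplier (the documented limit we
    # follow is 4), and print the topology as graphviz dot.
    # Hosts and credentials are placeholders.
    from ldap3 import Server, Connection, SUBTREE

    REPLICAS = ["replica-01.example.test",
                "common-replica-01.example.test"]  # placeholder host list

    edges = []
    for host in REPLICAS:
        conn = Connection(Server(f"ldaps://{host}"),
                          user="cn=Directory Manager", password="***",
                          auto_bind=True)
        conn.search("cn=mapping tree,cn=config",
                    "(objectClass=nsds5replicationAgreement)",
                    search_scope=SUBTREE, attributes=["nsDS5ReplicaHost"])
        for agmt in conn.entries:
            edges.append((host, agmt.nsDS5ReplicaHost.value))
        conn.unbind()

    for host in REPLICAS:
        outgoing = [e for e in edges if e[0] == host]
        print(f"{host}: {len(outgoing)} outgoing agreements")

    # the same edges in dot form, to actually look at the topology
    print("digraph replication {")
    for src, dst in edges:
        print(f'  "{src}" -> "{dst}";')
    print("}")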

So it's kind of a mishmash of everything at the same time; I hope I answered 
your question.

best regards, 
v.zh