[Issue 10293] Log operations generated by syncrepl at STATS level

openldap-its Wed, 19 Nov 2025 09:05:20 -0800

https://bugs.openldap.org/show_bug.cgi?id=10293


--- Comment #9 from [email protected] <[email protected]> ---
(In reply to Ondřej Kuzník from comment #6)
> On Mon, Aug 18, 2025 at 10:05:50PM +0000, [email protected] wrote:
> > The requirement is for STATistics (STAT level) records representing the
> > activity on an instance of slapd that is initiated/processed because of
> > `syncrepl` traffic initiated by a consumer (Master or Replica) via a 
> > syncrepl
> > query. At a minimum it should report the "operations" performed by the 
> > relevant
> > Provider (Master, by definition) and made available by `syncprov` for 
> > retrieval
> > by consumers. 
> 
> Hi Marty,
> on the consumer, we only see what is received not what provider
> performed. Even more so in a refresh. This matters and it's why I keep
> asking what you *need*.

And this gets to the heart of what "I NEED" (on behalf of the customer). For
each update request handled by any producer the same update is performed on
every consumer (replicated). I MAY have included irrelevant SRCH and CMP ops in
my original work but the ADD, MOD, MODDN, and DEL changes are reflected and
processed as received. I only suggest that those operations are invisible to
all but those processing the SYNC level logs and deserve to be represented in
the STATS level log as a courtesy to the users of simplified log reduction
packages like SPLUNK and DATA DOG..

> 
> > It is important that the "operations" performed during refresh
> > also be consistently represented in the STAT log as these represent work 
> > being
> > done on the relevant server in support of achieving consistency of data 
> > through
> > the replication mechanism(s).
> 
> There is no consistency, refresh (all forms), persist and delta are
> wildly different in what provider meant and what we do with it. We can
> still log what we *receive*, not each individual work item involved in
> dealing with the message. Especially in the context of what you say
> further on.

This feels a lot like quibbling. I see what I'm trying to get promoted out of
the SYNC log to the STATS log clearly in the SYNC log.

> 
> > The requirement includes MUST make these records
> > in general format to the existing STAT-level records for BIND, ADD, SRCH, 
> > etc.
> > These records MAY be represented merely as RESULT (style) records as there 
> > is
> > little to report about the start/receipt of the "operation". 
> > 
> > The justification for this requirement is that the STAT(istics) level 
> > logging
> > is expected to represent the actual load and performance on a server ... its
> > representation of "local" load which is often quite different than the 
> > actual
> > "load" on other servers in the cluster (RT3402 for example). And the 
> > difference
> > is quite often found to be the operation of the replication mechanism(s).
> > Without meeting the requirement above the end-user or support engineer has 
> > to
> > study the SYNC level log records to understand the true "load" on the 
> > server.
> 
> You can and should observe true "load" through cn=monitor already - the
> average number of active workers (threads). Nothing will come close to
> this metric in terms of accuracy.

In spite of the enormous value of the cn=monitor counters they are largely
ignored by the tools used by customers. AWS has made no effort to add them to
Cloudwatch and we, by offering our Telegraf/Prometheus work to them will at
least take a step to showing them what SHOULD be done to support our OpenLDAP
customers using their platform.

> 
> Based on the above I believe you're actually interested in the
> contribution that syncrepl consumers make to this true "load" and will
> interpret the rest accordingly.
> 
> > Furthermore, the monitoring tools would be required to understand the more
> > complex SYNC logging formats and content. Most customers rely at best on AWS
> > Cloudwatch (a Grafana based system) and/or batch log reduction programs like
> > SPLUNK and DataDog which lack the more complex processing capabilities to
> > understand SYNC logging. The best of our customers have no idea what the 
> > total
> > "load" being presented to their OpenLDAP servers and are unable to diagnose
> > performance issues without it.
> 
> Instead of flooding the STATS level with yet another firehose of data,
> wouldn't exposing a cn=monitor counter like "total time spent on
> processing messages from this provider" be better? That's the exact
> contribution you're looking for, no expensive log processing needed
> *and* fairly easy to provide.
> 

For the reasons above about the lack of actual customer use of cn=monitor I
would respond with NO. cn=monitor is only valuable processed into a time-series
database and through analysis tools like Grafana and Prometheus. And it lacks
information like the etime (already hidden in the SYNC log) and rid from which
the update was replicated.

> If not, let's be concrete then, have a look at the below and see if it
> covers what you're after. "etime=" are seconds since we started
> processing the message, so to the extent you consider current etime=
> logging to represent "load", it should too. There is no "hidden"
> processing time except overheads that would have been hidden from
> current STATS messages as well. And certainly no double-counting which
> your previous suggestions would have been prone to.
> 
> RFC4533 which we're processing defines the following message types (I
> won't touch on other replication protocols we support e.g. dirsync):
> 
> Intermediate "newcookie" message:
> rid=001 SYNC NEW_COOKIE cookie=<cookie value> etime=0.123

Right ... that was actually an addition we made at least partially to earlier
suggestions/pressure from my requests. No?

> 
> For a refreshDelete/refreshPresent Intermediate message:
> rid=012 SYNC REFRESH_DELETE refreshDone=0 cookie=<cookie value if sent>
> etime=0.012
> rid=012 SYNC REFRESH_PRESENT refreshDone=0 cookie=<cookie value if sent>
> etime=0.012
>

At no time did I ask for that.

> For a syncIdSet (present/delete phase contents) Intermediate message:
> rid=123 SYNC ID_SET delete=0|1 cookie=<cookie value if sent> etime=0.234
> result=<"processed"/"failed">
> 

Or that.

> For a search entry message regardless of how we ended up interpreting
> it:
> rid=201 SYNC ENTRY dn="<dn as received>" state=<"state" field from the
> control: 0|1|2|3> cookie=<cookie value if sent> etime=0.123
> result=<"processed"/"skipped"/"failed">
>

I wonder what the producer would be searching for directly via the replication 
process. I was under the impression this was part of something like delta-sync
and not in reference to recording the occurence of an update replicated down. I
would be happy to better understand how this fits such that it
could/should/would be part of what I think the customer wants/needs.

> For a search result message (end of refreshOnly):
> rid=321 SYNC RESULT err=<resultCode as received> delete=0|1 cookie=<value if
> sent> etime=0.... result=<"processed"/"failed">
>

See previous comment about search in the SYNC information log.

> I believe this is a little too much information to put into the logs but
> this is what might fit the description above. It might also not be
> feasible to provide "result=skipped" messages and they might show
> "processed" instead, can't tell right now.
>

I hope taking all but the actual updates as requested through the original
producer off the table it is less of a firehose and more reasonable.

> Regards,

Marty

-- 
You are receiving this mail because:
You are on the CC list for the issue.

[Issue 10293] Log operations generated by syncrepl at STATS level

Reply via email to