On Tue, Aug 12, 2025 at 06:50:11AM +0000, Windl, Ulrich wrote:
> From: Ondřej Kuzník <on...@mistotebe.net>
>> On Tue, Aug 05, 2025 at 12:50:44PM +0000, Windl, Ulrich wrote:
>>> Hi!
>>>
>>> I have a support case from SUSE's version of slapd open, and I wonder
>>> about one specific statement from support:
>>>
>>> A core dump is triggered by
>>>
>>> syncprov.c:2360:
>>>
>>> assert( !BER_BVISEMPTY( &oldestcsn ) && !BER_BVISEMPTY( &newestcsn ) &&
>>>         ber_bvcmp( &oldestcsn, &newestcsn ) < 0 );
>>>
>>> Support explained: "Any of these indicates the changelog (accesslog)
>>> is in a completely inconsistent or corrupted state."
>>
>> Hi Ulrich,
>> based on what have they concluded this? There is very little to go on
>> in what you've provided here.
>
> Well, support had received a core dump which you don't have, of course.

Hi Ulrich,
the core dumps should still be stored on the system. You can also tell
SUSE support to reach out to the project if they need help. That is what
they would/should do with any other open-source project anyway.
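In case it helps whoever ends up looking at them, here is a rough sketch
of pulling a full backtrace out of one of those stored cores. This
assumes systemd-coredump caught them (the "Started Process Core Dump"
line in your log below suggests it did) and that the matching debuginfo
packages for the slapd build are installed; otherwise the backtrace will
be mostly unreadable:

  # list the cores systemd-coredump has recorded for slapd
  coredumpctl list slapd

  # open the most recent slapd core in gdb
  # (or: coredumpctl gdb <PID shown in the list>)
  coredumpctl gdb slapd

  # inside gdb, dump the state of every thread
  (gdb) thread apply all bt full

The output of "thread apply all bt full" is what is usually meant by a
"full backtrace" and is the first thing anyone debugging a crash like
this will ask for.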
> As I remember it we had two servers using bi-directional
> delta-syncrepl that also pulled updates from a third server using
> RefreshAndPersist (as that server was still running slapd 2.4).
> During migration of the third server to OpenLDAP 2.5 sync did not work
> as expected (I had made some mistakes), so I dumped the main DIT on
> the third server and slapadd-ed the LDIF into the two servers,

but I remember asking questions about this. Did you use slapcat? Did
you use the actual LDIF you got (no "reprocessing"), making sure you
slapadd *without* -w?
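For the avoidance of doubt, this is the shape of procedure I mean. It is
only a sketch: the suffix and the LDIF path are made-up examples, adjust
them to your actual configuration:

  # on the server you are exporting from, with slapd stopped:
  slapcat -b "dc=example,dc=com" -l /var/tmp/main.ldif

  # on each server being reloaded, with slapd stopped and the database
  # directory emptied first; note there is no -w, so the contextCSN
  # values are loaded exactly as they were exported:
  slapadd -q -b "dc=example,dc=com" -l /var/tmp/main.ldif

(-w would have slapadd recompute the contextCSN from the entries it
loads, which defeats the point of carrying the exported values over
unchanged. Also remember to fix ownership of the resulting database
files for whatever user slapd runs as before starting it again.)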
> when starting those, there was some problem (I think the servers just
> refused to respond (i.e. took "forever" instead of responding)), so

Something to get logs out of, and to look at cn=monitor, ...

> support had sent a "repaired" version of slapd.
> On the very first attempt to run that fixed version it dumped core, so
> I had contacted support again, asking why it would dump core.
> Support also had received some messages being written before the core
> dump.
>
> The core dumps are actually not all the same, but the last one I saw
> was like this:
>
> Jul 10 08:37:13 h02 slapd[1047558]: conn=-1 op=0 accesslog_response: got
> result 0x44 adding log entry reqStart=20250710063713.000001Z,cn=audit

The above *might* be related to ITS#10379. However, as for the rest:

> Jul 10 08:37:13 h02 slapd[1047558]: slap_sl_malloc of 93818789174842 bytes
> failed
> Jul 10 08:37:13 h02 kernel: __vm_enough_memory: pid: 1047562, comm: slapd,
> not enough memory for the allocation
> Jul 10 08:37:13 h02 kernel: __vm_enough_memory: pid: 1047562, comm: slapd,
> not enough memory for the allocation
> Jul 10 08:37:13 h02 kernel: __vm_enough_memory: pid: 1047562, comm: slapd,
> not enough memory for the allocation
> Jul 10 08:37:13 h02 systemd[1]: Started Process Core Dump (PID 1627168/UID 0).

this is unlikely to be related to ITS#10379. Again, without a "full
backtrace" there is very little we can say.

> As SUSE's SLES version is not the "plain vanilla" type of slapd I
> think SUSE should do that if they feel the bug is from the base they
> are using.

You can check that for yourself[0]: spoiler alert, they don't apply any
patches, so they are probably more "plain vanilla" than you think. As
such, whatever crashes you're seeing probably affect 2.5.18 as released
(on the other hand, 2.5.18 is not even the latest OpenLDAP 2.5). Unless
they're providing you something that's kept private, that is.

>>> The support recommends to reset the CSNs by disabling any replication
>>> (which doesn't remove those IMHO) and "either using syncrepl or
>>> delta-syncrepl, but not mixing both."
>>>
>>> I don't see a problem if one dependent server gets the changes through
>>> "classic methods" (e.g. Refresh), and another server gets updates
>>> through delta-syncrepl. Am I wrong?
>>
>> Stating "either using syncrepl or delta-syncrepl, but not mixing both."
>> sounds concerning. You haven't provided any sort of configuration
>> snippets or even a basic description of your setup to say if we should
>> be concerned about this.
>
> Read the description I provided at the start. Sometimes it's tricky to
> upgrade an MMR configuration "online", and I decided to break apart the
> configuration temporarily, so that the newer servers would use
> delta-syncrepl between each other while doing RefreshAndPersist for
> the old servers. Of course sync of configuration was broken at that
> time too.

2.4 has the same deltasync capabilities as 2.5/2.6, there is no reason
to run a strange combination of replication types in the same MPR
cluster.

Again, there is so much useful information that you are withholding. If
you want some help/advice, start over and set the scene: how you
designed your environment, what is not working, what you are doing, and
what you see in the logs and on the LDAP level.

If you want help getting to a stable footing, at a minimum:
- update your configuration to harmonise all providers' configs (even if
  some are at 2.4, just don't enable new functionality), triple check
  your ACLs are up to scratch for replication (read access to
  *everything*)
- since you're going to be planning an outage to do this, also select
  one host and during the outage slapcat this host's main DB, wipe all
  hosts' DBs (including accesslogs!, especially if doing deltasync) and
  slapadd the copy you just reserved as-is everywhere
- get contextCSN monitoring set up - this is your last-resort warning
  system if replication stalls (a quick manual check is sketched in the
  PS below). If that happens, examine logs ("sync" level); 99% of the
  time this is due to misconfiguration, and seeing where things went
  wrong should give you hints on where to look
- once you're stable and ready to drop 2.4, you can do so and that lets
  you finally enable features that 2.4 doesn't have

>> Are these servers also providers? All providers need to have
>> *identical* configuration and *full* read access to the other
>> providers' DBs (both main DB and accesslog if used).
>
> Again, read above: The new servers were also providers.

You seem to have missed the point of this note: All providers need to
have *identical* configuration and *full* read access to the other
providers' DBs (both main DB and accesslog if used).

>>> Finally support concludes: "Please note that these types of
>>> replication integrity issues do not affect 389 Directory Server, which
>>> uses a more robust mechanism for change tracking and includes a proper
>>> Lamport clock implementation."
>>
>> AFAIK 389DS replication is push based, so the design and behaviours are
>> quite different. Also assuming we're looking at the above, their
>> comment seems somewhat random in context.
>
> Well, SLES 15 officially had abandoned OpenLDAP in favor of 389DS (but
> did not provide usable tools or documentation to allow a successful
> migration of the databases), but then (maybe due to external pressure)
> decided to re-support OpenLDAP starting at SP5 of SLES 15 (or so). So I
> guess support wanted to say that I should use 389DS instead, but that
> isn't an option now.

If you don't trust they are capable of supporting you, you have another
option: use Symas[1] or LTB[2] packages and/or contract someone[3] to
assist you in depth.

[0]. https://build.opensuse.org/package/show/SUSE:SLE-15-SP5:Update/openldap2_5
[1]. https://repo.symas.com/
[2]. https://ltb-project.org/download.html
[3]. https://openldap.org/support/

Regards,

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
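PS: regarding the contextCSN monitoring point above, a minimal manual
check looks roughly like the following. The host names and the suffix
are only examples; substitute your own, and add bind options (-D/-W) if
your ACLs don't allow anonymous reads of contextCSN:

  for H in ldap01.example.com ldap02.example.com ldap03.example.com; do
    echo "== $H"
    ldapsearch -x -H "ldap://$H" -s base -b "dc=example,dc=com" contextCSN
  done

Healthy providers converge on the same set of contextCSN values (one per
serverID); a value that stops advancing on one host while it keeps
moving on the others is exactly the stall you want your monitoring to
catch.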