On Tue, Feb 28, 2023 at 16:12:25 +0100, Ondřej Kuzník wrote:
> On Tue, Feb 28, 2023 at 01:42:20PM +0100, Geert Hendrickx wrote:
> > We've had (and still have) this issue with large attributes and large 
> > multi-valued attributes with Zimbra (see previous discussion with Quanah),
> > where we applied sortvals and multival.  But in this scenario it's not the
> > case; all objects are of similar small size, with (mostly) single valued
> > attributes.  Yet our freelist reaches 200K+ free pages during periods with
> > heavy updates (mostly deletes/adds), which has a measurable impact on write
> > performance.
> 
> Hi Geert,
> are you sure it's the freelist and not the random access as pages become
> non-contiguous? The former would represent a constant decline in
> performance where the latter would eventually taper from high (best
> case) performance to regular performance you should be able to expect?
> Have you been able to rule that out?


mdb_copy -c fixes it, so I assume it's only the freelist size, not actual
fragmentation (mdb_copy doesn't reorder any data, right?).
Random access shouldn't matter much, as it's all on an SSD-based SAN.

Also, the decline isn't constant.  In normal operations, the freelist stays
fairly small (it is "consumed" all the time by regular updates).  Only
during batch updates (because of an ongoing migration) does it explode;
it then doesn't get "consumed" in time for the next batch update, which
causes performance degradation for subsequent batches.
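
A minimal sketch of how the freelist size could be tracked between batches,
for illustration only.  It assumes that "mdb_stat -f" prints a line of the
form "Free pages: N" in its freelist section; the function names, the
database path argument and the use of 200K as a threshold (taken from the
figure quoted above) are hypothetical choices, not part of any existing
tooling.

```python
import re
import subprocess
import sys

# Assumption: "mdb_stat -f <envdir>" reports the freelist with a line such as
# "  Free pages: 212345". Adjust the pattern if your LMDB build prints it differently.
FREE_PAGES_RE = re.compile(r"^\s*Free pages:\s*(\d+)\s*$", re.MULTILINE)


def parse_free_pages(mdb_stat_output):
    """Return the free page count found in mdb_stat-style text, or None."""
    match = FREE_PAGES_RE.search(mdb_stat_output)
    return int(match.group(1)) if match else None


def freelist_too_large(free_pages, threshold=200_000):
    """True when the count is known and at or above the (hypothetical) threshold."""
    return free_pages is not None and free_pages >= threshold


def read_free_pages(env_dir):
    """Run mdb_stat -f against an LMDB environment directory and parse the result."""
    result = subprocess.run(
        ["mdb_stat", "-f", env_dir],
        capture_output=True, text=True, check=True,
    )
    return parse_free_pages(result.stdout)


if __name__ == "__main__":
    # Usage: python freelist_check.py /path/to/lmdb/env
    pages = read_free_pages(sys.argv[1])
    print("free pages:", pages)
    if freelist_too_large(pages):
        print("freelist is large; consider pausing before the next batch")
```

Run between batches, a check like this would show whether the freelist has
been "consumed" again before the next batch starts.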


> After you kill accesslog, you disable deltasync. Since you're also
> restarting, the provider has no data on how to replay anything and needs
> to send the list of all entries (at least their UUIDs). This is
> expensive and slow. Replication seems to proceed in slow leaps that cost
> a *lot* of processing on the provider and a fair amount of bandwidth.
> Isn't that what you're seeing?


Yes, this is indeed the case, and it keeps doing that as long as updates
are coming in.  Once there are no updates for a full refresh cycle (e.g.
during the night, or because we pause updates) it is able to revert to
delta sync.


> After you kill accesslog, you disable deltasync. 

This is the essential part.  I always assumed it could proceed with
deltasync if the provider and replica have the same contextCSN, even with
an empty accesslog.

This probably went unnoticed for a long time since dropping the accesslog
on a non-active master causes no (visible) delays; the problem only shows
up on an active master.


Thanks for your insights, things are much clearer now, and we have adjusted
our processes accordingly.


        Geert
