On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
> It's becoming untenable how a plain refresh cannot be represented in
> accesslog in a way that's capable of serving a deltasync session.
> Whatever happens, we have lost a fair amount of information needed to
> run a proper deltasync, yet if we don't want to abandon this
> functionality, we have to try and fill some of it in.
There is no record of the expectations for deltasync in a multiprovider
environment, so it is probably worth putting down what they are from my
point of view - not necessarily Howard's, and he is the one who
actually wrote the thing.
The intention is convergence[0] first and foremost, while sending the
changes rather than the full entries.
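To make the distinction concrete, here is a sketch (illustrative
Python; the payload shapes are made up, loosely modelled on the
accesslog reqMod value format, not the actual wire structures):

    # Plain syncrepl ships the whole entry after the change.
    full_entry = {
        "dn": "uid=jdoe,ou=people,dc=example,dc=com",
        "uid": ["jdoe"],
        "cn": ["John Doe"],
        "mail": ["jdoe@example.com"],          # only this changed
        "telephoneNumber": ["+1 555 0100"],
    }

    # Deltasync ships only the change, as recorded in the accesslog.
    delta = {
        "reqDN": "uid=jdoe,ou=people,dc=example,dc=com",
        "reqType": "modify",
        "reqMod": ["mail:= jdoe@example.com"],  # replace one attribute
    }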
Since conflicting writes will always happen and each node only has its
own view of the DB at the time, different nodes might make different
decisions. In deltasync, each node records the final modification into
its own accesslog; these records might differ between nodes, and this
cascades, bounded only by the number of hosts involved[1]. In the end,
we need all hosts, reading various versions and subsections of the log
in varying orders, to always converge.
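A toy sketch of the above (illustrative Python, not the actual conflict
resolution code): two providers see the same pair of conflicting writes
in opposite orders, each logs only what it actually applied, and so the
two logs legitimately differ even though the content converges:

    # Hypothetical last-writer-wins resolution keyed on the CSN.
    def apply_and_log(log, entry, mod_csn, attr, value, current_csn):
        """Apply a replace only if it is newer (CSN-wise) than what we
        hold, recording the final modification we actually made."""
        if mod_csn > current_csn:      # CSN strings compare lexically
            entry[attr] = value
            log.append((mod_csn, attr, value))
            return mod_csn
        return current_csn             # older write loses, not logged

    # Conflicting replaces originating at SID 001 and SID 002.
    w1 = ("20220614114056.100000Z#000000#001#000000", "mail", "a@example.com")
    w2 = ("20220614114056.200000Z#000000#002#000000", "mail", "b@example.com")

    e1, e2, log1, log2 = {}, {}, [], []

    csn = apply_and_log(log1, e1, *w1, current_csn="")   # provider 1: w1
    csn = apply_and_log(log1, e1, *w2, current_csn=csn)  # ...then w2
    csn = apply_and_log(log2, e2, *w2, current_csn="")   # provider 2: w2
    csn = apply_and_log(log2, e2, *w1, current_csn=csn)  # ...then w1

    assert e1 == e2      # the content converges...
    assert log1 != log2  # ...even though the two accesslogs differ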
A long-standing expectation is that the accesslog be written in order
and relayed in the same order as written[2], implicitly assuming that
CSNs for each SID will always be stored in non-descending order. This
is why some backends (e.g. back-ldif) are not suited to holding
accesslog DBs. This non-descending storage expectation might need to be
revisited at some point; hopefully not.
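For illustration, a check one could run over a dumped log to verify
that assumption holds (a sketch; the (SID, CSN) pairs are assumed
already extracted, in storage order):

    from collections import defaultdict

    def check_non_descending(entries):
        """entries: iterable of (sid, csn) in the order the accesslog
        stores (and would relay) them. CSN strings are designed so
        that plain string comparison orders them correctly."""
        last = defaultdict(str)
        for sid, csn in entries:
            if csn < last[sid]:
                raise ValueError(f"CSN went backwards for SID {sid}: "
                                 f"{last[sid]} -> {csn}")
            last[sid] = csn

    # Interleaving SIDs is fine, as long as no SID's CSNs step back.
    check_non_descending([
        ("001", "20220614114056.100000Z#000000#001#000000"),
        ("002", "20220614114056.050000Z#000000#002#000000"),
        ("001", "20220614114057.000000Z#000000#001#000000"),
    ])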
Another expectation is that fallback behaviour be both graceful and
efficient: similarly to convergence, sessions will eventually move back
to deltasync if, at an arbitrary point, we were to stop introducing
*conflicting* changes into the environment. At the same time, for the
sake of convergence, we need to be tolerant of some or all links
running a plain syncrepl refresh at some points in time.
We have to expect to be running in a real-world environment, where
arbitrary[3] topologies might be in place and any number of links
and/or nodes can be out of commission for any amount of time. When
isolated nodes rejoin, they should be able to converge eventually,
regardless of how long the isolation/partition lasted. Still, we can't
require that the accesslog be unbounded in size, so we need to be able
to detect when we no longer retain the relevant data and work around
it[4].
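One way to picture that detection (a sketch, not the actual syncprov
logic; the cookie and log summaries are assumed already extracted):

    def can_serve_delta(consumer_cookie, oldest_retained):
        """consumer_cookie: {sid: newest CSN the consumer has seen}
        oldest_retained: {sid: oldest CSN still in our accesslog}"""
        for sid, oldest in oldest_retained.items():
            # A cookie older than the oldest change we still hold for
            # this SID means changes the consumer needs may have been
            # purged: fall back to a plain refresh rather than replay
            # an incomplete log.
            if consumer_cookie.get(sid, "") < oldest:
                return False
        return True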
Each node's accesslog DB should always be self-consistent: if a
read-only consumer starts with the same DB the provider had at some
point, it shall always be able to replay the provider's accesslog
cleanly, regardless of what kinds of conflict resolution the provider
had to go through. N.B. if it is impossible to write a self-consistent
accesslog in certain situations, it is OK to pretend that certain parts
of the accesslog have already been purged, e.g. by attaching
meta-information understood by syncprov to that effect.
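Stated as a check, the property looks something like this (an
idealised sketch, with the operations reduced to simple replaces):

    def replay_is_clean(snapshot, log, current):
        """Replaying the accesslog over an old snapshot of the DB must
        reproduce the provider's current content. 'snapshot'/'current'
        map DNs to attribute dicts; 'log' holds (dn, attr, value)."""
        db = {dn: dict(entry) for dn, entry in snapshot.items()}
        for dn, attr, value in log:
            db.setdefault(dn, {})[attr] = value
        return db == current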
Regardless of the promises stated above, we should also expect
administrators deploying any multiprovider environment to actively
monitor it. Just like with backups, if replication is not checked
routinely, it almost always turns out to be broken when you actually
need it. There are multiple resources on how to do this, and more
tools can and will be developed as the need is identified.
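For instance, a trivial lag check along these lines (a sketch; the
contextCSN sets are assumed already fetched from each provider's
suffix entry, e.g. with ldapsearch) already catches most stuck links:

    def replication_lag(contextcsns):
        """contextcsns: {server: {sid: csn}} as read from each
        provider. Returns, per server, the SIDs it has not caught up
        on and the newest CSN seen anywhere for each of them."""
        newest = {}
        for csns in contextcsns.values():
            for sid, csn in csns.items():
                newest[sid] = max(newest.get(sid, ""), csn)
        return {server: {sid: csn for sid, csn in newest.items()
                         if csns.get(sid, "") < csn}
                for server, csns in contextcsns.items()}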
There are also some non-expectations, generally shared with plain
syncrepl anyway:
- If a host or network link is underpowered for the volume of changes
coming in, it might fall behind. This doesn't affect eventual
convergence[0]; it is up to the administrator to size their
environment correctly.
- Any host configured to accept writes will do so, allowing conflicts
to arise. Any or all of these writes might be (partially) reverted in
the face of conflicting writes elsewhere in the environment; note that
this is already the case with plain syncrepl.
- We do not aim to minimise the number of "redundant" messages passed
if there are multiple paths between nodes. LDAP semantics do not allow
this to be done safely with a CSN-based replication system.
I hope I haven't missed anything important.
[0]. Let's take the usual definition of eventual convergence - if, at
an arbitrary point, we were to stop introducing new changes to the
environment and restore connectivity, all participating nodes would
arrive at identical content in a finite number of steps (and there is
a way to tell when that has happened)
[1]. Contrast this with "log replication" in Raft et al., where all
members of a cluster coordinate to build a shared view of the
actual history, not accepting a change until it has been accepted
by a majority
[2]. If this assumption is violated, as in ITS#9358, the consumer will
have to skip some legitimate operations and will diverge
[3]. We can still assume that in the designed topology all the nodes
that accept write operations belong to the same strongly connected
component
[4]. This is the assumption that was at the core of the issue described
in ITS#9823
--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP