[
https://issues.apache.org/jira/browse/IGNITE-23910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-23910:
---------------------------------------
Description:
The following is possible:
# The Metastorage majority goes down (possibly along with other nodes that don't vote in the Metastorage)
# One of those nodes (call it A) had a command X in its Metastorage Raft log that not a single remaining node has in its log. Before crashing, A had applied this command to the Metastorage (this was not flushed to disk), then some side effect was produced via a Metastorage watch, and that side effect WAS flushed to disk
# The user repairs the Metastorage on the remaining nodes (they don't contain X in their logs and never applied it)
# A is brought back online. X was the only diverging command, and the fact of its existence was not saved durably to the Metastorage's underlying storage (it was saved to the Raft log on A, but we do not look at logs during the re-entry procedure). As such, the Metastorage checksum on A matches the checksum of the new Metastorage leader (say, B)
# As a result, we allow node A to join the new cluster, even though it has some side effect of X persisted
Such a side effect could be a tuple written with a new schema version if X was an ALTER TABLE command; A would contain this tuple version, but the other nodes would have no idea about this new version. This is an inconsistency we want to avoid.
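The failure mode above can be modeled with a toy sketch. All names here are hypothetical, and a plain CRC32 stands in for whatever checksum the Metastorage actually uses; the point is only that a re-entry check comparing persisted per-revision checksums cannot see a command whose existence was recorded nowhere but the Raft log:

```java
import java.util.zip.CRC32;

public class DivergenceCheckSketch {
    /** Rolling checksum over applied commands, as if stored per revision. */
    static long checksumAfter(long prev, String command) {
        CRC32 crc = new CRC32();
        crc.update(Long.toString(prev).getBytes());
        crc.update(command.getBytes());
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Last revision that was flushed on all nodes before the crash.
        long common = checksumAfter(0, "PUT k1 v1");

        // Node A applies the diverging command X in memory only; the checksum
        // for that revision is never flushed, but a watch-produced side effect is.
        long aInMemory = checksumAfter(common, "ALTER TABLE t ADD COLUMN c");
        boolean sideEffectFlushed = true;

        // A crashes: its persisted checksum is still `common`. After the repair,
        // the new leader B's checksum at that revision is also `common`.
        long aPersisted = common;
        long leaderB = common;

        // The re-entry check compares persisted checksums only, so A is admitted
        // even though a side effect of X survives on A's disk.
        boolean joinAllowed = (aPersisted == leaderB);
        System.out.println("join allowed = " + joinAllowed
                + ", side effect persisted = " + sideEffectFlushed
                + ", in-memory checksum diverged = " + (aInMemory != leaderB));
    }
}
```

The sketch reports that the join is allowed while a persisted side effect of the lost command X remains, which is exactly the window described above.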
Possible ways to solve this:
# Syncing every checksum write. This solves the problem completely, but, as it turned out, it is too costly. We tried this and reverted it
# Syncing the checksums column family before flushing partitions to disk. This makes sure that the most frequent and important case (described above), writing a tuple with a new schema that no one else in the cluster knows about, is handled. However, other possible persistent side effects (like writing to the Vault from a Metastorage watch) are not covered; these cases might not be that important, though
# Checksumming not revisions (where we store checksums in the storage itself), but Raft log entries (where we checksum only entries containing commands that can possibly influence the Metastorage content, like puts and invokes, but not safe-time propagation and compaction commands). This would require tight integration with JRaft; also, when checking for divergence (when a node that did not witness a Metastorage repair re-enters the cluster), we would have to be very pessimistic and take into account even log entries that were never applied, so we could get false positives
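Option 3 can be illustrated with a small sketch. Everything here is hypothetical (the command kinds, CRC32 as the checksum, the entry layout); it only shows the filtering idea: two logs that differ in bookkeeping traffic (safe-time propagation, compaction) but carry the same content-affecting commands produce the same checksum:

```java
import java.util.List;
import java.util.zip.CRC32;

public class LogEntryChecksumSketch {
    // Hypothetical classification of Metastorage Raft commands.
    enum Kind { PUT, INVOKE, SAFE_TIME_SYNC, COMPACTION }

    record Entry(Kind kind, String payload) {}

    /**
     * Checksums only entries that can change the Metastorage content
     * (puts and invokes), skipping safe-time and compaction commands.
     */
    static long checksum(List<Entry> log) {
        CRC32 crc = new CRC32();
        for (Entry e : log) {
            if (e.kind() == Kind.PUT || e.kind() == Kind.INVOKE) {
                crc.update(e.kind().name().getBytes());
                crc.update(e.payload().getBytes());
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        List<Entry> nodeA = List.of(
                new Entry(Kind.PUT, "k1=v1"),
                new Entry(Kind.SAFE_TIME_SYNC, "t=10"),
                new Entry(Kind.PUT, "k2=v2"));
        // Same content-affecting commands, different bookkeeping traffic.
        List<Entry> nodeB = List.of(
                new Entry(Kind.PUT, "k1=v1"),
                new Entry(Kind.PUT, "k2=v2"),
                new Entry(Kind.COMPACTION, "rev<=1"));
        System.out.println("checksums match = " + (checksum(nodeA) == checksum(nodeB)));
        // prints: checksums match = true
    }
}
```

Note the pessimism the description mentions: a put that sits in a node's log but was never applied would still feed this checksum, so a diverged-but-unapplied entry flags the node even though its storage content is identical, a false positive.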
was:
The following is possible:
# The Metastorage majority goes down (possibly along with other nodes that don't vote in the Metastorage)
# One of those nodes (call it A) had a command X in its Metastorage Raft log that not a single remaining node has in its log. Before crashing, A had applied this command to the Metastorage (this was not flushed to disk), then some side effect was produced via a Metastorage watch, and that side effect WAS flushed to disk
# The user repairs the Metastorage on the remaining nodes (they don't contain X in their logs and never applied it)
# A is brought back online. X was the only diverging command, and the fact of its existence was not saved durably to the Metastorage's underlying storage (it was saved to the Raft log on A, but we do not look at logs during the re-entry procedure). As such, the Metastorage checksum on A matches the checksum of the new Metastorage leader (say, B)
# As a result, we allow node A to join the new cluster, even though it has some side effect of X persisted
Such a side effect could be a tuple written with a new schema version if X was an ALTER TABLE command; A would contain this tuple version, but the other nodes would have no idea about this new version. This is an inconsistency we want to avoid.
To be continued...
> Ability to note side effects produced by diverged Metastorage revisions which
> were lost
> ---------------------------------------------------------------------------------------
>
> Key: IGNITE-23910
> URL: https://issues.apache.org/jira/browse/IGNITE-23910
> Project: Ignite
> Issue Type: Improvement
> Reporter: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)