[
https://issues.apache.org/jira/browse/IGNITE-23910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-23910:
---------------------------------------
Epic Link: IGNITE-22905
> Ability to note side effects produced by diverged Metastorage revisions which
> were lost
> ---------------------------------------------------------------------------------------
>
> Key: IGNITE-23910
> URL: https://issues.apache.org/jira/browse/IGNITE-23910
> Project: Ignite
> Issue Type: Improvement
> Reporter: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
>
> The following is possible:
> # The Metastorage majority goes down (possibly along with other nodes that
> don't vote in Metastorage)
> # One of those nodes (named A) had a command X in its Metastorage Raft log that
> no remaining node has in its log. Before crashing, A had applied this command to
> the Metastorage (the application itself was not flushed to disk), a side effect
> was then produced via a Metastorage watch, and that side effect WAS flushed to
> disk
> # The user repairs the Metastorage on the remaining nodes (their logs do not
> contain X, and they never applied it)
> # A is brought back online. X was the only diverging command, and the fact of
> its existence was never durably saved to the underlying Metastorage storage (it
> was saved to the Raft log on A, but we do not look at logs during the reentry
> procedure). As such, the Metastorage checksum on A matches the checksum of the
> new Metastorage leader (say, B)
> # As a result, we allow node A to join the new cluster, even though it has a
> persisted side effect of X
> Such a side effect could be a tuple written under a new schema version if X was
> an ALTER TABLE command; A would contain this tuple version, while the other
> nodes would have no idea this version exists. This is an inconsistency we want
> to avoid.
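The scenario above can be sketched with a toy chained checksum (CRC32 chaining is assumed here purely for illustration; Ignite's actual checksum algorithm may differ). The point is that A's durable checksum state never recorded X, so it matches B's, even though a side effect of X survives on A's disk:

```python
from zlib import crc32

def rev_checksum(prev: int, command: bytes) -> int:
    # Chain the previous revision's checksum with the next applied command.
    return crc32(command, prev)

# Commands present in every surviving node's log (payloads are made up).
common = [b"put k1 v1", b"put k2 v2", b"invoke cas k1"]

# Node B, the new Metastorage leader after repair: only the common commands.
b_sum = 0
for cmd in common:
    b_sum = rev_checksum(b_sum, cmd)

# Node A also applied the divergent command X, but the checksum record for
# X's revision was never flushed, so after the crash A's durable checksum
# state covers only the common commands and is identical to B's...
a_sum = 0
for cmd in common:
    a_sum = rev_checksum(a_sum, cmd)

# ...while a side effect of X (e.g. a tuple in a new schema) WAS flushed.
a_side_effect_persisted = True

print(a_sum == b_sum)           # checksums match, so A is allowed to rejoin
print(a_side_effect_persisted)  # yet A carries data no other node knows about
```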
> Possible ways to solve this:
> # Syncing each checksum write. This solves the problem completely, but it
> turned out to be too costly; we tried this and reverted it
> # Syncing the checksums column family before flushing partitions to disk. This
> makes sure that the most frequent and important case, writing a tuple with a
> new schema that no one else in the cluster knows about (described above), is
> handled. However, other possible persistent side effects (like writing to the
> Vault from a Metastorage watch) are not covered; these cases might not be that
> important, though
> # Checksumming not revisions (where we store checksums in the storage itself),
> but Raft log entries (where we would checksum only entries containing commands
> that can possibly influence the Metastorage content, like puts and invokes,
> but not safe time propagation and compaction commands). This would require
> tight integration with JRaft; also, when checking for divergence (when a node
> that did not witness a Metastorage repair reenters the cluster), we would have
> to be very pessimistic and take into account even log entries that were never
> applied, so we could get false positives
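The third option could look roughly like the following sketch (the command kinds and names here are hypothetical, not Ignite's actual command classes). Only content-affecting entries feed the checksum, so bookkeeping entries cannot cause spurious divergence; but a divergent entry that was appended to the log and never applied still changes the result, which is the source of the false positives mentioned above:

```python
from zlib import crc32

# Hypothetical command kinds, for illustration only.
CONTENT_COMMANDS = {"PUT", "REMOVE", "INVOKE"}   # can change Metastorage content
BOOKKEEPING = {"SAFE_TIME_SYNC", "COMPACTION"}   # cannot, so they are skipped

def log_checksum(entries):
    # Checksum only the log entries that can influence Metastorage content.
    s = 0
    for kind, payload in entries:
        if kind in CONTENT_COMMANDS:
            s = crc32(payload, s)
    return s

log_b = [("PUT", b"k1=v1"), ("SAFE_TIME_SYNC", b"t=10"), ("INVOKE", b"cas k1")]
log_a = [("PUT", b"k1=v1"), ("COMPACTION", b"rev<=1"), ("INVOKE", b"cas k1")]

# Bookkeeping entries differ between the logs, but the checksums still agree.
print(log_checksum(log_a) == log_checksum(log_b))

# A divergent PUT that was appended to A's log but NEVER applied still
# changes the checksum: the pessimistic false positive described above.
log_a_divergent = log_a + [("PUT", b"k9=v9")]
print(log_checksum(log_a_divergent) == log_checksum(log_b))
```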
--
This message was sent by Atlassian Jira
(v8.20.10#820010)