[ 
https://issues.apache.org/jira/browse/IGNITE-23910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-23910:
---------------------------------------
    Epic Link: IGNITE-22905

> Ability to note side effects produced by diverged Metastorage revisions which 
> were lost
> ---------------------------------------------------------------------------------------
>
>                 Key: IGNITE-23910
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23910
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>
> The following is possible:
>  # The Metastorage majority goes down (maybe along with other nodes that 
> don't vote in the Metastorage)
>  # One of those nodes (named A) had a command X in its Metastorage Raft log 
> that no remaining node has in its log. Before crashing, A had applied this 
> command to the Metastorage (this was not flushed to disk), then some 
> side effect was produced via a Metastorage watch, and that side effect WAS 
> flushed to disk
>  # User repairs Metastorage on the remaining nodes (they don't contain X in 
> their logs and did not apply it)
>  # A is brought back online. X was the only diverging command, and the fact 
> of its existence was not saved durably to the underlying Metastorage storage 
> (it was saved to the Raft log on A, but we do not look at logs during the 
> reentry procedure). As such, the Metastorage checksum on A matches the 
> checksum of the new Metastorage leader (say, B)
>  # As a result, we allow node A to join the new cluster, even though it has 
> a side effect of X persisted
> Such a side effect could be a tuple written in the new schema if X was an 
> ALTER TABLE command; A would contain this tuple version, while the other 
> nodes would have no idea that this version exists. This is an inconsistency 
> we want to avoid.
> Possible ways to solve this:
>  # Syncing each checksum write. This solves the problem completely, but it 
> turned out to be too costly; we tried this and reverted it
>  # Syncing the checksums column family before flushing partitions to disk. 
> This makes sure that the most frequent and important case, writing a tuple 
> with a new schema that no one else in the cluster knows about (described 
> above), is handled. However, other possible persistent side effects (like 
> writing to the Vault from a Metastorage watch) are not covered; but those 
> cases might not be that important
>  # Checksumming Raft log entries instead of revisions (with revisions, we 
> store checksums in the storage itself; with log entries, we would only 
> checksum entries containing commands that can possibly influence the 
> Metastorage content, like puts and invokes, but not safe-time propagation 
> or compaction commands). This would require tight integration with JRaft; 
> also, when checking for divergence (when a node that did not witness a 
> Metastorage repair reenters the cluster), we would have to be very 
> pessimistic and take into account even log entries that were never applied, 
> so we could get false positives
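
The false negative in steps 2-5, and why option 2 closes the main gap, can be
sketched as a toy simulation. All names here (Node, flush_checksum, the chained
SHA-256) are illustrative, not Ignite APIs: the point is only the ordering of
flushes relative to the crash.

```python
import hashlib

def chain_checksum(prev: str, command: str) -> str:
    """Toy revision checksum: chain the previous checksum with the command."""
    return hashlib.sha256((prev + command).encode()).hexdigest()

class Node:
    def __init__(self):
        self.mem_checksum = "genesis"    # in-memory checksum, lost on crash
        self.disk_checksum = "genesis"   # persisted checksum
        self.disk_side_effects = []      # e.g. tuples flushed by a watch

    def apply(self, command: str):
        self.mem_checksum = chain_checksum(self.mem_checksum, command)

    def flush_checksum(self):
        self.disk_checksum = self.mem_checksum

    def flush_side_effect(self, effect: str):
        self.disk_side_effects.append(effect)

    def crash(self):
        self.mem_checksum = self.disk_checksum  # volatile state is lost

# Node A applies the diverging command X; the watch's side effect is flushed,
# the checksum is not. B never sees X.
a, b = Node(), Node()
a.apply("X")
a.flush_side_effect("tuple@new-schema")
a.crash()

# The rejoin check compares persisted checksums: they match, so A is let in
# even though a side effect of X is durably persisted on it.
assert a.disk_checksum == b.disk_checksum
assert a.disk_side_effects == ["tuple@new-schema"]

# Option 2: sync the checksum column family before the partition flush.
a2, b2 = Node(), Node()
a2.apply("X")
a2.flush_checksum()                        # synced first...
a2.flush_side_effect("tuple@new-schema")   # ...then the side effect
a2.crash()

# Now the divergence is visible on rejoin.
assert a2.disk_checksum != b2.disk_checksum
```

As the last assertion shows, option 2 works purely by ordering: it does not
sync more often than before, it only guarantees the checksum reaches disk no
later than the side effect it covers.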



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
