[ 
https://issues.apache.org/jira/browse/IGNITE-23910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-23910:
---------------------------------------
    Description: 
The following is possible:
 # The Metastorage majority goes down (possibly along with other nodes that do not vote in the Metastorage)
 # One of the downed nodes (call it A) had a command X in its Metastorage Raft log that no remaining node has in its log. Before crashing, A had applied X to the Metastorage (the application itself was not flushed to disk), a side effect was then produced via a Metastorage watch, and that side effect WAS flushed to disk
 # The user repairs the Metastorage on the remaining nodes (their logs do not contain X, and they never applied it)
 # A is brought back online. X was the only diverging command, and the fact of its existence was never durably saved to the underlying Metastorage storage (it was saved to A's Raft log, but logs are not inspected during the reentry procedure). As a result, the Metastorage checksum on A matches the checksum of the new Metastorage leader (say, B)
 # Consequently, we allow node A to join the new cluster, even though it has a persisted side effect of X

Such a side effect could be a tuple written in a new schema version if X was an ALTER TABLE command; A would contain this tuple version, while the other nodes would have no idea that the version exists. This is the inconsistency we want to avoid.
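The scenario above can be sketched as a toy model (all names here are illustrative, not actual Ignite APIs): checksums cover only what was flushed for the Metastorage itself, so a command whose application was lost on restart, but whose watch-produced side effect WAS flushed, leaves no trace in the checksum comparison:

```java
import java.util.zip.CRC32;

// Toy model of the reentry check: node A's checksum reflects only the
// commands whose application reached disk, so a lost command X (applied
// in memory, never flushed) is invisible to it - even though X's
// watch-produced side effect was flushed durably.
public class DivergenceSketch {
    // Rolling checksum over the sequence of durably applied commands.
    static long checksumOf(String... flushedCommands) {
        CRC32 crc = new CRC32();
        for (String cmd : flushedCommands) {
            crc.update(cmd.getBytes());
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        // A applied X after "common-prefix", but X's application never
        // reached disk; only the side effect (e.g. a tuple in a new
        // schema version) did.
        long checksumOnA = checksumOf("common-prefix"); // X is missing here
        boolean sideEffectOfXOnDiskAtA = true;

        // The new leader B never saw X at all.
        long checksumOnB = checksumOf("common-prefix");

        // The reentry check compares checksums only, so A is admitted...
        assert checksumOnA == checksumOnB;
        // ...even though it carries a side effect no other node knows about.
        assert sideEffectOfXOnDiskAtA;
        System.out.println("A passes the checksum check despite a persisted side effect of X");
    }
}
```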

Possible ways to solve this:
 # Syncing each checksum write. This solves the problem completely, but, as it turned out, it is too costly; we tried this approach and reverted it
 # Syncing the checksums column family before flushing partitions to disk. This ensures that the most frequent and important case (described above: writing a tuple with a new schema version that no one else in the cluster knows about) is covered. Other possible persistent side effects (such as writing to the Vault from a Metastorage watch) would not be covered, but those cases might be less important
 # Checksumming Raft log entries instead of revisions (currently, checksums are stored in the storage itself). We would only checksum entries containing commands that can influence the Metastorage content, such as puts and invokes, but not safe-time propagation or compaction commands. This requires tight integration with JRaft; also, when checking for divergence (when a node that did not witness a Metastorage repair reenters the cluster), we would have to be very pessimistic and account even for log entries that were never applied, so false positives are possible
 # Investigating whether we can write a dedicated storage just for Metastorage checksums; it might be able to fsync each individual checksum quickly, which would make option 1 viable
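The ordering invariant behind option 2 can be sketched as follows (the method names are hypothetical, not actual Ignite APIs): the checksums column family is synced BEFORE any partition data produced by the same revision is flushed. Under that ordering, a durable side effect implies a durable checksum recording the command that caused it, so the reentry check can detect the divergence:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option 2's write-ordering invariant: never flush partition
// data for a revision whose checksum is not yet durable. Then, if a
// side effect of command X is on disk, X is reflected in the checksum,
// and a diverged node can no longer pass the reentry check unnoticed.
public class OrderedFlushSketch {
    final List<String> durableLog = new ArrayList<>(); // what reached disk, in order

    void fsyncChecksums(long revision) {
        durableLog.add("checksum@" + revision);
    }

    void flushPartitions(long revision) {
        durableLog.add("partitions@" + revision);
    }

    // The invariant, enforced by construction: checksum first, data second.
    void flushRevision(long revision) {
        fsyncChecksums(revision);   // 1. make the checksum durable first
        flushPartitions(revision);  // 2. only then persist the side effects
    }

    public static void main(String[] args) {
        OrderedFlushSketch node = new OrderedFlushSketch();
        node.flushRevision(42);
        // If partition data for revision 42 is on disk, its checksum is too.
        int cs = node.durableLog.indexOf("checksum@42");
        int pt = node.durableLog.indexOf("partitions@42");
        assert cs >= 0 && cs < pt;
        System.out.println("durable write order: " + node.durableLog);
    }
}
```

Note that this only narrows the window: side effects written outside the partition flush path (e.g. to the Vault) would still escape the invariant, which is exactly the caveat stated in option 2.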


> Ability to note side effects produced by diverged Metastorage revisions which 
> were lost
> ---------------------------------------------------------------------------------------
>
>                 Key: IGNITE-23910
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23910
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
