Anton,

Automatic quorum-based partition drop may work as a partial workaround for IGNITE-10078, but the discussed approach certainly doesn't replace the IGNITE-10078 activity. We still don't know what to do when a quorum can't be reached (e.g. 2 partition copies have hash X and 2 have hash Y), and keeping extended update counters is the only way to resolve such a case. On the other hand, validation of precalculated partition hashes on PME can be a good addition to the IGNITE-10078 logic: we'll be able to detect situations when extended update counters are equal, but for some reason (a bug or whatever) the partition contents are different.
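
Just to illustrate the tie problem with a quorum strategy (a rough sketch, the method and its names are hypothetical): a vote can only resolve a partition when one hash holds a strict majority among the owners.

import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

/** Returns the winning hash, or null when there is no strict majority
    (e.g. 2 owners report hash X and 2 report hash Y) - the case that
    still needs IGNITE-10078 extended update counters to be resolved. */
static Integer resolveByQuorum(Map<UUID, Integer> hashByOwner) {
    Map<Integer, Long> votes = hashByOwner.values().stream()
        .collect(Collectors.groupingBy(h -> h, Collectors.counting()));

    return votes.entrySet().stream()
        .filter(e -> e.getValue() * 2 > hashByOwner.size()) // strict majority
        .map(Map.Entry::getKey)
        .findFirst()
        .orElse(null);
}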

Best Regards,
Ivan Rakov

On 06.05.2019 12:27, Anton Vinogradov wrote:
Ivan, just to make sure ...
The discussed approach will fully solve the issue [1] provided we also add
some strategy to reject partitions with missed updates (updateCnt == Ok,
Hash != Ok).
For example, we may use a Quorum strategy, where the majority wins.
Sounds correct?

[1] https://issues.apache.org/jira/browse/IGNITE-10078

On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org> wrote:

Ivan,

Thanks for the detailed explanation.
I'll try to implement the PoC to check the idea.

On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:

But how to keep this hash?
I think we can just adopt the way partition update counters are stored.
Update counters are:
1) Kept and updated on heap, see
IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during
regular cache operations, no page replacement latency issues)
2) Synchronized with page memory (and with disk) on every checkpoint,
see GridCacheOffheapManager#saveStoreMetadata
3) Stored in the partition meta page, see PagePartitionMetaIO#setUpdateCounter
4) On node restart, we init the onheap counter with the value from disk (as
of the last checkpoint) and update it to the latest value during WAL
logical records replay
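
If we follow the same scheme for a per-partition hash, it could look roughly
like the sketch below (the class and all its members are hypothetical, they
don't exist in Ignite):

import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder that mirrors how pCntr is handled. */
class PartitionHashHolder {
    /** Kept and updated on heap during regular cache operations, like pCntr. */
    private final AtomicLong partHash = new AtomicLong();

    /** Called on every insert/update/remove with the hash delta of the row. */
    void onUpdate(long delta) {
        partHash.addAndGet(delta);
    }

    /** Called from checkpoint (analogous to saveStoreMetadata): the returned
        value would be written to the partition meta page, e.g. via a new
        PagePartitionMetaIO field. */
    long valueForCheckpoint() {
        return partHash.get();
    }

    /** Called on node restart: init from the meta page value (as of the last
        checkpoint), then apply deltas while replaying WAL logical records. */
    void initOnRestart(long hashFromMetaPage) {
        partHash.set(hashFromMetaPage);
    }
}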

2) PME is a rare operation on a production cluster, but it seems we have
to check consistency on a regular basis.
Since we have to finish all operations before the check, should we
have a fake PME for a maintenance check in this case?
  From my experience, PME happens on production clusters from time to time
(several times per week), which can be enough. If consistency needs to be
checked more often than regular PMEs occur, we can implement a command
that will trigger a fake PME for consistency checking.

Best Regards,
Ivan Rakov

On 29.04.2019 18:53, Anton Vinogradov wrote:
Ivan, thanks for the analysis!

With a pre-calculated partition hash value, we can
automatically detect inconsistent partitions on every PME.
Great idea, it seems this covers all broken sync cases.

It will check alive nodes immediately in case the primary fails,
and will check a rejoining node once it finishes rebalance (PME on
becoming an owner).
A recovered cluster will be checked on the activation PME (or even before
that?).
Also, a warmed-up cluster will still be warm after the check.

Have I missed any cases leading to broken sync, other than bugs?

1) But how to keep this hash?
- It should be automatically persisted on each checkpoint (it should
not require recalculation on restore; snapshots should be covered too)
(and covered by WAL?).
- It should always be available in RAM for every partition (even for
cold partitions never updated/read on this node) so it can be used
immediately once all operations are done on PME.

Can we have special pages to keep such hashes and never allow their
eviction?

2) PME is a rare operation on a production cluster, but it seems we have
to check consistency on a regular basis.
Since we have to finish all operations before the check, should we
have a fake PME for a maintenance check in this case?

On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:

     Hi Anton,

     Thanks for sharing your ideas.
     I think your approach should work in general. I'll just share my
     concerns about possible issues that may come up.

     1) Equality of update counters doesn't imply equality of partition
     contents under load.
     For every update, the primary node generates an update counter, and
     then the update is delivered to the backup node and gets applied with
     the corresponding update counter. For example, two transactions (A
     and B) update partition X in the following scenario:
     - A updates key1 in partition X on the primary node and increments
     the counter to 10
     - B updates key2 in partition X on the primary node and increments
     the counter to 11
     - While A is still updating other keys, B is finally committed
     - The update of key2 arrives at the backup node and sets the update
     counter to 11
     An observer will see equal update counters (11), but the update of
     key1 is still missing in the backup partition.
     This is a fundamental problem which is being solved here:
     https://issues.apache.org/jira/browse/IGNITE-10078
     "Online verify" should operate with the new complex update counters
     which take such "update holes" into account. Otherwise, online
     verify may produce false-positive inconsistency reports.
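
     Just to illustrate the idea behind such counters (this is not the
     actual IGNITE-10078 implementation, only a hedged sketch): the counter
     has to remember out-of-order applied updates, so an observer can tell
     that 11 has been applied while 10 is still missing, instead of just
     seeing 11.

     import java.util.Set;
     import java.util.TreeSet;

     class ExtendedUpdateCounter {
         /** Highest counter such that no updates below it are missing. */
         private long lwm;

         /** Counters applied out of order ("holes" exist below them). */
         private final Set<Long> outOfOrder = new TreeSet<>();

         synchronized void apply(long cntr) {
             if (cntr == lwm + 1) {
                 lwm = cntr;

                 // Absorb adjacent out-of-order updates, if any.
                 while (outOfOrder.remove(lwm + 1))
                     lwm++;
             }
             else if (cntr > lwm)
                 outOfOrder.add(cntr);
         }

         /** True if some update with a smaller counter is still missing. */
         synchronized boolean hasHoles() {
             return !outOfOrder.isEmpty();
         }
     }

     In the A/B example above, the backup would report lwm = 9 with 11 in
     the out-of-order set, which is clearly distinguishable from the
     primary's state.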

     2) Acquisition and comparison of update counters is fast, but
     partition hash calculation takes a long time. We should check that the
     update counter remains unchanged after every K keys handled.
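
     For example, the hash calculation loop could recheck the counter
     periodically. A rough sketch (a fragment of a hypothetical hash
     calculation task; "part" and "KEYS_PER_CHECK" are made-up names, the
     rest reuses the variables from the "idle verify" snippet quoted in the
     original message below):

     long cntrBefore = part.updateCounter();
     long partHash = 0;
     int handled = 0;

     while (it.hasNextX()) {
         CacheDataRow row = it.nextX();

         partHash += row.key().hashCode();
         partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));

         // Abort and retry this partition later if it was concurrently updated.
         if (++handled % KEYS_PER_CHECK == 0 && part.updateCounter() != cntrBefore)
             return null;
     }

     // The counter must not have moved during the whole scan either.
     if (part.updateCounter() != cntrBefore)
         return null;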

     3)

     Another hope is that we'll be able to pause/continue the scan; for
     example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and in
     three days we'll check the whole cluster.
     Totally makes sense.
     We may find ourselves in a situation where some "hot" partitions
     are still unprocessed, and every next attempt to calculate the
     partition hash fails due to another concurrent update. We should
     be able to track the progress of validation (% of calculation time
     wasted due to concurrent operations may be a good metric, 100% is
     the worst case) and provide an option to stop/pause the activity.
     I think pause should return an "intermediate results report" with
     information about which partitions have been successfully checked.
     With such a report, we can resume the activity later: partitions from
     the report will simply be skipped.
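
     A possible shape of such a report (all names are hypothetical, just a
     sketch, not an existing Ignite API):

     import java.io.Serializable;
     import java.util.Collections;
     import java.util.HashMap;
     import java.util.Map;
     import java.util.Set;

     class VerifyProgressReport implements Serializable {
         /** Cache group id -> partitions whose hashes were already compared. */
         Map<Integer, Set<Integer>> verifiedParts = new HashMap<>();

         /** Share of calculation time wasted on retries caused by concurrent
             updates; 1.0 means the check is making no progress at all. */
         double wastedTimeRatio;

         /** On resume, partitions for which this returns true are skipped. */
         boolean isVerified(int grpId, int partId) {
             return verifiedParts.getOrDefault(grpId, Collections.emptySet())
                 .contains(partId);
         }
     }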

     4)

     Since "Idle verify" uses regular pagmem, I assume it replaces hot
     data with persisted.
     So, we have to warm up the cluster after each check.
     Are there any chances to check without cooling the cluster?
     I don't see an easy way to achieve this with our page memory
     architecture. We definitely can't just read pages from disk
     directly: we need to synchronize page access with concurrent
     update operations and checkpoints.
     From my point of view, the correct way to solve this issue is to
     improve our page replacement [1] mechanics by making it truly
     scan-resistant.

     P.S. There's another possible way of achieving online verify:
     instead of on-demand hash calculation, we can always keep an
     up-to-date hash value for every partition. We'll need to update the
     hash on every insert/update/remove operation, but there will be no
     reordering issues, since the function that we use for aggregating
     hash results (+) is commutative. With a pre-calculated partition
     hash value, we can automatically detect inconsistent partitions on
     every PME. What do you think?
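
     A minimal sketch of what that maintenance could look like (hypothetical
     code that reuses the same row hash as "idle verify"; partHash would be
     the per-partition onheap value that gets checkpointed as discussed
     above):

     /** Called for every insert/update/remove applied to the partition. */
     void onRowChanged(CacheDataRow oldRow, CacheDataRow newRow) {
         long delta = 0;

         if (oldRow != null)
             delta -= rowHash(oldRow); // removed row or old version of an updated row

         if (newRow != null)
             delta += rowHash(newRow); // inserted row or new version of an updated row

         // Addition is commutative and associative, so primary and backups
         // may apply the same set of updates in any order and still converge
         // to the same partition hash.
         partHash.addAndGet(delta);
     }

     long rowHash(CacheDataRow row) {
         return row.key().hashCode()
             + Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
     }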

     [1] -

https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
     Best Regards,
     Ivan Rakov

     On 29.04.2019 12:20, Anton Vinogradov wrote:
     Igniters and especially Ivan Rakov,

     "Idle verify" [1] is a really cool tool, to make sure that
     cluster is consistent.

     1) But it requires operations to be paused during the cluster check.
     On some clusters, this check takes hours (3-4 hours in cases I saw).
     I've checked the code of "idle verify" and it seems possible
     to make it "online" with some assumptions.

     Idea:
     Currently "Idle verify" checks that partitions hashes, generated
     this way
     while (it.hasNextX()) {
     CacheDataRow row = it.nextX();
     partHash += row.key().hashCode();
     partHash +=

  Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
     }
     , are the same.

     What if we generate the same updateCounter-partitionHash pairs but
     compare hashes only when the counters are the same?
     So, for example, we'll ask the cluster to generate pairs for 64
     partitions, then find that 55 have the same counters (were not
     updated during the check) and check them.
     The remaining (64 - 55 = 9) partitions will be re-requested and
     rechecked along with an additional 55.
     This way we'll be able to check that the cluster is consistent even
     while operations are in progress (just retrying the modified ones).
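
     A rough sketch of that retry loop (all names are hypothetical, nothing
     here is existing API):

     Set<Integer> toCheck = new HashSet<>(allPartitions);

     while (!toCheck.isEmpty()) {
         // partId -> one pair per owning node; long[0] = updateCounter, long[1] = partHash.
         Map<Integer, List<long[]>> pairs = requestCntrAndHashPairs(toCheck);

         for (Map.Entry<Integer, List<long[]>> e : pairs.entrySet()) {
             boolean cntrsEqual = e.getValue().stream()
                 .mapToLong(p -> p[0]).distinct().count() == 1;

             if (cntrsEqual) {
                 // Partition was not updated during the check - compare hashes
                 // across owners and report an inconsistency if they differ.
                 compareHashes(e.getKey(), e.getValue());

                 toCheck.remove(e.getKey());
             }
             // else: the partition was modified, keep it for the next pass.
         }
     }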

     Risks and assumptions:
     Using this strategy we'll check the cluster's consistency ...
     eventually, and the check will take more time even on an idle
     cluster.
     In case operationsPerTimeToGeneratePartitionHashes >
     partitionsCount, we'll definitely make no progress.
     But if the load is not high, we'll be able to check the whole
     cluster.

     Another hope is that we'll be able to pause/continue the scan; for
     example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and in
     three days we'll check the whole cluster.

     Have I missed something?

     2) Since "Idle verify" uses regular pagmem, I assume it replaces
     hot data with persisted.
     So, we have to warm up the cluster after each check.
     Are there any chances to check without cooling the cluster?

     [1]

https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
