Ivan,

Thanks for the detailed explanation.
I'll try to implement the PoC to check the idea.

On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:

> > But how to keep this hash?
> I think we can just adopt the way partition update counters are stored.
> Update counters are:
> 1) Kept and updated in heap, see
> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during
> regular cache operations, no page replacement latency issues)
> 2) Synchronized with page memory (and with disk) on every checkpoint,
> see GridCacheOffheapManager#saveStoreMetadata
> 3) Stored in partition meta page, see PagePartitionMetaIO#setUpdateCounter
> 4) On node restart, we init the on-heap counter with the value from disk
> (as of the last checkpoint) and update it to the latest value during WAL
> logical records replay (see the sketch below)
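>
> To illustrate, a minimal sketch of the same scheme applied to the partition
> hash (the class below is made up and doesn't exist in Ignite; the real
> change would live next to pCntr in CacheDataStoreImpl, in
> saveStoreMetadata and in PagePartitionMetaIO):
>
> import java.util.concurrent.atomic.AtomicLong;
>
> /** Hypothetical per-partition state kept the same way as the update counter. */
> class PartitionMeta {
>     private final AtomicLong updateCntr = new AtomicLong(); // analog of pCntr
>     private final AtomicLong partHash = new AtomicLong();   // new: running partition hash
>
>     /** 1) Updated in heap on every cache operation (delta computed by the write path). */
>     void onWrite(long hashDelta) {
>         updateCntr.incrementAndGet();
>         partHash.addAndGet(hashDelta);
>     }
>
>     /** 2)-3) Values written to the partition meta page on every checkpoint. */
>     long[] checkpointState() {
>         return new long[] {updateCntr.get(), partHash.get()};
>     }
>
>     /** 4) On restart: init from the meta page, then fix up during WAL logical records replay. */
>     void restore(long cntrFromDisk, long hashFromDisk) {
>         updateCntr.set(cntrFromDisk);
>         partHash.set(hashFromDisk);
>     }
> }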
>
> > 2) PME is a rare operation on a production cluster, but it seems we
> > have to check consistency on a regular basis.
> > Since we have to finish all operations before the check, should we
> > have a fake PME for the maintenance check in this case?
> From my experience, PME happens on prod clusters from time to time
> (several times per week), which may be enough. In case it's needed to
> check consistency more often than regular PMEs occur, we can implement a
> command that will trigger a fake PME for consistency checking.
>
> Best Regards,
> Ivan Rakov
>
> On 29.04.2019 18:53, Anton Vinogradov wrote:
> > Ivan, thanks for the analysis!
> >
> > >> With a pre-calculated partition hash value, we can
> > >> automatically detect inconsistent partitions on every PME.
> > Great idea; it seems this covers all broken sync cases.
> >
> > It will check the alive nodes immediately in case the primary failed,
> > and will check a rejoining node once it has finished rebalance (PME on
> > becoming an owner).
> > A recovered cluster will be checked on the activation PME (or even
> > before that?).
> > Also, a warmed-up cluster will still be warm after the check.
> >
> > Have I missed any cases that lead to broken sync, other than bugs?
> >
> > 1) But how to keep this hash?
> > - It should be automatically persisted on each checkpoint (it should
> > not require recalculation on restore; snapshots should be covered too)
> > (and covered by WAL?).
> > - It should always be available in RAM for every partition (even for
> > cold partitions never updated/read on this node) to be immediately
> > usable once all operations are done on PME.
> >
> > Can we have special pages to keep such hashes and never allow their
> > eviction?
> >
> > 2) PME is a rare operation on a production cluster, but it seems we
> > have to check consistency on a regular basis.
> > Since we have to finish all operations before the check, should we
> > have a fake PME for the maintenance check in this case?
> >
> > On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> >
> >     Hi Anton,
> >
> >     Thanks for sharing your ideas.
> >     I think your approach should work in general. I'll just share my
> >     concerns about possible issues that may come up.
> >
> >     1) Equality of update counters doesn't imply equality of
> >     partition content under load.
> >     For every update, the primary node generates an update counter, and
> >     then the update is delivered to the backup node and applied with the
> >     corresponding update counter. For example, there are two
> >     transactions (A and B) that update partition X in the following
> >     scenario:
> >     - A updates key1 in partition X on the primary node and increments
> >     the counter to 10
> >     - B updates key2 in partition X on the primary node and increments
> >     the counter to 11
> >     - While A is still updating other keys, B is finally committed
> >     - The update of key2 arrives at the backup node and sets the update
> >     counter to 11
> >     An observer will see equal update counters (11), but the update of
> >     key1 is still missing in the backup partition.
> >     This is a fundamental problem which is being solved here:
> >     https://issues.apache.org/jira/browse/IGNITE-10078
> >     "Online verify" should operate with the new complex update counters,
> >     which take such "update holes" into account. Otherwise, online
> >     verify may produce false-positive inconsistency reports.
> >
> >     2) Acquisition and comparison of update counters is fast, but
> >     partition hash calculation is slow. We should check that the update
> >     counter remains unchanged after every K keys handled (see the sketch
> >     below).
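> >
> >     A rough sketch of such a scan (the Row interface and method names
> >     here are made up; the real code would iterate CacheDataRow as idle
> >     verify does today):
> >
> >     import java.util.Iterator;
> >     import java.util.function.LongSupplier;
> >
> >     class PartitionHasher {
> >         /** Minimal stand-in for CacheDataRow: only the two hash components we fold in. */
> >         interface Row {
> >             int keyHash();
> >             int valueHash();
> >         }
> >
> >         /**
> >          * Computes the partition hash, re-reading the update counter every k rows.
> >          * Returns null if the partition changed concurrently (caller retries later).
> >          */
> >         static Long hashIfUnchanged(Iterator<Row> rows, LongSupplier updateCntr, int k) {
> >             long cntrBefore = updateCntr.getAsLong();
> >             long partHash = 0;
> >             int handled = 0;
> >
> >             while (rows.hasNext()) {
> >                 Row row = rows.next();
> >
> >                 partHash += row.keyHash();
> >                 partHash += row.valueHash();
> >
> >                 if (++handled % k == 0 && updateCntr.getAsLong() != cntrBefore)
> >                     return null; // concurrent update detected mid-scan
> >             }
> >
> >             return updateCntr.getAsLong() == cntrBefore ? partHash : null;
> >         }
> >     }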
> >
> >     3)
> >
> >>     Another hope is that we'll be able to pause/continue the scan; for
> >>     example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and
> >>     in three days we'll have checked the whole cluster.
> >     Totally makes sense.
> >     We may find ourselves in a situation where some "hot" partitions
> >     are still unprocessed, and every next attempt to calculate the
> >     partition hash fails due to another concurrent update. We should
> >     be able to track the progress of validation (% of calculation time
> >     wasted due to concurrent operations may be a good metric, 100% being
> >     the worst case) and provide an option to stop/pause the activity.
> >     I think pause should return an "intermediate results" report with
> >     information about which partitions have been successfully checked.
> >     With such a report, we can resume the activity later: partitions
> >     from the report will simply be skipped (see the sketch below).
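> >
> >     Such a report could be as simple as the sketch below (all names are
> >     made up, not existing code):
> >
> >     import java.util.HashSet;
> >     import java.util.Set;
> >
> >     /** Hypothetical intermediate report: which partitions are already verified. */
> >     class VerifyProgress {
> >         private final Set<Integer> checkedParts = new HashSet<>();
> >         private long scanNanos;   // total time spent scanning
> >         private long wastedNanos; // time thrown away due to concurrent updates
> >
> >         void onPartitionChecked(int partId, long nanos) {
> >             checkedParts.add(partId);
> >             scanNanos += nanos;
> >         }
> >
> >         void onScanDiscarded(long nanos) {
> >             scanNanos += nanos;
> >             wastedNanos += nanos;
> >         }
> >
> >         /** 100% means every scan was invalidated by concurrent load (the worst case). */
> >         double wastedPercent() {
> >             return scanNanos == 0 ? 0 : 100.0 * wastedNanos / scanNanos;
> >         }
> >
> >         /** On resume, partitions from the report are simply skipped. */
> >         boolean alreadyChecked(int partId) {
> >             return checkedParts.contains(partId);
> >         }
> >     }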
> >
> >     4)
> >
> >>     Since "Idle verify" uses regular page memory, I assume it replaces
> >>     hot data with persisted data.
> >>     So, we have to warm up the cluster after each check.
> >>     Is there any chance to check without cooling the cluster down?
> >     I don't see an easy way to achieve it with our page memory
> >     architecture. We definitely can't just read pages from disk
> >     directly: we need to synchronize page access with concurrent
> >     update operations and checkpoints.
> >     From my point of view, the correct way to solve this issue is to
> >     improve our page replacement [1] mechanics by making it truly
> >     scan-resistant.
> >
> >     P. S. There's another possible way of achieving online verify:
> >     instead of on-demand hash calculation, we can always keep an
> >     up-to-date hash value for every partition. We'll need to update the
> >     hash on every insert/update/remove operation, but there will be no
> >     reordering issues, since the function that we use for aggregating
> >     hash results (+) is commutative (see the sketch below). With a
> >     pre-calculated partition hash value, we can automatically detect
> >     inconsistent partitions on every PME. What do you think?
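> >
> >     A minimal sketch of what "always keep an up-to-date hash" could mean
> >     (class and hooks are hypothetical; the per-entry term mirrors what
> >     idle verify sums per row):
> >
> >     import java.util.concurrent.atomic.AtomicLong;
> >
> >     /** Hypothetical always-up-to-date partition hash, maintained on every write. */
> >     class LivePartitionHash {
> >         private final AtomicLong hash = new AtomicLong();
> >
> >         /** Contribution of a single entry. */
> >         private static long entryHash(int keyHash, int valueHash) {
> >             return (long)keyHash + valueHash;
> >         }
> >
> >         void onInsert(int keyHash, int valHash) {
> >             hash.addAndGet(entryHash(keyHash, valHash));
> >         }
> >
> >         void onUpdate(int keyHash, int oldValHash, int newValHash) {
> >             // '+' is commutative and invertible, so deltas may be applied in any order.
> >             hash.addAndGet(entryHash(keyHash, newValHash) - entryHash(keyHash, oldValHash));
> >         }
> >
> >         void onRemove(int keyHash, int valHash) {
> >             hash.addAndGet(-entryHash(keyHash, valHash));
> >         }
> >
> >         long current() {
> >             return hash.get();
> >         }
> >     }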
> >
> >     [1] -
> >
> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
> >
> >     Best Regards,
> >     Ivan Rakov
> >
> >     On 29.04.2019 12:20, Anton Vinogradov wrote:
> >>     Igniters and especially Ivan Rakov,
> >>
> >>     "Idle verify" [1] is a really cool tool to make sure that the
> >>     cluster is consistent.
> >>
> >>     1) But it requires operations to be paused during the cluster check.
> >>     On some clusters, this check takes hours (3-4 hours in the cases I
> >>     saw).
> >>     I've checked the code of "idle verify" and it seems it's possible
> >>     to make it "online" with some assumptions.
> >>
> >>     Idea:
> >>     Currently "Idle verify" checks that the partition hashes, generated
> >>     this way:
> >>
> >>     while (it.hasNextX()) {
> >>         CacheDataRow row = it.nextX();
> >>
> >>         partHash += row.key().hashCode();
> >>         partHash += Arrays.hashCode(
> >>             row.value().valueBytes(grpCtx.cacheObjectContext()));
> >>     }
> >>
> >>     , are the same.
> >>
> >>     What if we generate the same updateCounter-partitionHash pairs but
> >>     compare hashes only in case the counters are the same?
> >>     So, for example, we will ask the cluster to generate pairs for 64
> >>     partitions, then find that 55 have the same counters (were not
> >>     updated during the check) and check them.
> >>     The remaining (64-55 = 9) partitions will be re-requested and
> >>     rechecked together with an additional 55.
> >>     This way we'll be able to check that the cluster is consistent even
> >>     in case operations are in progress (just retrying the modified
> >>     partitions); see the sketch below.
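> >>
> >>     A rough sketch of one comparison round (all names are made up, not
> >>     existing Ignite code):
> >>
> >>     import java.util.HashSet;
> >>     import java.util.List;
> >>     import java.util.Map;
> >>     import java.util.Set;
> >>
> >>     /** One (updateCounter, partitionHash) pair reported by a single owner. */
> >>     class PartSnapshot {
> >>         final long updateCntr;
> >>         final long hash;
> >>
> >>         PartSnapshot(long updateCntr, long hash) {
> >>             this.updateCntr = updateCntr;
> >>             this.hash = hash;
> >>         }
> >>     }
> >>
> >>     class OnlineVerify {
> >>         /**
> >>          * Compares owners' snapshots for the given partitions. Partitions whose
> >>          * counters disagree between owners (i.e. were updated during the scan)
> >>          * are returned for the next round; conflicts collects real mismatches.
> >>          */
> >>         static Set<Integer> verifyRound(Map<Integer, List<PartSnapshot>> snapshotsByPart,
> >>             Set<Integer> conflicts) {
> >>             Set<Integer> retry = new HashSet<>();
> >>
> >>             for (Map.Entry<Integer, List<PartSnapshot>> e : snapshotsByPart.entrySet()) {
> >>                 List<PartSnapshot> copies = e.getValue();
> >>
> >>                 boolean sameCntr =
> >>                     copies.stream().map(s -> s.updateCntr).distinct().count() == 1;
> >>
> >>                 if (!sameCntr) {
> >>                     retry.add(e.getKey()); // updated during the check, re-request it
> >>                     continue;
> >>                 }
> >>
> >>                 if (copies.stream().map(s -> s.hash).distinct().count() != 1)
> >>                     conflicts.add(e.getKey()); // counters equal, hashes differ: broken sync
> >>             }
> >>
> >>             return retry;
> >>         }
> >>     }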
> >>
> >>     Risks and assumptions:
> >>     Using this strategy we'll check the cluster's consistency ...
> >>     eventually, and the check will take more time even on an idle
> >>     cluster.
> >>     In case operationsPerTimeToGeneratePartitionHashes >
> >>     partitionsCount, we'll definitely make no progress.
> >>     But if the load is not high, we'll be able to check the whole
> >>     cluster.
> >>
> >>     Another hope is that we'll be able to pause/continue the scan; for
> >>     example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and
> >>     in three days we'll have checked the whole cluster.
> >>
> >>     Have I missed something?
> >>
> >>     2) Since "Idle verify" uses regular page memory, I assume it
> >>     replaces hot data with persisted data.
> >>     So, we have to warm up the cluster after each check.
> >>     Is there any chance to check without cooling the cluster down?
> >>
> >>     [1]
> >>
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> >
>
