Ivan, thanks for the detailed explanation. I'll try to implement the PoC to check the idea.
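As a starting point I'm thinking of something like the sketch below: a per-partition hash that is kept on-heap (the same way pCntr is) and adjusted on every insert/update/remove, so the aggregation stays commutative and primary/backup converge to the same value regardless of apply order. All names here are hypothetical, this is not an existing Ignite API; only the per-entry hashing formula is taken from the current "idle verify" code.

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch: per-partition hash kept up to date on every cache
 * operation, similarly to how the partition update counter (pCntr) is
 * maintained on-heap. Addition is commutative, so the order in which primary
 * and backup apply the same set of entries does not matter.
 */
public class PartitionHashHolder {
    /** Aggregated hash of all entries currently in the partition. */
    private final AtomicLong hash = new AtomicLong();

    /** Called on insert/update/remove; oldEntryHash/newEntryHash are 0 when the entry is absent. */
    public void onUpdate(long oldEntryHash, long newEntryHash) {
        // Remove the contribution of the previous version, add the new one.
        hash.addAndGet(newEntryHash - oldEntryHash);
    }

    /** Hash contribution of a single entry: same formula "idle verify" uses today. */
    public static long entryHash(int keyHash, byte[] valBytes) {
        return (long)keyHash + Arrays.hashCode(valBytes);
    }

    /** Value to compare across all owners of the partition on PME. */
    public long value() {
        return hash.get();
    }
}

On PME every owner would simply report this value for each of its partitions, and differing values for the same partition would flag an inconsistency.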
On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > But how to keep this hash?
> I think we can just adopt the way partition update counters are stored.
> Update counters are:
> 1) Kept and updated in heap, see IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during regular cache operations, no page replacement latency issues)
> 2) Synchronized with page memory (and with disk) on every checkpoint, see GridCacheOffheapManager#saveStoreMetadata
> 3) Stored in the partition meta page, see PagePartitionMetaIO#setUpdateCounter
> 4) On node restart, we init the onheap counter with the value from disk (for the moment of the last checkpoint) and update it to the latest value during WAL logical records replay
>
> > 2) PME is a rare operation on a production cluster, but, it seems, we have to check consistency in a regular way.
> > Since we have to finish all operations before the check, should we have a fake PME for maintenance checks in this case?
> From my experience, PME happens on prod clusters from time to time (several times per week), which can be enough. In case it's needed to check consistency more often than regular PMEs occur, we can implement a command that will trigger a fake PME for consistency checking.
>
> Best Regards,
> Ivan Rakov
>
> On 29.04.2019 18:53, Anton Vinogradov wrote:
> > Ivan, thanks for the analysis!
> >
> > >> With having a pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME.
> > Great idea, it seems this covers all broken sync cases.
> >
> > It will check alive nodes in case the primary failed immediately, and will check a rejoining node once it has finished a rebalance (PME on becoming an owner).
> > A recovered cluster will be checked on the activation PME (or even before that?).
> > Also, a warmed cluster will still be warmed after the check.
> >
> > Have I missed any cases leading to broken sync, except bugs?
> >
> > 1) But how to keep this hash?
> > - It should be automatically persisted on each checkpoint (it should not require recalculation on restore, snapshots should be covered too) (and covered by WAL?).
> > - It should always be available in RAM for every partition (even for cold partitions never updated/read on this node) to be immediately used once all operations are done on PME.
> >
> > Can we have special pages to keep such hashes and never allow their eviction?
> >
> > 2) PME is a rare operation on a production cluster, but, it seems, we have to check consistency in a regular way.
> > Since we have to finish all operations before the check, should we have a fake PME for maintenance checks in this case?
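To make the analogy with update counters concrete, the lifecycle of such a hash could look roughly like the sketch below, mirroring the storage scheme Ivan describes at the top of this thread (pCntr-style on-heap value, saveStoreMetadata on checkpoint, partition meta page, WAL logical replay on restart). All types and method names here are hypothetical stand-ins; in particular, PagePartitionMetaIO has no setPartitionHash today, it would have to be added by analogy with setUpdateCounter.

/**
 * Hypothetical sketch of how a per-partition hash could follow the update
 * counter lifecycle: kept on-heap, flushed to the partition meta page on
 * checkpoint, restored from disk plus WAL logical replay on restart.
 * None of these types or methods exist in Ignite; they illustrate the analogy.
 */
public class PartitionHashLifecycle {
    /** On-heap value, updated on every cache operation (like pCntr). */
    private long partHash;

    /** Analog of GridCacheOffheapManager#saveStoreMetadata: persist the value on checkpoint. */
    public void onCheckpoint(PartitionMetaPage metaPage) {
        metaPage.setPartitionHash(partHash); // analog of PagePartitionMetaIO#setUpdateCounter
    }

    /** On restart: start from the last checkpointed value, then roll forward over WAL logical records. */
    public void onRestart(PartitionMetaPage metaPage, Iterable<WalEntryRecord> recordsAfterCheckpoint) {
        partHash = metaPage.getPartitionHash();

        for (WalEntryRecord rec : recordsAfterCheckpoint)
            partHash += rec.newEntryHash() - rec.oldEntryHash(); // commutative, order does not matter
    }

    /** Stand-in for the partition meta page extended with a hash field. */
    interface PartitionMetaPage {
        long getPartitionHash();
        void setPartitionHash(long hash);
    }

    /** Stand-in for a WAL logical (data) record describing one entry change. */
    interface WalEntryRecord {
        long oldEntryHash();
        long newEntryHash();
    }
}

Rolling forward over the WAL logical records is what keeps the hash, like the counter, correct for updates that happened after the last checkpoint.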
> > On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> >
> > Hi Anton,
> >
> > Thanks for sharing your ideas.
> > I think your approach should work in general. I'll just share my concerns about possible issues that may come up.
> >
> > 1) Equality of update counters doesn't imply equality of partition content under load.
> > For every update, the primary node generates an update counter, and then the update is delivered to the backup node and gets applied with the corresponding update counter. For example, there are two transactions (A and B) that update partition X by the following scenario:
> > - A updates key1 in partition X on the primary node and increments the counter to 10
> > - B updates key2 in partition X on the primary node and increments the counter to 11
> > - While A is still updating other keys, B is finally committed
> > - Update of key2 arrives at the backup node and sets the update counter to 11
> > An observer will see equal update counters (11), but the update of key1 is still missing in the backup partition.
> > This is a fundamental problem which is being solved here: https://issues.apache.org/jira/browse/IGNITE-10078
> > "Online verify" should operate with the new complex update counters which take such "update holes" into account. Otherwise, online verify may provide false-positive inconsistency reports.
> >
> > 2) Acquisition and comparison of update counters is fast, but partition hash calculation is long. We should check that the update counter remains unchanged after every K keys handled.
> >
> > 3)
> >
> >> Another hope is that we'll be able to pause/continue scan, for example, we'll check 1/3 partitions today, 1/3 tomorrow, and in three days we'll check the whole cluster.
> > Totally makes sense.
> > We may find ourselves in a situation where some "hot" partitions are still unprocessed, and every next attempt to calculate the partition hash fails due to another concurrent update. We should be able to track progress of validation (% of calculation time wasted due to concurrent operations may be a good metric, 100% is the worst case) and provide an option to stop/pause the activity.
> > I think pause should return an "intermediate results report" with information about which partitions have been successfully checked. With such a report, we can resume the activity later: partitions from the report will be just skipped.
> >
> > 4)
> >
> >> Since "Idle verify" uses regular pagemem, I assume it replaces hot data with persisted.
> >> So, we have to warm up the cluster after each check.
> >> Are there any chances to check without cooling the cluster?
> > I don't see an easy way to achieve it with our page memory architecture. We definitely can't just read pages from disk directly: we need to synchronize page access with concurrent update operations and checkpoints.
> > From my point of view, the correct way to solve this issue is improving our page replacement [1] mechanics by making it truly scan-resistant.
> >
> > P. S. There's another possible way of achieving online verify: instead of on-demand hash calculation, we can always keep an up-to-date hash value for every partition. We'll need to update the hash on every insert/update/remove operation, but there will be no reordering issues, as the function that we use for aggregating hash results (+) is commutative. With having a pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME. What do you think?
> >
> > [1] - https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
> >
> > Best Regards,
> > Ivan Rakov
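Regarding point 2 above (re-checking the update counter after every K keys): on each node the per-partition calculation could look like the sketch below. Only the hashing formula comes from the existing idle verify code; the surrounding types and names are hypothetical.

import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical sketch: compute a partition hash, aborting if the partition is updated concurrently. */
public class OnlinePartitionHasher {
    /** Re-check the update counter after every K handled keys. */
    private static final int K = 1000;

    /**
     * @return The (updateCounter, partHash) pair, or null if the partition
     *      was modified while we were iterating and must be retried later.
     */
    public static long[] hashPartition(PartitionView part) {
        long cntrBefore = part.updateCounter();
        long partHash = 0;
        int handled = 0;

        Iterator<EntryView> it = part.entries();

        while (it.hasNext()) {
            EntryView row = it.next();

            // Same formula as the current "idle verify" uses.
            partHash += row.keyHashCode();
            partHash += Arrays.hashCode(row.valueBytes());

            // Periodically make sure nobody updated the partition under us.
            if (++handled % K == 0 && part.updateCounter() != cntrBefore)
                return null; // Concurrent update -> retry this partition later.
        }

        if (part.updateCounter() != cntrBefore)
            return null;

        return new long[] {cntrBefore, partHash};
    }

    /** Stand-ins for partition and entry access. */
    interface PartitionView {
        long updateCounter();
        Iterator<EntryView> entries();
    }

    interface EntryView {
        int keyHashCode();
        byte[] valueBytes();
    }
}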
> > On 29.04.2019 12:20, Anton Vinogradov wrote:
> >> Igniters and especially Ivan Rakov,
> >>
> >> "Idle verify" [1] is a really cool tool to make sure that the cluster is consistent.
> >>
> >> 1) But it requires operations to be paused during the cluster check.
> >> At some clusters, this check takes hours (3-4 hours in cases I saw).
> >> I've checked the code of "idle verify" and it seems possible to make it "online" with some assumptions.
> >>
> >> Idea:
> >> Currently "Idle verify" checks that partition hashes, generated this way
> >>
> >> while (it.hasNextX()) {
> >>     CacheDataRow row = it.nextX();
> >>
> >>     partHash += row.key().hashCode();
> >>     partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
> >> }
> >>
> >> , are the same.
> >>
> >> What if we generate the same updateCounter-partitionHash pairs but compare hashes only in case the counters are the same?
> >> So, for example, we will ask the cluster to generate pairs for 64 partitions, then find that 55 have the same counters (were not updated during the check) and check them.
> >> The rest (64 - 55 = 9) partitions will be re-requested and rechecked with an additional 55.
> >> This way we'll be able to check that the cluster is consistent even in case operations are in progress (just retrying the modified ones).
> >>
> >> Risks and assumptions:
> >> Using this strategy we'll check the cluster's consistency ... eventually, and the check will take more time even on an idle cluster.
> >> In case operationsPerTimeToGeneratePartitionHashes > partitionsCount, we'll definitely gain no progress.
> >> But in case the load is not high, we'll be able to check the whole cluster.
> >>
> >> Another hope is that we'll be able to pause/continue scan, for example, we'll check 1/3 partitions today, 1/3 tomorrow, and in three days we'll check the whole cluster.
> >>
> >> Have I missed something?
> >>
> >> 2) Since "Idle verify" uses regular pagemem, I assume it replaces hot data with persisted.
> >> So, we have to warm up the cluster after each check.
> >> Are there any chances to check without cooling the cluster?
> >>
> >> [1] https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
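To illustrate the retry idea from point 1 of the original proposal (compare hashes only where update counters match, re-request the rest), the driver side could look roughly like the following sketch. All names are hypothetical, and the "update holes" caveat from IGNITE-10078 discussed above is ignored for brevity.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of an "online verify" driver: keep retrying partitions modified during the check. */
public class OnlineVerifyDriver {
    /** (updateCounter, partHash) pair reported by one owner of a partition. */
    public static class PartSnapshot {
        final long updateCounter;
        final long partHash;

        public PartSnapshot(long updateCounter, long partHash) {
            this.updateCounter = updateCounter;
            this.partHash = partHash;
        }
    }

    /** Asks all owners for the current (counter, hash) pairs of the given partitions. */
    public interface SnapshotSource {
        Map<Integer, PartSnapshot[]> snapshots(Set<Integer> parts);
    }

    /**
     * @return Partitions whose owners reported equal counters but different hashes.
     *      Partitions updated between rounds are simply re-requested in the next round.
     */
    public static Set<Integer> verify(SnapshotSource src, Set<Integer> parts, int maxRounds) {
        Set<Integer> broken = new HashSet<>();
        Set<Integer> toCheck = new HashSet<>(parts);

        for (int round = 0; round < maxRounds && !toCheck.isEmpty(); round++) {
            Map<Integer, PartSnapshot[]> snaps = src.snapshots(toCheck);
            Set<Integer> retry = new HashSet<>();

            for (Map.Entry<Integer, PartSnapshot[]> e : snaps.entrySet()) {
                PartSnapshot[] owners = e.getValue();
                boolean sameCntr = true;
                boolean sameHash = true;

                for (PartSnapshot s : owners) {
                    sameCntr &= s.updateCounter == owners[0].updateCounter;
                    sameHash &= s.partHash == owners[0].partHash;
                }

                if (!sameCntr)
                    retry.add(e.getKey());  // Partition was updated, check it again next round.
                else if (!sameHash)
                    broken.add(e.getKey()); // Counters equal but content differs -> inconsistency.
                // Else: counters and hashes match -> partition looks consistent, drop it.
            }

            toCheck = retry;
        }

        return broken; // Partitions still left in toCheck remain unverified ("hot" partitions).
    }
}

Partitions that are still being retried when the round limit is hit are exactly the "hot" partitions mentioned above; they would go into the intermediate results report so a later run can pick them up.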