Alexei,

Got it. Could you please let me know once the PR is ready for review? I already have some questions, but they are possibly caused by the PR not being final yet (e.g. why does the atomic counter still ignore misses?).
On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:

> Anton,
>
> 1) Extended counters will indeed answer the question whether a partition can be safely restored to a synchronized state on all owners.
> The only condition is that one of the owners has no missed updates.
> If not, the partition must be moved to the LOST state, see [1],
> TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test,
> IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY
> This is a known issue and can happen if all partition owners were unavailable at some point.
> In such a case we could try to recover consistency using a complex recovery protocol, as you described. Related ticket: [2]
>
> 2) A bitset implementation is considered as an option in GG Community Edition. No specific implementation dates at the moment.
>
> 3) As for "online" partition verification, I think the best option right now is to verify partition by partition, using a read-only mode per group partition, under load.
> While verification is in progress, all write ops are waiting, not rejected.
> This is the only 100% reliable way to compare partitions - by touching actual data; all other ways, like a pre-computed hash, are error prone.
> There is already a ticket [3] for simplifying grid consistency verification which could be used as a basis for such functionality.
> As for avoiding cache pollution, we could try to read pages sequentially from disk without lifting them to pagemem and compute some kind of commutative hash. It's safe under the partition write lock.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11611
> [2] https://issues.apache.org/jira/browse/IGNITE-6324
> [3] https://issues.apache.org/jira/browse/IGNITE-11256
>
> On Mon, May 6, 2019 at 4:12 PM Anton Vinogradov <a...@apache.org> wrote:
>
> > Ivan,
> >
> > 1) I've checked the PR [1] and it looks like it does not solve the issue either.
> > AFAICS, the main goal there is to produce PartitionUpdateCounter#sequential, which can be false for all backups; which backup should win in that case?
> >
> > Is there any IEP or other design page for this fix?
> >
> > It looks like extended counters should be able to recover the whole cluster even in case all copies of the same partition are broken.
> > So it seems the counter should provide detailed info:
> > - the biggest applied updateCounter
> > - the list of all counters missed before the biggest applied one
> > - an optional hash
> >
> > In that case, we'll be able to perform some exchange between broken copies.
> > For example, we'll find that copy1 missed key1 and copy2 missed key2.
> > It's pretty simple to fix both copies in that case.
> > In case all misses can be resolved this way, we'll continue cluster activation as if it had never been broken.
> >
> > 2) I think I see a simpler solution to handle misses (than the one in the PR).
> > Once you have newUpdateCounter > curUpdateCounter + 1, you add a byte (or int or long - the smallest possible) value to a special structure.
> > This value represents the delta between newUpdateCounter and curUpdateCounter as a bitmask.
> > In case you handle an updateCounter less than curUpdateCounter, you update the value in the structure responsible for this delta.
> > For example, when you have the delta "2 to 6", you will have 00000000 initially and 00011111 finally.
> > Each delta update should finish with a check that the delta is complete (value == 31 in this case). Once it is complete, it should be removed from the structure.
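To make sure I read the delta-as-bitmask proposal above correctly, here is a minimal, self-contained sketch of it. All names (GapTrackingCounter, onUpdate, sequential) are invented for illustration and are not taken from the PR; a real implementation would also have to bound delta sizes and persist the structure.

import java.util.BitSet;
import java.util.Iterator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Sketch of tracking missed update counters as per-delta bitmasks.
 * E.g. for the delta "2 to 6" a 5-bit mask is kept; once all bits are set
 * (value == 31) the delta is removed and the applied counter advances.
 */
class GapTrackingCounter {
    /** Highest counter up to which updates were applied without gaps. */
    private long applied;

    /** Registered deltas: start counter -> delta descriptor. Deltas tile the range (applied, last delta end]. */
    private final NavigableMap<Long, Gap> gaps = new TreeMap<>();

    private static final class Gap {
        private final long to;      // Inclusive end of the delta.
        private final BitSet bits;  // Bit i set => counter (start + i) was applied.

        private Gap(long from, long to) {
            this.to = to;
            bits = new BitSet((int)(to - from + 1));
        }
    }

    /** Called when an update with the given counter is applied locally. */
    synchronized void onUpdate(long cntr) {
        if (cntr <= applied)
            return; // Duplicate or already covered.

        if (gaps.isEmpty() && cntr == applied + 1)
            applied = cntr; // Contiguous update, no gap.
        else if (gaps.isEmpty() || cntr > gaps.lastEntry().getValue().to) {
            // Out-of-order update: register a new delta; only 'cntr' itself is applied so far.
            long from = gaps.isEmpty() ? applied + 1 : gaps.lastEntry().getValue().to + 1;

            Gap gap = new Gap(from, cntr);
            gap.bits.set((int)(cntr - from));
            gaps.put(from, gap);
        }
        else {
            // Late update filling one of the registered deltas.
            Map.Entry<Long, Gap> e = gaps.floorEntry(cntr);
            e.getValue().bits.set((int)(cntr - e.getKey()));
        }

        // Drop fully filled deltas adjacent to 'applied' and advance the counter.
        for (Iterator<Map.Entry<Long, Gap>> it = gaps.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry<Long, Gap> e = it.next();
            long len = e.getValue().to - e.getKey() + 1;

            if (e.getKey() == applied + 1 && e.getValue().bits.cardinality() == len) {
                applied = e.getValue().to;
                it.remove();
            }
            else
                break;
        }
    }

    /** @return {@code true} if there are no missed updates. */
    synchronized boolean sequential() {
        return gaps.isEmpty();
    }
}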
> > Deltas can and should be reused to solve the GC issue.
> >
> > What do you think about the proposed solution?
> >
> > 3) Hash computation can be an additional extension for extended counters, just one more dimension to be extremely sure everything is ok.
> > Any objections?
> >
> > [1] https://github.com/apache/ignite/pull/5765
> >
> > On Mon, May 6, 2019 at 12:48 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> >
> > > Anton,
> > >
> > > Automatic quorum-based partition drop may work as a partial workaround for IGNITE-10078, but the discussed approach surely doesn't replace the IGNITE-10078 activity. We still don't know what to do when a quorum can't be reached (2 partitions have hash X, 2 have hash Y), and keeping extended update counters is the only way to resolve such a case.
> > > On the other hand, precalculated partition hash validation on PME can be a good addition to the IGNITE-10078 logic: we'll be able to detect situations when extended update counters are equal, but for some reason (a bug or whatever) partition contents are different.
> > >
> > > Best Regards,
> > > Ivan Rakov
> > >
> > > On 06.05.2019 12:27, Anton Vinogradov wrote:
> > > > Ivan, just to make sure...
> > > > The discussed case will fully solve the issue [1] in case we also add some strategy to reject partitions with missed updates (updateCnt==Ok, Hash!=Ok).
> > > > For example, we may use the quorum strategy, where the majority wins.
> > > > Sounds correct?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
> > > >
> > > > On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org> wrote:
> > > >
> > > >> Ivan,
> > > >>
> > > >> Thanks for the detailed explanation.
> > > >> I'll try to implement a PoC to check the idea.
> > > >>
> > > >> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > >>
> > > >>>> But how to keep this hash?
> > > >>> I think we can just adopt the way partition update counters are stored.
> > > >>> Update counters are:
> > > >>> 1) Kept and updated in heap, see IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during regular cache operations, no page replacement latency issues)
> > > >>> 2) Synchronized with page memory (and with disk) on every checkpoint, see GridCacheOffheapManager#saveStoreMetadata
> > > >>> 3) Stored in the partition meta page, see PagePartitionMetaIO#setUpdateCounter
> > > >>> 4) On node restart, we init the onheap counter with the value from disk (for the moment of the last checkpoint) and update it to the latest value during WAL logical records replay
> > > >>>
> > > >>>> 2) PME is a rare operation on a production cluster, but, it seems, we have to check consistency in a regular way.
> > > >>>> Since we have to finish all operations before the check, should we have a fake PME for a maintenance check in this case?
> > > >>> From my experience, PME happens on prod clusters from time to time (several times per week), which can be enough. In case it's needed to check consistency more often than regular PMEs occur, we can implement a command that will trigger a fake PME for consistency checking.
> > > >>>
> > > >>> Best Regards,
> > > >>> Ivan Rakov
> > > >>>
> > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote:
> > > >>>> Ivan, thanks for the analysis!
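Regarding the "quorum strategy, where the majority wins" mentioned a few messages up: just to illustrate how I picture it, a minimal sketch under the assumption that each owner reports a single partition hash. All names are invented; the unresolved 2-vs-2 split Ivan mentions is exactly the empty result here.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

final class QuorumResolver {
    /**
     * @param hashByOwner partition hash reported by each owner node.
     * @return the hash held by a strict majority of owners, or empty if no
     *         majority exists (e.g. 2 owners report X and 2 report Y).
     */
    static Optional<Long> resolve(Map<UUID, Long> hashByOwner) {
        Map<Long, Integer> votes = new HashMap<>();

        // Count how many owners report each hash value.
        for (Long hash : hashByOwner.values())
            votes.merge(hash, 1, Integer::sum);

        return votes.entrySet().stream()
            .filter(e -> e.getValue() * 2 > hashByOwner.size()) // Strict majority only.
            .map(Map.Entry::getKey)
            .findFirst();
    }
}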
> > > >>>>>> With having pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME.
> > > >>>> Great idea; it seems this covers all broken sync cases.
> > > >>>>
> > > >>>> It will check alive nodes in case the primary failed immediately, and will check a rejoining node once it has finished rebalance (PME on becoming an owner).
> > > >>>> A recovered cluster will be checked on the activation PME (or even before that?).
> > > >>>> Also, a warmed-up cluster will still be warm after the check.
> > > >>>>
> > > >>>> Have I missed any cases that lead to broken sync, except bugs?
> > > >>>>
> > > >>>> 1) But how to keep this hash?
> > > >>>> - It should be automatically persisted on each checkpoint (it should not require recalculation on restore; snapshots should be covered too) (and covered by WAL?).
> > > >>>> - It should always be available in RAM for every partition (even for cold partitions never updated/read on this node) to be immediately used once all operations are done on PME.
> > > >>>>
> > > >>>> Can we have special pages to keep such hashes and never allow their eviction?
> > > >>>>
> > > >>>> 2) PME is a rare operation on a production cluster, but, it seems, we have to check consistency in a regular way.
> > > >>>> Since we have to finish all operations before the check, should we have a fake PME for a maintenance check in this case?
> > > >>>>
> > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > >>>>
> > > >>>> Hi Anton,
> > > >>>>
> > > >>>> Thanks for sharing your ideas.
> > > >>>> I think your approach should work in general. I'll just share my concerns about possible issues that may come up.
> > > >>>>
> > > >>>> 1) Equality of update counters doesn't imply equality of partition contents under load.
> > > >>>> For every update, the primary node generates an update counter, and then the update is delivered to the backup node and gets applied with the corresponding update counter. For example, there are two transactions (A and B) that update partition X by the following scenario:
> > > >>>> - A updates key1 in partition X on the primary node and increments the counter to 10
> > > >>>> - B updates key2 in partition X on the primary node and increments the counter to 11
> > > >>>> - While A is still updating other keys, B is finally committed
> > > >>>> - The update of key2 arrives at the backup node and sets the update counter to 11
> > > >>>> The observer will see equal update counters (11), but the update of key1 is still missing in the backup partition.
> > > >>>> This is a fundamental problem which is being solved here: https://issues.apache.org/jira/browse/IGNITE-10078
> > > >>>> "Online verify" should operate with new complex update counters which take such "update holes" into account. Otherwise, online verify may provide false-positive inconsistency reports.
> > > >>>>
> > > >>>> 2) Acquisition and comparison of update counters is fast, but partition hash calculation is long. We should check that the update counter remains unchanged after every K keys handled.
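Point 2 above (re-check the counter every K keys) could look roughly like this. The row iterator is abstracted to plain [key, value] byte arrays and 'updCntr' is only a stand-in for however the partition counter is actually read, so this is a sketch, not the real verify code.

import java.util.Arrays;
import java.util.Iterator;
import java.util.OptionalLong;
import java.util.function.LongSupplier;

final class OnlinePartitionHasher {
    /** How many keys to handle between update counter checks. */
    private static final int CHECK_EVERY_K_KEYS = 1000;

    /**
     * @param rows    iterator over [keyBytes, valueBytes] pairs of the partition (stand-in for the real row cursor).
     * @param updCntr reads the current partition update counter (stand-in for the real counter access).
     * @return the partition hash, or empty if the partition was concurrently updated and the result is not trustworthy.
     */
    static OptionalLong hashPartition(Iterator<byte[][]> rows, LongSupplier updCntr) {
        long cntrBefore = updCntr.getAsLong();
        long partHash = 0;
        int handled = 0;

        while (rows.hasNext()) {
            byte[][] row = rows.next();

            partHash += Arrays.hashCode(row[0]);
            partHash += Arrays.hashCode(row[1]);

            // Re-check the counter every K keys; abort early instead of producing a stale hash.
            if (++handled % CHECK_EVERY_K_KEYS == 0 && updCntr.getAsLong() != cntrBefore)
                return OptionalLong.empty();
        }

        return updCntr.getAsLong() == cntrBefore ? OptionalLong.of(partHash) : OptionalLong.empty();
    }
}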
> > > >>>> 3)
> > > >>>>
> > > >>>>> Another hope is that we'll be able to pause/continue the scan; for example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days we'll have checked the whole cluster.
> > > >>>> Totally makes sense.
> > > >>>> We may find ourselves in a situation where some "hot" partitions are still unprocessed, and every next attempt to calculate the partition hash fails due to another concurrent update. We should be able to track the progress of validation (% of calculation time wasted due to concurrent operations may be a good metric, 100% is the worst case) and provide an option to stop/pause the activity.
> > > >>>> I think pause should return an "intermediate results report" with information about which partitions have been successfully checked. With such a report, we can resume the activity later: partitions from the report will just be skipped.
> > > >>>>
> > > >>>> 4)
> > > >>>>
> > > >>>>> Since "Idle verify" uses regular pagemem, I assume it replaces hot data with persisted data.
> > > >>>>> So, we have to warm up the cluster after each check.
> > > >>>>> Are there any chances to check without cooling the cluster?
> > > >>>> I don't see an easy way to achieve it with our page memory architecture. We definitely can't just read pages from disk directly: we need to synchronize page access with concurrent update operations and checkpoints.
> > > >>>> From my point of view, the correct way to solve this issue is to improve our page replacement [1] mechanics by making it truly scan-resistant.
> > > >>>>
> > > >>>> P.S. There's another possible way of achieving online verify: instead of on-demand hash calculation, we can always keep an up-to-date hash value for every partition. We'll need to update the hash on every insert/update/remove operation, but there will be no reordering issues because the function that we use for aggregating hash results (+) is commutative. With a pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME. What do you think?
> > > >>>>
> > > >>>> [1] https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
> > > >>>>
> > > >>>> Best Regards,
> > > >>>> Ivan Rakov
> > > >>>>
> > > >>>> On 29.04.2019 12:20, Anton Vinogradov wrote:
> > > >>>>> Igniters, and especially Ivan Rakov,
> > > >>>>>
> > > >>>>> "Idle verify" [1] is a really cool tool to make sure that a cluster is consistent.
> > > >>>>>
> > > >>>>> 1) But it requires operations to be paused during the cluster check.
> > > >>>>> On some clusters, this check takes hours (3-4 hours in cases I saw).
> > > >>>>> I've checked the code of "idle verify" and it seems possible to make it "online" with some assumptions.
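The P.S. above (keep an always up-to-date partition hash, updated on every insert/update/remove) might look roughly like this. Names and the concrete per-entry hash function are only for illustration; the point is that, because the aggregation is plain addition, the order in which updates are applied on primary and backup does not matter.

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

final class IncrementalPartitionHash {
    /** Running partition hash, maintained incrementally. */
    private final AtomicLong hash = new AtomicLong();

    /** Hypothetical per-entry hash; any deterministic function of key and value would do. */
    private static long entryHash(byte[] key, byte[] val) {
        return 31L * Arrays.hashCode(key) + Arrays.hashCode(val);
    }

    void onInsert(byte[] key, byte[] val) {
        hash.addAndGet(entryHash(key, val));
    }

    void onUpdate(byte[] key, byte[] oldVal, byte[] newVal) {
        // Subtract the old contribution, add the new one.
        hash.addAndGet(entryHash(key, newVal) - entryHash(key, oldVal));
    }

    void onRemove(byte[] key, byte[] oldVal) {
        hash.addAndGet(-entryHash(key, oldVal));
    }

    /** Current partition hash; equal partitions yield equal values regardless of update order. */
    long value() {
        return hash.get();
    }
}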
> > > >>>>> Idea:
> > > >>>>> Currently, "Idle verify" checks that partition hashes, generated this way:
> > > >>>>>
> > > >>>>> while (it.hasNextX()) {
> > > >>>>>     CacheDataRow row = it.nextX();
> > > >>>>>     partHash += row.key().hashCode();
> > > >>>>>     partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
> > > >>>>> }
> > > >>>>>
> > > >>>>> are the same.
> > > >>>>>
> > > >>>>> What if we generate the same updateCounter-partitionHash pairs, but compare hashes only in case the counters are the same?
> > > >>>>> So, for example, we'll ask the cluster to generate pairs for 64 partitions, then find that 55 have the same counters (were not updated during the check) and check them.
> > > >>>>> The rest (64 - 55 = 9) partitions will be re-requested and rechecked together with an additional 55.
> > > >>>>> This way we'll be able to check that the cluster is consistent even in case operations are in progress (just retrying the modified ones).
> > > >>>>>
> > > >>>>> Risks and assumptions:
> > > >>>>> Using this strategy, we'll check the cluster's consistency... eventually, and the check will take more time even on an idle cluster.
> > > >>>>> In case operationsPerTimeToGeneratePartitionHashes > partitionsCount, we'll definitely make no progress.
> > > >>>>> But in case the load is not high, we'll be able to check the whole cluster.
> > > >>>>>
> > > >>>>> Another hope is that we'll be able to pause/continue the scan; for example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days we'll have checked the whole cluster.
> > > >>>>>
> > > >>>>> Have I missed something?
> > > >>>>>
> > > >>>>> 2) Since "Idle verify" uses regular pagemem, I assume it replaces hot data with persisted data.
> > > >>>>> So, we have to warm up the cluster after each check.
> > > >>>>> Are there any chances to check without cooling the cluster?
> > > >>>>>
> > > >>>>> [1] https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>
> --
> Best regards,
> Alexei Scherbakov
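Just to visualize the retry scheme from the original message (compare hashes only for partitions whose counters match, re-request the rest on the next pass), a rough sketch. 'collectPairs' is a placeholder for the actual cluster-wide request, and all names here are invented.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class OnlineVerify {
    /** Counter + hash reported by one owner for one partition. */
    static final class Pair {
        final long updCntr;
        final long hash;

        Pair(long updCntr, long hash) { this.updCntr = updCntr; this.hash = hash; }
    }

    /**
     * @param allParts     all partition ids to verify.
     * @param collectPairs for a set of partitions, returns per-owner results: partId -> list of (counter, hash) pairs.
     * @param maxPasses    give up on partitions that are still "hot" after this many passes.
     * @return ids of partitions whose owners disagree on the hash despite equal counters.
     */
    static Set<Integer> verify(Set<Integer> allParts,
                               Function<Set<Integer>, Map<Integer, List<Pair>>> collectPairs,
                               int maxPasses) {
        Set<Integer> broken = new HashSet<>();
        Set<Integer> remaining = new HashSet<>(allParts);

        for (int pass = 0; pass < maxPasses && !remaining.isEmpty(); pass++) {
            Map<Integer, List<Pair>> res = collectPairs.apply(remaining);
            Set<Integer> retry = new HashSet<>();

            for (Map.Entry<Integer, List<Pair>> e : res.entrySet()) {
                List<Pair> owners = e.getValue();

                // Counters differ => the partition was updated during the check, re-request it.
                if (owners.stream().mapToLong(p -> p.updCntr).distinct().count() > 1) {
                    retry.add(e.getKey());

                    continue;
                }

                // Equal counters but different hashes => inconsistent copy.
                if (owners.stream().mapToLong(p -> p.hash).distinct().count() > 1)
                    broken.add(e.getKey());
            }

            remaining = retry;
        }

        return broken;
    }
}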