Alexei,

Got it.
Could you please let me know once the PR is ready for review?
I currently have some questions, but they may be caused by the PR not
being final yet (e.g. why does the atomic counter still ignore misses?).

On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov <
alexey.scherbak...@gmail.com> wrote:

> Anton,
>
> 1) Extended counters will indeed answer the question of whether a partition
> can be safely restored to a synchronized state on all owners.
> The only condition is that at least one owner has no missed updates.
> If not, the partition must be moved to the LOST state, see [1],
> TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test,
> IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY
> This is a known issue and can happen if all partition owners were
> unavailable at some point.
> In such a case we could try to recover consistency using some complex
> recovery protocol, as you described. Related ticket: [2]
>
> 2) A bitset implementation is being considered as an option for GG Community
> Edition. There are no specific implementation dates at the moment.
>
> 3) As for "online" partition verification, I think the best option right
> now is to do verification partition by partition, using a read-only mode
> per cache group partition under load.
> While verification is in progress, all write ops are waiting, not rejected.
> This is the only 100% reliable way to compare partitions - by touching the
> actual data; all other ways, like a pre-computed hash, are error prone.
> There is already a ticket [3] for simplifying grid consistency verification
> which could be used as a basis for such functionality.
> As for avoiding cache pollution, we could try reading pages sequentially
> from disk, without lifting them into page memory, and computing some kind
> of commutative hash. It's safe under the partition write lock.
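>
> A minimal sketch of that commutative hash, with hypothetical helper names
> (this is not an Ignite API): per-page hashes are combined with '+', so the
> result does not depend on the order in which pages are read from disk.
>
>     import java.nio.ByteBuffer;
>
>     // Hypothetical sketch: per-page hashes are combined with '+', so the
>     // result is independent of the order in which pages are read.
>     class SequentialPartitionHash {
>         static long hashOf(Iterable<ByteBuffer> pages) {
>             long hash = 0;
>
>             for (ByteBuffer page : pages) {
>                 long pageHash = 1;
>
>                 while (page.remaining() >= Long.BYTES)
>                     pageHash = 31 * pageHash + page.getLong(); // fold page contents
>
>                 while (page.hasRemaining())
>                     pageHash = 31 * pageHash + page.get();     // trailing bytes, if any
>
>                 hash += pageHash; // commutative aggregation: page order does not matter
>             }
>
>             return hash;
>         }
>     }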
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11611
> [2] https://issues.apache.org/jira/browse/IGNITE-6324
> [3] https://issues.apache.org/jira/browse/IGNITE-11256
>
> On Mon, May 6, 2019 at 4:12 PM Anton Vinogradov <a...@apache.org> wrote:
>
> > Ivan,
> >
> > 1) I've checked the PR [1] and it looks like it does not solve the issue
> > either.
> > AFAICS, the main goal here (in the PR) is to produce
> > PartitionUpdateCounter#sequential, which can be false for all backups;
> > which backup should win in that case?
> >
> > Is there an IEP or some other design page for this fix?
> >
> > It looks like extended counters should be able to recover the whole
> > cluster even in the case where all copies of the same partition are broken.
> > So it seems the counter should provide detailed info:
> > - the highest applied updateCounter
> > - the list of all missed counters below the highest applied one
> > - an optional hash
> >
> > In that case, we'll be able to perform some exchange between the broken
> > copies.
> > For example, we may find that copy1 missed key1 and copy2 missed key2;
> > it's pretty simple to fix both copies in that case.
> > If all misses can be resolved this way, we'll continue cluster activation
> > as if the partition had never been broken.
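> >
> > A minimal sketch of such an exchange, with hypothetical types (this is
> > not the actual PR code): each copy exposes its highest applied counter
> > plus the set of missed counters, and asks its peer only for the updates
> > it missed and the peer did apply.
> >
> >     import java.util.NavigableSet;
> >     import java.util.SortedSet;
> >     import java.util.TreeSet;
> >
> >     // Hypothetical state exposed by an "extended" counter.
> >     class CounterState {
> >         long highestApplied;       // highest update counter applied on this copy
> >         NavigableSet<Long> missed; // counters below highestApplied that were skipped
> >
> >         // Updates this copy must fetch from 'other' to become consistent:
> >         // everything we missed (or never reached) that the other copy did apply.
> >         SortedSet<Long> missingComparedTo(CounterState other) {
> >             SortedSet<Long> need = new TreeSet<>(missed);
> >
> >             for (long c = highestApplied + 1; c <= other.highestApplied; c++)
> >                 need.add(c);
> >
> >             need.removeAll(other.missed); // the peer cannot provide what it missed too
> >
> >             return need;
> >         }
> >     }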
> >
> > 2) It seems I see a simpler solution for handling misses (than the one in
> > the PR).
> > Once you get newUpdateCounter > curUpdateCounter + 1, you add a byte (or
> > int or long - the smallest possible) value to a special structure.
> > This value represents the delta between newUpdateCounter and
> > curUpdateCounter as a bitmask.
> > When you later handle an updateCounter less than curUpdateCounter, you
> > update the value in the structure responsible for this delta.
> > For example, for the delta "2 to 6" you will have 00000000 initially and
> > 00011111 finally.
> > Each delta update should end with a check of whether the delta is complete
> > (value == 31 in this case); once it is complete, it should be removed from
> > the structure.
> > Delta values can and should be reused to avoid GC pressure.
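> >
> > A minimal sketch of that bookkeeping, with hypothetical names (not the PR
> > code): each gap between curUpdateCounter and a newly observed counter is
> > tracked as a bitmask and dropped once all of its bits are set.
> >
> >     import java.util.Map;
> >     import java.util.TreeMap;
> >
> >     // Hypothetical gap tracker: one bitmask per "hole" in the counter sequence.
> >     class GapTracker {
> >         // Key: first missed counter of a gap; value: { bitmask, gap size (<= 63 here) }.
> >         private final TreeMap<Long, long[]> gaps = new TreeMap<>();
> >
> >         // Called when newCntr > curCntr + 1: remember the hole (curCntr + 1 .. newCntr - 1).
> >         void onGap(long curCntr, long newCntr) {
> >             gaps.put(curCntr + 1, new long[] { 0L, newCntr - curCntr - 1 });
> >         }
> >
> >         // Called when an out-of-order update with counter 'cntr' is finally applied.
> >         void onApplied(long cntr) {
> >             Map.Entry<Long, long[]> gap = gaps.floorEntry(cntr); // gap that may contain 'cntr'
> >
> >             if (gap == null || cntr >= gap.getKey() + gap.getValue()[1])
> >                 return;                                          // 'cntr' is not inside a gap
> >
> >             gap.getValue()[0] |= 1L << (cntr - gap.getKey());    // set the bit for this counter
> >
> >             if (gap.getValue()[0] == (1L << gap.getValue()[1]) - 1) // e.g. "2 to 6": 00011111 == 31
> >                 gaps.remove(gap.getKey());                       // gap fully filled, drop it
> >         }
> >
> >         boolean hasMisses() {
> >             return !gaps.isEmpty();
> >         }
> >     }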
> >
> > What do you think about the proposed solution?
> >
> > 3) Hash computation can be an additional extension to the extended
> > counters - just one more dimension to be extremely sure everything is ok.
> > Any objections?
> >
> > [1] https://github.com/apache/ignite/pull/5765
> >
> > On Mon, May 6, 2019 at 12:48 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> >
> > > Anton,
> > >
> > > Automatic quorum-based partition drop may work as a partial workaround
> > > for IGNITE-10078, but the discussed approach surely doesn't replace the
> > > IGNITE-10078 activity. We still don't know what to do when a quorum
> > > can't be reached (2 partitions have hash X, 2 have hash Y), and keeping
> > > extended update counters is the only way to resolve such a case.
> > > On the other hand, validation of precalculated partition hashes on PME
> > > can be a good addition to the IGNITE-10078 logic: we'll be able to
> > > detect situations where extended update counters are equal but, for
> > > some reason (a bug or whatever), partition contents are different.
> > >
> > > Best Regards,
> > > Ivan Rakov
> > >
> > > On 06.05.2019 12:27, Anton Vinogradov wrote:
> > > > Ivan, just to make sure ...
> > > > The discussed case will fully solve the issue [1] provided we also
> > > > add some strategy to reject partitions with missed updates
> > > > (updateCnt==Ok, Hash!=Ok).
> > > > For example, we may use a quorum strategy, where the majority wins.
> > > > Sounds correct?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
> > > >
> > > > On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org> wrote:
> > > >
> > > >> Ivan,
> > > >>
> > > >> Thanks for the detailed explanation.
> > > >> I'll try to implement the PoC to check the idea.
> > > >>
> > > >> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > >>
> > > >>>> But how to keep this hash?
> > > >>> I think we can just adopt the way partition update counters are
> > > >>> stored. Update counters are:
> > > >>> 1) Kept and updated on-heap, see
> > > >>> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed
> > > >>> during regular cache operations, no page replacement latency issues)
> > > >>> 2) Synchronized with page memory (and with disk) on every checkpoint,
> > > >>> see GridCacheOffheapManager#saveStoreMetadata
> > > >>> 3) Stored in the partition meta page, see
> > > >>> PagePartitionMetaIO#setUpdateCounter
> > > >>> 4) On node restart, we initialize the on-heap counter with the value
> > > >>> from disk (as of the last checkpoint) and advance it to the latest
> > > >>> value during replay of WAL logical records
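> > > >>>
> > > >>> A purely illustrative sketch of the same lifecycle applied to a
> > > >>> partition hash (hypothetical names, not actual Ignite classes):
> > > >>>
> > > >>>     // Hypothetical holder mirroring the update counter lifecycle for a hash.
> > > >>>     // Not thread-safe: a real implementation would synchronize with
> > > >>>     // cache updates and checkpoints.
> > > >>>     class PartitionHashHolder {
> > > >>>         private long hash;                    // 1) kept and updated on-heap
> > > >>>
> > > >>>         void onEntryChanged(long oldEntryHash, long newEntryHash) {
> > > >>>             hash += newEntryHash - oldEntryHash; // commutative, order-independent
> > > >>>         }
> > > >>>
> > > >>>         long valueForCheckpoint() {           // 2)-3) flushed to the meta page
> > > >>>             return hash;                      //        on every checkpoint
> > > >>>         }
> > > >>>
> > > >>>         void restore(long checkpointedHash) { // 4) initialized from disk ...
> > > >>>             hash = checkpointedHash;
> > > >>>         }
> > > >>>
> > > >>>         void replayWalRecord(long oldEntryHash, long newEntryHash) {
> > > >>>             onEntryChanged(oldEntryHash, newEntryHash); // ... then advanced by WAL replay
> > > >>>         }
> > > >>>     }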
> > > >>>
> > > >>>> 2) PME is a rare operation on a production cluster, but it seems
> > > >>>> we have to check consistency on a regular basis.
> > > >>>> Since we have to finish all operations before the check, should we
> > > >>>> have a fake PME for such a maintenance check?
> > > >>> From my experience, PME happens on prod clusters from time to time
> > > >>> (several times per week), which can be enough. In case it's needed to
> > > >>> check consistency more often than regular PMEs occur, we can implement
> > > >>> a command that triggers a fake PME for consistency checking.
> > > >>>
> > > >>> Best Regards,
> > > >>> Ivan Rakov
> > > >>>
> > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote:
> > > >>>> Ivan, thanks for the analysis!
> > > >>>>
> > > >>>>>> With a pre-calculated partition hash value, we can automatically
> > > >>>>>> detect inconsistent partitions on every PME.
> > > >>>> Great idea; it seems this covers all broken sync cases.
> > > >>>>
> > > >>>> It will check alive nodes immediately in case the primary failed,
> > > >>>> and will check a rejoining node once it has finished rebalance (the
> > > >>>> PME on becoming an owner).
> > > >>>> A recovered cluster will be checked on the activation PME (or even
> > > >>>> before that?).
> > > >>>> Also, a warmed-up cluster will still be warm after the check.
> > > >>>>
> > > >>>> Have I missed any cases that lead to broken sync, apart from bugs?
> > > >>>>
> > > >>>> 1) But how do we keep this hash?
> > > >>>> - It should be automatically persisted on each checkpoint (it
> > > >>>> should not require recalculation on restore; snapshots should be
> > > >>>> covered too) (and covered by WAL?).
> > > >>>> - It should always be available in RAM for every partition (even
> > > >>>> for cold partitions never updated/read on this node) so it can be
> > > >>>> used immediately once all operations are finished on PME.
> > > >>>>
> > > >>>> Can we have special pages to keep such hashes and never allow
> > > >>>> their eviction?
> > > >>>>
> > > >>>> 2) PME is a rare operation on a production cluster, but it seems
> > > >>>> we have to check consistency on a regular basis.
> > > >>>> Since we have to finish all operations before the check, should we
> > > >>>> have a fake PME for such a maintenance check?
> > > >>>>
> > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > >>>>
> > > >>>>      Hi Anton,
> > > >>>>
> > > >>>>      Thanks for sharing your ideas.
> > > >>>>      I think your approach should work in general. I'll just share
> > > >>>>      my concerns about possible issues that may come up.
> > > >>>>
> > > >>>>      1) Equality of update counters doesn't imply equality of
> > > >>>>      partition contents under load.
> > > >>>>      For every update, the primary node generates an update counter,
> > > >>>>      and then the update is delivered to the backup node and applied
> > > >>>>      with the corresponding update counter. For example, two
> > > >>>>      transactions (A and B) update partition X in the following
> > > >>>>      scenario:
> > > >>>>      - A updates key1 in partition X on the primary node and
> > > >>>>      increments the counter to 10
> > > >>>>      - B updates key2 in partition X on the primary node and
> > > >>>>      increments the counter to 11
> > > >>>>      - While A is still updating other keys, B finally commits
> > > >>>>      - The update of key2 arrives at the backup node and sets the
> > > >>>>      update counter to 11
> > > >>>>      An observer will see equal update counters (11), but the update
> > > >>>>      of key1 is still missing in the backup partition.
> > > >>>>      This is a fundamental problem which is being solved here:
> > > >>>>      https://issues.apache.org/jira/browse/IGNITE-10078
> > > >>>>      "Online verify" should operate with the new complex update
> > > >>>>      counters, which take such "update holes" into account. Otherwise,
> > > >>>>      online verify may produce false-positive inconsistency reports.
> > > >>>>
> > > >>>>      2) Acquiring and comparing update counters is fast, but
> > > >>>>      partition hash calculation is long. We should check that the
> > > >>>>      update counter remains unchanged after every K keys handled.
> > > >>>>
> > > >>>>      3)
> > > >>>>
> > > >>>>>      Another hope is that we'll be able to pause/continue the scan;
> > > >>>>>      for example, we'll check 1/3 of the partitions today, 1/3
> > > >>>>>      tomorrow, and in three days we'll have checked the whole cluster.
> > > >>>>      Totally makes sense.
> > > >>>>      We may find ourselves in a situation where some "hot" partitions
> > > >>>>      are still unprocessed, and every next attempt to calculate the
> > > >>>>      partition hash fails due to another concurrent update. We should
> > > >>>>      be able to track the progress of validation (the % of calculation
> > > >>>>      time wasted due to concurrent operations may be a good metric,
> > > >>>>      with 100% being the worst case) and provide an option to
> > > >>>>      stop/pause the activity.
> > > >>>>      I think pause should return an "intermediate results report" with
> > > >>>>      information about which partitions have been successfully checked.
> > > >>>>      With such a report, we can resume the activity later: partitions
> > > >>>>      from the report will just be skipped.
> > > >>>>
> > > >>>>      4)
> > > >>>>
> > > >>>>>      Since "Idle verify" uses the regular page memory, I assume it
> > > >>>>>      replaces hot data with persisted data.
> > > >>>>>      So, we have to warm up the cluster after each check.
> > > >>>>>      Is there any chance to check without cooling down the cluster?
> > > >>>>      I don't see an easy way to achieve this with our page memory
> > > >>>>      architecture. We definitely can't just read pages from disk
> > > >>>>      directly: we need to synchronize page access with concurrent
> > > >>>>      update operations and checkpoints.
> > > >>>>      From my point of view, the correct way to solve this issue is to
> > > >>>>      improve our page replacement [1] mechanics by making it truly
> > > >>>>      scan-resistant.
> > > >>>>
> > > >>>>      P.S. There's another possible way of achieving online verify:
> > > >>>>      instead of on-demand hash calculation, we can always keep an
> > > >>>>      up-to-date hash value for every partition. We'll need to update
> > > >>>>      the hash on every insert/update/remove operation, but there will
> > > >>>>      be no reordering issues, since the function we use for
> > > >>>>      aggregating hash results (+) is commutative. With a
> > > >>>>      pre-calculated partition hash value, we can automatically detect
> > > >>>>      inconsistent partitions on every PME. What do you think?
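> > > >>>>
> > > >>>>      A minimal sketch of that bookkeeping (hypothetical code, reusing
> > > >>>>      the per-row hash formula from idle verify): because '+' is
> > > >>>>      commutative and has '-' as its inverse, the running hash can be
> > > >>>>      adjusted on every insert/update/remove in any order.
> > > >>>>
> > > >>>>      import java.util.Arrays;
> > > >>>>
> > > >>>>      // Hypothetical running hash for one partition.
> > > >>>>      class RunningPartitionHash {
> > > >>>>          private long partHash;
> > > >>>>
> > > >>>>          private static long entryHash(int keyHash, byte[] valBytes) {
> > > >>>>              return keyHash + Arrays.hashCode(valBytes);
> > > >>>>          }
> > > >>>>
> > > >>>>          void onInsert(int keyHash, byte[] val) {
> > > >>>>              partHash += entryHash(keyHash, val);
> > > >>>>          }
> > > >>>>
> > > >>>>          void onUpdate(int keyHash, byte[] oldVal, byte[] newVal) {
> > > >>>>              partHash -= entryHash(keyHash, oldVal); // drop the old contribution ...
> > > >>>>              partHash += entryHash(keyHash, newVal); // ... and add the new one
> > > >>>>          }
> > > >>>>
> > > >>>>          void onRemove(int keyHash, byte[] val) {
> > > >>>>              partHash -= entryHash(keyHash, val);
> > > >>>>          }
> > > >>>>
> > > >>>>          long value() {
> > > >>>>              return partHash; // compared across all owners on PME
> > > >>>>          }
> > > >>>>      }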
> > > >>>>
> > > >>>>      [1] -
> > > >>>>      https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
> > > >>>>      Best Regards,
> > > >>>>      Ivan Rakov
> > > >>>>
> > > >>>>      On 29.04.2019 12:20, Anton Vinogradov wrote:
> > > >>>>>      Igniters and especially Ivan Rakov,
> > > >>>>>
> > > >>>>>      "Idle verify" [1] is a really cool tool for making sure that
> > > >>>>>      the cluster is consistent.
> > > >>>>>
> > > >>>>>      1) But it requires operations to be paused during the cluster
> > > >>>>>      check.
> > > >>>>>      On some clusters this check takes hours (3-4 hours in the cases
> > > >>>>>      I saw).
> > > >>>>>      I've checked the code of "idle verify", and it seems possible to
> > > >>>>>      make it "online" with some assumptions.
> > > >>>>>
> > > >>>>>      Idea:
> > > >>>>>      Currently "Idle verify" checks that partition hashes, generated
> > > >>>>>      this way:
> > > >>>>>
> > > >>>>>      while (it.hasNextX()) {
> > > >>>>>          CacheDataRow row = it.nextX();
> > > >>>>>
> > > >>>>>          partHash += row.key().hashCode();
> > > >>>>>          partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
> > > >>>>>      }
> > > >>>>>
> > > >>>>>      , are the same.
> > > >>>>>
> > > >>>>>      What if we generate the same updateCounter-partitionHash pairs
> > > >>>>>      but compare hashes only when the counters are the same?
> > > >>>>>      So, for example, we ask the cluster to generate pairs for 64
> > > >>>>>      partitions, then find that 55 have the same counters (were not
> > > >>>>>      updated during the check) and check them.
> > > >>>>>      The rest (64 - 55 = 9) partitions will be re-requested and
> > > >>>>>      rechecked together with an additional 55.
> > > >>>>>      This way we'll be able to check that the cluster is consistent
> > > >>>>>      even while operations are in progress (by just retrying the
> > > >>>>>      modified partitions). A rough sketch of the loop is below.
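> > > >>>>>
> > > >>>>>      A rough sketch of that retry loop (hypothetical helpers, not the
> > > >>>>>      idle verify code): every owner reports an
> > > >>>>>      (updateCounter, partHash) pair per partition; hashes are only
> > > >>>>>      compared when all owners report the same counter, otherwise the
> > > >>>>>      partition is retried in the next round.
> > > >>>>>
> > > >>>>>      import java.util.*;
> > > >>>>>
> > > >>>>>      abstract class OnlineVerifier {
> > > >>>>>          // partId -> {updateCounter, partHash} pairs, one per owner (hypothetical).
> > > >>>>>          abstract Map<Integer, List<long[]>> collectPairs(Set<Integer> parts);
> > > >>>>>
> > > >>>>>          // Reports partition 'p' as inconsistent (hypothetical).
> > > >>>>>          abstract void reportConflict(int p);
> > > >>>>>
> > > >>>>>          void verify(Set<Integer> parts) {
> > > >>>>>              Set<Integer> pending = new HashSet<>(parts);
> > > >>>>>
> > > >>>>>              while (!pending.isEmpty()) {
> > > >>>>>                  Map<Integer, List<long[]>> pairs = collectPairs(pending);
> > > >>>>>
> > > >>>>>                  for (Iterator<Integer> it = pending.iterator(); it.hasNext(); ) {
> > > >>>>>                      int p = it.next();
> > > >>>>>                      List<long[]> owners = pairs.get(p);
> > > >>>>>
> > > >>>>>                      if (owners.stream().map(o -> o[0]).distinct().count() > 1)
> > > >>>>>                          continue; // updated during the check, retry next round
> > > >>>>>
> > > >>>>>                      if (owners.stream().map(o -> o[1]).distinct().count() > 1)
> > > >>>>>                          reportConflict(p); // equal counters, different hashes
> > > >>>>>
> > > >>>>>                      it.remove(); // checked, drop from the working set
> > > >>>>>                  }
> > > >>>>>              }
> > > >>>>>          }
> > > >>>>>      }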
> > > >>>>>
> > > >>>>>      Risks and assumptions:
> > > >>>>>      Using this strategy, we'll check the cluster's consistency ...
> > > >>>>>      eventually, and the check will take more time even on an idle
> > > >>>>>      cluster.
> > > >>>>>      In case operationsPerTimeToGeneratePartitionHashes >
> > > >>>>>      partitionsCount, we'll definitely make no progress.
> > > >>>>>      But if the load is not high, we'll be able to check the whole
> > > >>>>>      cluster.
> > > >>>>>
> > > >>>>>      Another hope is that we'll be able to pause/continue the scan;
> > > >>>>>      for example, we'll check 1/3 of the partitions today, 1/3
> > > >>>>>      tomorrow, and in three days we'll have checked the whole cluster.
> > > >>>>>
> > > >>>>>      Have I missed something?
> > > >>>>>
> > > >>>>>      2) Since "Idle verify" uses the regular page memory, I assume
> > > >>>>>      it replaces hot data with persisted data.
> > > >>>>>      So, we have to warm up the cluster after each check.
> > > >>>>>      Is there any chance to check without cooling down the cluster?
> > > >>>>>
> > > >>>>>      [1]
> > > >>>>>      https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > >
> >
>
>
> --
>
> Best regards,
> Alexei Scherbakov
>
