Re: Read Repair (ex. Consistency Check) - review request #2

Nikolay Izhikov Thu, 11 Jul 2019 00:19:18 -0700

Anton,

> - metrics are better for monitoring, but the Event is enough for my wish to 
> cover AI with consistency check,


Can you clarify, do you have plans to add metrics of RR events?

I think it should be count of incosistency events per cache(maybe per 
partition).


В Чт, 11/07/2019 в 09:53 +0300, Anton Vinogradov пишет:
> Slava,
> 
> Thanks for your review first!
> 
> > > Anyway, I left some comments in your pull-request at github. Please take
> 
> a
> > > look. The most annoying thing is poorly documented code :(
> 
> Since your comments mostly about Javadoc (does this mean that my solution
> is so great that you ask me only to fix Javadocs :) ?),
> I'd like to propose you to provide fixes as a PR since you have a vision of
> how it should be made. I'll review them and merge shortly.
> 
> > > By the way, is it required to add test related to fail-over scenarios?
> 
> The best check is to use RR at real code.
> For example, I'm injecting RR now to the test with concurrent modifications
> and restarts [1].
> 
> > > I just checked, the IEP page and I still cannot find Jira tickets that
> > > should cover existing limitations/improvements.
> > > I would suggest creating the following tasks, at least:
> 
> Mostly agree with you, but
> - MVCC is not production ready,
> - not sure near support really required,
> - metrics are better for monitoring, but the Event is enough for my wish to
> cover AI with consistency check,
> - do we really need Platforms and Thin Client support?
> Also, we should not produce stillborn issue.
> All limitations listed at proxy creation method and they definitely are not
> showstoppers and may be fixed later if someone interested.
> Сoming back to AI 3.0 discussion, we have A LOT of features and it's almost
> impossible (require much more time that feature's cost) to support them all.
> I will be pretty happy in case someone will do this and provide help if
> necessary!
> 
> > > Perhaps, it would be a good idea to think about the recovery too
> 
> Do you mean per partition check and recovery?
> That's a good idea, but I found it's not easy to imagine API to for such
> tool.
> In case you ready to assist with proper API/design this will definitely
> help.
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-11973
> 
> On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <slava.kopti...@gmail.com>
> wrote:
> 
> > Perhaps, it would be a good idea to think about the recovery tool/
> > control-utility command that will allow achieving the same goal.
> > If I am not mistaken it was already proposed in the email thread.
> > 
> > Thanks,
> > S.
> > 
> > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <slava.kopti...@gmail.com>:
> > 
> > > Hi Anton,
> > > 
> > > Well, the ReadRepair feature is finally merged and that is good news :)
> > > 
> > > Unfortunately, I cannot find a consensus about the whole functionality in
> > > any of these topics:
> > >  -
> > > 
> > 
> > http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > >  -
> > > 
> > 
> > http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > Also, there are no comments/discussion in JIRA. That makes me sad :(
> > > especially when a feature is huge, not obvious and involves changing
> > 
> > public
> > > API (and that is the case, I think).
> > > 
> > > Anyway, I left some comments in your pull-request at github. Please take
> > 
> > a
> > > look. The most annoying thing is poorly documented code :(
> > > By the way, is it required to add test related to fail-over scenarios?
> > > 
> > > I just checked, the IEP page and I still cannot find Jira tickets that
> > > should cover existing limitations/improvements.
> > > I would suggest creating the following tasks, at least:
> > >  - MVCC support
> > >  - Near caches
> > >  - Additional metrics (number of violations, number of repaired entries
> > > etc)
> > >  - Ignite C++ (It looks like, .Net is already aware of that feature)
> > >  - Thin clients support
> > >  - Perhaps, it would be useful to support different strategies to resolve
> > > inconsistencies
> > > 
> > > What do you think?
> > > 
> > > Thanks,
> > > S.
> > > 
> > > 
> > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <a...@apache.org>:
> > > 
> > > > Folks,
> > > > 
> > > > Thanks to everyone for tips and reviews.
> > > > Yardstick checked, no performance drop found.
> > > > Additional measurement: RR get() is just up to 7% slower than regular
> > 
> > get
> > > > on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > Code merged to the master.
> > > > "Must have" tasks created and attached to the IEP.
> > > > 
> > > > On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <a...@apache.org> 
> > > > wrote:
> > > > 
> > > > > Folks,
> > > > > 
> > > > > Just a minor update.
> > > > > 
> > > > > RunAll [1] with enabled ReadRepair proxy is almost green now (~10
> > 
> > tests
> > > > > left, started with 6k :)).
> > > > > During the analisys, I've found some tests with
> > > > > - unexpected repairs at tx caches
> > > > > - inconsistent state after the test finished (different entries across
> > > > 
> > > > the
> > > > > topology)
> > > > > For example,
> > > > > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
> > > > > versions on backups in case of retry, fixed [2]
> > > > > - initial cache load generates not equal versions on backups, fixed
> > 
> > [3]
> > > > > - testAccountTxNodeRestart causes unexpected repairs (entries have
> > > > > different versions), to be investigated.
> > > > > 
> > > > > What's next?
> > > > > 
> > > > > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> > > > > becomes fully green.
> > > > > 2) Going to add special check after each test which will ensure caches
> > > > > content after the test is consistent.
> > > > > 2.1) The Same check can (should?) be injected to
> > > > > awaitPartitionMapExchange() and similar methods.
> > > > > 3) Update Jepsen tests with RR checks.
> > > > > 4) Introduce per partition RR.
> > > > > 
> > > > > So, the final goal is to be sure that Ignite produces only consistent
> > > > 
> > > > data
> > > > > and to have a feature to solve consistency in case we gain
> > 
> > inconsistent
> > > > > state somehow.
> > > > > 
> > > > > Limitations?
> > > > > 
> > > > > Currently, RR has some limitations, but they are not related to real
> > > > > production cases.
> > > > > In case someone interested to support, for example, MVCC or near
> > 
> > caches,
> > > > > please, feel free to contribute.
> > > > > 
> > > > > [1]
> > > > > 
> > 
> > https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > [2]
> > > > > 
> > 
> > https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > [3]
> > > > > 
> > 
> > https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > 
> > > > > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <a...@apache.org>
> > 
> > wrote:
> > > > > 
> > > > > > Slava,
> > > > > > 
> > > > > > > > I will take a look at your pull request if you don't mind.
> > > > > > 
> > > > > > Great news!
> > > > > > 
> > > > > > > > In any way, could you please update the IEP page with the list 
> > > > > > > > of
> > > > > > > > constraints/limitations of the proposed approach, TODOs, etc?
> > > > > > 
> > > > > > Not sure we should keep this at IEP until list (#4 from original
> > > > 
> > > > letter)
> > > > > > is not confirmed.
> > > > > > 
> > > > > > Currently, I'm checking RunAll with RR enabled to almost each get
> > > > 
> > > > request.
> > > > > > "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
> > > > > > For now I have 60 failed tests and amount decreasing.
> > > > > > 
> > > > > > > > For instance, I would like to see all these limitations on the 
> > > > > > > > IEP
> > > > > > 
> > > > > > page as
> > > > > > > > JIRA tickets. Perhaps, it would be good to create an 
> > > > > > > > epic/umbrella
> > > > > > 
> > > > > > ticket
> > > > > > > > in order to track all activities related to `Read Repair` 
> > > > > > > > feature.
> > > > > > 
> > > > > > Let's do this at merge day to be sure useless issues will not be
> > > > 
> > > > created.
> > > > > > 
> > > > > > On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > > > slava.kopti...@gmail.com> wrote:
> > > > > > 
> > > > > > > Hi Anton,
> > > > > > > 
> > > > > > > I will take a look at your pull request if you don't mind.
> > > > > > > 
> > > > > > > In any way, could you please update the IEP page with the list of
> > > > > > > constraints/limitations of the proposed approach, TODOs, etc?
> > > > > > > For instance, I would like to see all these limitations on the IEP
> > > > 
> > > > page
> > > > > > > as
> > > > > > > JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> > > > 
> > > > ticket
> > > > > > > in order to track all activities related to `Read Repair` feature.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > S.
> > > > > > > 
> > > > > > > чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <a...@apache.org>:
> > > > > > > 
> > > > > > > > Igniters,
> > > > > > > > I'm glad to introduce Read Repair feature [0] provides 
> > > > > > > > additional
> > > > > > > > consistency guarantee for Ignite.
> > > > > > > > 
> > > > > > > > 1) Why we need it?
> > > > > > > > The detailed explanation can be found at IEP-31 [1].
> > > > > > > > In short, because of bugs, it's possible to gain an inconsistent
> > > > 
> > > > state.
> > > > > > > > We need additional features to handle this case.
> > > > > > > > 
> > > > > > > > Currently we able to check cluster using Idle_verify [2] 
> > > > > > > > feature,
> > > > 
> > > > but
> > > > > > > it
> > > > > > > > will not fix the data, will not even tell which entries are
> > 
> > broken.
> > > > > > > > Read Repair is a feature to understand which entries are broken
> > 
> > and
> > > > to
> > > > > > > fix
> > > > > > > > them.
> > > > > > > > 
> > > > > > > > 1) How it works?
> > > > > > > > IgniteCache now able to provide special proxy [3]
> > 
> > withReadRepair().
> > > > > > > > This proxy guarantee that data will be gained from all owners 
> > > > > > > > and
> > > > > > > 
> > > > > > > compared.
> > > > > > > > In the case of consistency violation situation, data will be
> > > > 
> > > > recovered
> > > > > > > and
> > > > > > > > a special event recorded.
> > > > > > > > 
> > > > > > > > 3) Naming?
> > > > > > > > Feature name based on Cassandra's Read Repair feature [4], which
> > 
> > is
> > > > > > > pretty
> > > > > > > > similar.
> > > > > > > > 
> > > > > > > > 4) Limitations which can be fixed in the future?
> > > > > > > >   * MVCC and Near caches are not supported.
> > > > > > > >   * Atomic caches can be checked (false positive case is 
> > > > > > > > possible
> > 
> > on
> > > > > > > this
> > > > > > > > check), but can't be recovered.
> > > > > > > >   * Partial entry removal can't be recovered.
> > > > > > > >   * Entries streamed using data streamer (using not a 
> > > > > > > > "cache.put"
> > > > 
> > > > based
> > > > > > > > updater) and loaded by cache.load
> > > > > > > >   are perceived as inconsistent since they may have different
> > > > 
> > > > versions
> > > > > > > for
> > > > > > > > same keys.
> > > > > > > >   * Only explicit get operations are supported (getAndReplace,
> > > > > > > 
> > > > > > > getAndPut,
> > > > > > > > etc can be supported in future).
> > > > > > > > 
> > > > > > > > 5) What's left?
> > > > > > > >   * SQL/ThinClient/etc support.
> > > > > > > >   * Metrics (found/repaired).
> > > > > > > >   * Simple per-partition recovery feature able to work in the
> > > > > > > 
> > > > > > > background in
> > > > > > > > addition to per-entry recovery feature.
> > > > > > > > 
> > > > > > > > 6) Is code checked?
> > > > > > > >   * Pull Request #5656 [5] (feature) - has green TC.
> > > > > > > >   * Pull Request #6575 [6] (RunAll with the feature enabled for
> > > > 
> > > > every
> > > > > > > get()
> > > > > > > > request) - has a limited amount of failures (because of data
> > > > 
> > > > streamer,
> > > > > > > > cache.load, etc).
> > > > > > > > 
> > > > > > > > Thoughts?
> > > > > > > > 
> > > > > > > > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > > > > > > [1]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > > > > > [2]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > > > > > [3]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > > > > > [4]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > > > > > [5] https://github.com/apache/ignite/pull/5656
> > > > > > > > [6] https://github.com/apache/ignite/pull/6575
> > > > > > > >

signature.asc
Description: This is a digitally signed message part

Re: Read Repair (ex. Consistency Check) - review request #2

Reply via email to