Nikolay, Initial idea is to count fixes per cache. In other words, event recording should cause metric increment. Could you help me with RR metrics implementation as a part of your metrics journey?
On Thu, Jul 11, 2019 at 10:18 AM Nikolay Izhikov <nizhi...@apache.org> wrote: > Anton, > > > - metrics are better for monitoring, but the Event is enough for my wish > to cover AI with consistency check, > > Can you clarify, do you have plans to add metrics of RR events? > > I think it should be count of incosistency events per cache(maybe per > partition). > > > В Чт, 11/07/2019 в 09:53 +0300, Anton Vinogradov пишет: > > Slava, > > > > Thanks for your review first! > > > > > > Anyway, I left some comments in your pull-request at github. Please > take > > > > a > > > > look. The most annoying thing is poorly documented code :( > > > > Since your comments mostly about Javadoc (does this mean that my solution > > is so great that you ask me only to fix Javadocs :) ?), > > I'd like to propose you to provide fixes as a PR since you have a vision > of > > how it should be made. I'll review them and merge shortly. > > > > > > By the way, is it required to add test related to fail-over > scenarios? > > > > The best check is to use RR at real code. > > For example, I'm injecting RR now to the test with concurrent > modifications > > and restarts [1]. > > > > > > I just checked, the IEP page and I still cannot find Jira tickets > that > > > > should cover existing limitations/improvements. > > > > I would suggest creating the following tasks, at least: > > > > Mostly agree with you, but > > - MVCC is not production ready, > > - not sure near support really required, > > - metrics are better for monitoring, but the Event is enough for my wish > to > > cover AI with consistency check, > > - do we really need Platforms and Thin Client support? > > Also, we should not produce stillborn issue. > > All limitations listed at proxy creation method and they definitely are > not > > showstoppers and may be fixed later if someone interested. > > Сoming back to AI 3.0 discussion, we have A LOT of features and it's > almost > > impossible (require much more time that feature's cost) to support them > all. > > I will be pretty happy in case someone will do this and provide help if > > necessary! > > > > > > Perhaps, it would be a good idea to think about the recovery too > > > > Do you mean per partition check and recovery? > > That's a good idea, but I found it's not easy to imagine API to for such > > tool. > > In case you ready to assist with proper API/design this will definitely > > help. > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973 > > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин < > slava.kopti...@gmail.com> > > wrote: > > > > > Perhaps, it would be a good idea to think about the recovery tool/ > > > control-utility command that will allow achieving the same goal. > > > If I am not mistaken it was already proposed in the email thread. > > > > > > Thanks, > > > S. > > > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин < > slava.kopti...@gmail.com>: > > > > > > > Hi Anton, > > > > > > > > Well, the ReadRepair feature is finally merged and that is good news > :) > > > > > > > > Unfortunately, I cannot find a consensus about the whole > functionality in > > > > any of these topics: > > > > - > > > > > > > > > > > http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html > > > > - > > > > > > > > > > > http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html > > > > Also, there are no comments/discussion in JIRA. That makes me sad :( > > > > especially when a feature is huge, not obvious and involves changing > > > > > > public > > > > API (and that is the case, I think). > > > > > > > > Anyway, I left some comments in your pull-request at github. Please > take > > > > > > a > > > > look. The most annoying thing is poorly documented code :( > > > > By the way, is it required to add test related to fail-over > scenarios? > > > > > > > > I just checked, the IEP page and I still cannot find Jira tickets > that > > > > should cover existing limitations/improvements. > > > > I would suggest creating the following tasks, at least: > > > > - MVCC support > > > > - Near caches > > > > - Additional metrics (number of violations, number of repaired > entries > > > > etc) > > > > - Ignite C++ (It looks like, .Net is already aware of that feature) > > > > - Thin clients support > > > > - Perhaps, it would be useful to support different strategies to > resolve > > > > inconsistencies > > > > > > > > What do you think? > > > > > > > > Thanks, > > > > S. > > > > > > > > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <a...@apache.org>: > > > > > > > > > Folks, > > > > > > > > > > Thanks to everyone for tips and reviews. > > > > > Yardstick checked, no performance drop found. > > > > > Additional measurement: RR get() is just up to 7% slower than > regular > > > > > > get > > > > > on real benchmarks (8 clients, 4 servers, 3 backups) > > > > > Code merged to the master. > > > > > "Must have" tasks created and attached to the IEP. > > > > > > > > > > On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <a...@apache.org> > wrote: > > > > > > > > > > > Folks, > > > > > > > > > > > > Just a minor update. > > > > > > > > > > > > RunAll [1] with enabled ReadRepair proxy is almost green now (~10 > > > > > > tests > > > > > > left, started with 6k :)). > > > > > > During the analisys, I've found some tests with > > > > > > - unexpected repairs at tx caches > > > > > > - inconsistent state after the test finished (different entries > across > > > > > > > > > > the > > > > > > topology) > > > > > > For example, > > > > > > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates > obsolete > > > > > > versions on backups in case of retry, fixed [2] > > > > > > - initial cache load generates not equal versions on backups, > fixed > > > > > > [3] > > > > > > - testAccountTxNodeRestart causes unexpected repairs (entries > have > > > > > > different versions), to be investigated. > > > > > > > > > > > > What's next? > > > > > > > > > > > > 1) Going to merge the solution once "RunAll with ReadRepair > enabled" > > > > > > becomes fully green. > > > > > > 2) Going to add special check after each test which will ensure > caches > > > > > > content after the test is consistent. > > > > > > 2.1) The Same check can (should?) be injected to > > > > > > awaitPartitionMapExchange() and similar methods. > > > > > > 3) Update Jepsen tests with RR checks. > > > > > > 4) Introduce per partition RR. > > > > > > > > > > > > So, the final goal is to be sure that Ignite produces only > consistent > > > > > > > > > > data > > > > > > and to have a feature to solve consistency in case we gain > > > > > > inconsistent > > > > > > state somehow. > > > > > > > > > > > > Limitations? > > > > > > > > > > > > Currently, RR has some limitations, but they are not related to > real > > > > > > production cases. > > > > > > In case someone interested to support, for example, MVCC or near > > > > > > caches, > > > > > > please, feel free to contribute. > > > > > > > > > > > > [1] > > > > > > > > > > > > > https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest > > > > > > [2] > > > > > > > > > > > > > https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c > > > > > > [3] > > > > > > > > > > > > > https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13 > > > > > > > > > > > > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <a...@apache.org> > > > > > > wrote: > > > > > > > > > > > > > Slava, > > > > > > > > > > > > > > > > I will take a look at your pull request if you don't mind. > > > > > > > > > > > > > > Great news! > > > > > > > > > > > > > > > > In any way, could you please update the IEP page with the > list of > > > > > > > > > constraints/limitations of the proposed approach, TODOs, > etc? > > > > > > > > > > > > > > Not sure we should keep this at IEP until list (#4 from > original > > > > > > > > > > letter) > > > > > > > is not confirmed. > > > > > > > > > > > > > > Currently, I'm checking RunAll with RR enabled to almost each > get > > > > > > > > > > request. > > > > > > > "Almost" means: readRepair = !ctx.readThrough() && > > > > > > > ctx.config().getBackups() > 0 && !ctx.isNear() && > !ctx.mvccEnabled() > > > > > > > For now I have 60 failed tests and amount decreasing. > > > > > > > > > > > > > > > > For instance, I would like to see all these limitations on > the IEP > > > > > > > > > > > > > > page as > > > > > > > > > JIRA tickets. Perhaps, it would be good to create an > epic/umbrella > > > > > > > > > > > > > > ticket > > > > > > > > > in order to track all activities related to `Read Repair` > feature. > > > > > > > > > > > > > > Let's do this at merge day to be sure useless issues will not > be > > > > > > > > > > created. > > > > > > > > > > > > > > On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин < > > > > > > > slava.kopti...@gmail.com> wrote: > > > > > > > > > > > > > > > Hi Anton, > > > > > > > > > > > > > > > > I will take a look at your pull request if you don't mind. > > > > > > > > > > > > > > > > In any way, could you please update the IEP page with the > list of > > > > > > > > constraints/limitations of the proposed approach, TODOs, etc? > > > > > > > > For instance, I would like to see all these limitations on > the IEP > > > > > > > > > > page > > > > > > > > as > > > > > > > > JIRA tickets. Perhaps, it would be good to create an > epic/umbrella > > > > > > > > > > ticket > > > > > > > > in order to track all activities related to `Read Repair` > feature. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > S. > > > > > > > > > > > > > > > > чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <a...@apache.org > >: > > > > > > > > > > > > > > > > > Igniters, > > > > > > > > > I'm glad to introduce Read Repair feature [0] provides > additional > > > > > > > > > consistency guarantee for Ignite. > > > > > > > > > > > > > > > > > > 1) Why we need it? > > > > > > > > > The detailed explanation can be found at IEP-31 [1]. > > > > > > > > > In short, because of bugs, it's possible to gain an > inconsistent > > > > > > > > > > state. > > > > > > > > > We need additional features to handle this case. > > > > > > > > > > > > > > > > > > Currently we able to check cluster using Idle_verify [2] > feature, > > > > > > > > > > but > > > > > > > > it > > > > > > > > > will not fix the data, will not even tell which entries are > > > > > > broken. > > > > > > > > > Read Repair is a feature to understand which entries are > broken > > > > > > and > > > > > to > > > > > > > > fix > > > > > > > > > them. > > > > > > > > > > > > > > > > > > 1) How it works? > > > > > > > > > IgniteCache now able to provide special proxy [3] > > > > > > withReadRepair(). > > > > > > > > > This proxy guarantee that data will be gained from all > owners and > > > > > > > > > > > > > > > > compared. > > > > > > > > > In the case of consistency violation situation, data will > be > > > > > > > > > > recovered > > > > > > > > and > > > > > > > > > a special event recorded. > > > > > > > > > > > > > > > > > > 3) Naming? > > > > > > > > > Feature name based on Cassandra's Read Repair feature [4], > which > > > > > > is > > > > > > > > pretty > > > > > > > > > similar. > > > > > > > > > > > > > > > > > > 4) Limitations which can be fixed in the future? > > > > > > > > > * MVCC and Near caches are not supported. > > > > > > > > > * Atomic caches can be checked (false positive case is > possible > > > > > > on > > > > > > > > this > > > > > > > > > check), but can't be recovered. > > > > > > > > > * Partial entry removal can't be recovered. > > > > > > > > > * Entries streamed using data streamer (using not a > "cache.put" > > > > > > > > > > based > > > > > > > > > updater) and loaded by cache.load > > > > > > > > > are perceived as inconsistent since they may have > different > > > > > > > > > > versions > > > > > > > > for > > > > > > > > > same keys. > > > > > > > > > * Only explicit get operations are supported > (getAndReplace, > > > > > > > > > > > > > > > > getAndPut, > > > > > > > > > etc can be supported in future). > > > > > > > > > > > > > > > > > > 5) What's left? > > > > > > > > > * SQL/ThinClient/etc support. > > > > > > > > > * Metrics (found/repaired). > > > > > > > > > * Simple per-partition recovery feature able to work in > the > > > > > > > > > > > > > > > > background in > > > > > > > > > addition to per-entry recovery feature. > > > > > > > > > > > > > > > > > > 6) Is code checked? > > > > > > > > > * Pull Request #5656 [5] (feature) - has green TC. > > > > > > > > > * Pull Request #6575 [6] (RunAll with the feature > enabled for > > > > > > > > > > every > > > > > > > > get() > > > > > > > > > request) - has a limited amount of failures (because of > data > > > > > > > > > > streamer, > > > > > > > > > cache.load, etc). > > > > > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > > > > > [0] https://issues.apache.org/jira/browse/IGNITE-10663 > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix > > > > > > > > > [2] > > > > > > > > > > > > > > > > > > > > > > > > > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums > > > > > > > > > [3] > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140 > > > > > > > > > [4] > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html > > > > > > > > > [5] https://github.com/apache/ignite/pull/5656 > > > > > > > > > [6] https://github.com/apache/ignite/pull/6575 > > > > > > > > > >