Hi Anton, Well, the ReadRepair feature is finally merged and that is good news :)
Unfortunately, I cannot find a consensus about the whole functionality in any of these topics: - http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html - http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html Also, there are no comments/discussion in JIRA. That makes me sad :( especially when a feature is huge, not obvious and involves changing public API (and that is the case, I think). Anyway, I left some comments in your pull-request at github. Please take a look. The most annoying thing is poorly documented code :( By the way, is it required to add test related to fail-over scenarios? I just checked, the IEP page and I still cannot find Jira tickets that should cover existing limitations/improvements. I would suggest creating the following tasks, at least: - MVCC support - Near caches - Additional metrics (number of violations, number of repaired entries etc) - Ignite C++ (It looks like, .Net is already aware of that feature) - Thin clients support - Perhaps, it would be useful to support different strategies to resolve inconsistencies What do you think? Thanks, S. ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <a...@apache.org>: > Folks, > > Thanks to everyone for tips and reviews. > Yardstick checked, no performance drop found. > Additional measurement: RR get() is just up to 7% slower than regular get > on real benchmarks (8 clients, 4 servers, 3 backups) > Code merged to the master. > "Must have" tasks created and attached to the IEP. > > On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <a...@apache.org> wrote: > > > Folks, > > > > Just a minor update. > > > > RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests > > left, started with 6k :)). > > During the analisys, I've found some tests with > > - unexpected repairs at tx caches > > - inconsistent state after the test finished (different entries across > the > > topology) > > For example, > > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete > > versions on backups in case of retry, fixed [2] > > - initial cache load generates not equal versions on backups, fixed [3] > > - testAccountTxNodeRestart causes unexpected repairs (entries have > > different versions), to be investigated. > > > > What's next? > > > > 1) Going to merge the solution once "RunAll with ReadRepair enabled" > > becomes fully green. > > 2) Going to add special check after each test which will ensure caches > > content after the test is consistent. > > 2.1) The Same check can (should?) be injected to > > awaitPartitionMapExchange() and similar methods. > > 3) Update Jepsen tests with RR checks. > > 4) Introduce per partition RR. > > > > So, the final goal is to be sure that Ignite produces only consistent > data > > and to have a feature to solve consistency in case we gain inconsistent > > state somehow. > > > > Limitations? > > > > Currently, RR has some limitations, but they are not related to real > > production cases. > > In case someone interested to support, for example, MVCC or near caches, > > please, feel free to contribute. > > > > [1] > > > https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest > > [2] > > > https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c > > [3] > > > https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13 > > > > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <a...@apache.org> wrote: > > > >> Slava, > >> > >> >> I will take a look at your pull request if you don't mind. > >> Great news! > >> > >> >> In any way, could you please update the IEP page with the list of > >> >> constraints/limitations of the proposed approach, TODOs, etc? > >> Not sure we should keep this at IEP until list (#4 from original letter) > >> is not confirmed. > >> > >> Currently, I'm checking RunAll with RR enabled to almost each get > request. > >> "Almost" means: readRepair = !ctx.readThrough() && > >> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled() > >> For now I have 60 failed tests and amount decreasing. > >> > >> >> For instance, I would like to see all these limitations on the IEP > >> page as > >> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella > >> ticket > >> >> in order to track all activities related to `Read Repair` feature. > >> Let's do this at merge day to be sure useless issues will not be > created. > >> > >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин < > >> slava.kopti...@gmail.com> wrote: > >> > >>> Hi Anton, > >>> > >>> I will take a look at your pull request if you don't mind. > >>> > >>> In any way, could you please update the IEP page with the list of > >>> constraints/limitations of the proposed approach, TODOs, etc? > >>> For instance, I would like to see all these limitations on the IEP page > >>> as > >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella > ticket > >>> in order to track all activities related to `Read Repair` feature. > >>> > >>> Thanks, > >>> S. > >>> > >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <a...@apache.org>: > >>> > >>> > Igniters, > >>> > I'm glad to introduce Read Repair feature [0] provides additional > >>> > consistency guarantee for Ignite. > >>> > > >>> > 1) Why we need it? > >>> > The detailed explanation can be found at IEP-31 [1]. > >>> > In short, because of bugs, it's possible to gain an inconsistent > state. > >>> > We need additional features to handle this case. > >>> > > >>> > Currently we able to check cluster using Idle_verify [2] feature, but > >>> it > >>> > will not fix the data, will not even tell which entries are broken. > >>> > Read Repair is a feature to understand which entries are broken and > to > >>> fix > >>> > them. > >>> > > >>> > 1) How it works? > >>> > IgniteCache now able to provide special proxy [3] withReadRepair(). > >>> > This proxy guarantee that data will be gained from all owners and > >>> compared. > >>> > In the case of consistency violation situation, data will be > recovered > >>> and > >>> > a special event recorded. > >>> > > >>> > 3) Naming? > >>> > Feature name based on Cassandra's Read Repair feature [4], which is > >>> pretty > >>> > similar. > >>> > > >>> > 4) Limitations which can be fixed in the future? > >>> > * MVCC and Near caches are not supported. > >>> > * Atomic caches can be checked (false positive case is possible on > >>> this > >>> > check), but can't be recovered. > >>> > * Partial entry removal can't be recovered. > >>> > * Entries streamed using data streamer (using not a "cache.put" > based > >>> > updater) and loaded by cache.load > >>> > are perceived as inconsistent since they may have different > versions > >>> for > >>> > same keys. > >>> > * Only explicit get operations are supported (getAndReplace, > >>> getAndPut, > >>> > etc can be supported in future). > >>> > > >>> > 5) What's left? > >>> > * SQL/ThinClient/etc support. > >>> > * Metrics (found/repaired). > >>> > * Simple per-partition recovery feature able to work in the > >>> background in > >>> > addition to per-entry recovery feature. > >>> > > >>> > 6) Is code checked? > >>> > * Pull Request #5656 [5] (feature) - has green TC. > >>> > * Pull Request #6575 [6] (RunAll with the feature enabled for every > >>> get() > >>> > request) - has a limited amount of failures (because of data > streamer, > >>> > cache.load, etc). > >>> > > >>> > Thoughts? > >>> > > >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663 > >>> > [1] > >>> > > >>> > > >>> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix > >>> > [2] > >>> > > >>> > > >>> > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums > >>> > [3] > >>> > > >>> > > >>> > https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140 > >>> > [4] > >>> > > >>> > > >>> > https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html > >>> > [5] https://github.com/apache/ignite/pull/5656 > >>> > [6] https://github.com/apache/ignite/pull/6575 > >>> > > >>> > >> >