Folks, Thanks to everyone for tips and reviews. Yardstick checked, no performance drop found. Additional measurement: RR get() is just up to 7% slower than regular get on real benchmarks (8 clients, 4 servers, 3 backups) Code merged to the master. "Must have" tasks created and attached to the IEP.
On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <a...@apache.org> wrote: > Folks, > > Just a minor update. > > RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests > left, started with 6k :)). > During the analisys, I've found some tests with > - unexpected repairs at tx caches > - inconsistent state after the test finished (different entries across the > topology) > For example, > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete > versions on backups in case of retry, fixed [2] > - initial cache load generates not equal versions on backups, fixed [3] > - testAccountTxNodeRestart causes unexpected repairs (entries have > different versions), to be investigated. > > What's next? > > 1) Going to merge the solution once "RunAll with ReadRepair enabled" > becomes fully green. > 2) Going to add special check after each test which will ensure caches > content after the test is consistent. > 2.1) The Same check can (should?) be injected to > awaitPartitionMapExchange() and similar methods. > 3) Update Jepsen tests with RR checks. > 4) Introduce per partition RR. > > So, the final goal is to be sure that Ignite produces only consistent data > and to have a feature to solve consistency in case we gain inconsistent > state somehow. > > Limitations? > > Currently, RR has some limitations, but they are not related to real > production cases. > In case someone interested to support, for example, MVCC or near caches, > please, feel free to contribute. > > [1] > https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest > [2] > https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c > [3] > https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13 > > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <a...@apache.org> wrote: > >> Slava, >> >> >> I will take a look at your pull request if you don't mind. >> Great news! >> >> >> In any way, could you please update the IEP page with the list of >> >> constraints/limitations of the proposed approach, TODOs, etc? >> Not sure we should keep this at IEP until list (#4 from original letter) >> is not confirmed. >> >> Currently, I'm checking RunAll with RR enabled to almost each get request. >> "Almost" means: readRepair = !ctx.readThrough() && >> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled() >> For now I have 60 failed tests and amount decreasing. >> >> >> For instance, I would like to see all these limitations on the IEP >> page as >> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella >> ticket >> >> in order to track all activities related to `Read Repair` feature. >> Let's do this at merge day to be sure useless issues will not be created. >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин < >> slava.kopti...@gmail.com> wrote: >> >>> Hi Anton, >>> >>> I will take a look at your pull request if you don't mind. >>> >>> In any way, could you please update the IEP page with the list of >>> constraints/limitations of the proposed approach, TODOs, etc? >>> For instance, I would like to see all these limitations on the IEP page >>> as >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket >>> in order to track all activities related to `Read Repair` feature. >>> >>> Thanks, >>> S. >>> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <a...@apache.org>: >>> >>> > Igniters, >>> > I'm glad to introduce Read Repair feature [0] provides additional >>> > consistency guarantee for Ignite. >>> > >>> > 1) Why we need it? >>> > The detailed explanation can be found at IEP-31 [1]. >>> > In short, because of bugs, it's possible to gain an inconsistent state. >>> > We need additional features to handle this case. >>> > >>> > Currently we able to check cluster using Idle_verify [2] feature, but >>> it >>> > will not fix the data, will not even tell which entries are broken. >>> > Read Repair is a feature to understand which entries are broken and to >>> fix >>> > them. >>> > >>> > 1) How it works? >>> > IgniteCache now able to provide special proxy [3] withReadRepair(). >>> > This proxy guarantee that data will be gained from all owners and >>> compared. >>> > In the case of consistency violation situation, data will be recovered >>> and >>> > a special event recorded. >>> > >>> > 3) Naming? >>> > Feature name based on Cassandra's Read Repair feature [4], which is >>> pretty >>> > similar. >>> > >>> > 4) Limitations which can be fixed in the future? >>> > * MVCC and Near caches are not supported. >>> > * Atomic caches can be checked (false positive case is possible on >>> this >>> > check), but can't be recovered. >>> > * Partial entry removal can't be recovered. >>> > * Entries streamed using data streamer (using not a "cache.put" based >>> > updater) and loaded by cache.load >>> > are perceived as inconsistent since they may have different versions >>> for >>> > same keys. >>> > * Only explicit get operations are supported (getAndReplace, >>> getAndPut, >>> > etc can be supported in future). >>> > >>> > 5) What's left? >>> > * SQL/ThinClient/etc support. >>> > * Metrics (found/repaired). >>> > * Simple per-partition recovery feature able to work in the >>> background in >>> > addition to per-entry recovery feature. >>> > >>> > 6) Is code checked? >>> > * Pull Request #5656 [5] (feature) - has green TC. >>> > * Pull Request #6575 [6] (RunAll with the feature enabled for every >>> get() >>> > request) - has a limited amount of failures (because of data streamer, >>> > cache.load, etc). >>> > >>> > Thoughts? >>> > >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663 >>> > [1] >>> > >>> > >>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix >>> > [2] >>> > >>> > >>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums >>> > [3] >>> > >>> > >>> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140 >>> > [4] >>> > >>> > >>> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html >>> > [5] https://github.com/apache/ignite/pull/5656 >>> > [6] https://github.com/apache/ignite/pull/6575 >>> > >>> >>