Ok, thank you пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <nizhi...@apache.org>:
> I did the review. > > пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dpav...@apache.org>: > > > Igniters, who did a review of > > https://issues.apache.org/jira/browse/IGNITE-10663 before the merge? > I've > > checked both PR https://github.com/apache/ignite/pull/5656 and Issue, > > and dev.list thread and didn't find any LGTM. > > > > Anton, since you've rejected lazy consensus in our process, we have RTC > in > > that (core) module. So I'd like to know if the fix was covered by the > > review. > > > > Because you're a committer, a reviewer can be non-committer. So, who was > a > > reviewer? Or was process ignored? > > > > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <slava.kopti...@gmail.com > >: > > > > > Hello Anton, > > > > > > > I'd like to propose you to provide fixes as a PR since you have a > > vision > > > of how it should be made. I'll review them and merge shortly. > > > Could you please take a look at PR: > > > https://github.com/apache/ignite/pull/6689 > > > > > > > Since your comments mostly about Javadoc (does this mean that my > > solution > > > is so great that you ask me only to fix Javadocs :) ?), > > > In my humble opinion, I would consider this feature as experimental one > > (It > > > does not seem production-ready). > > > Let me clarify this with the following simple example: > > > > > > try { > > > atomicCache.withReadRepair().getAll(keys); > > > } > > > catch (CacheException e) { > > > // What should be done here from the end-user point of view? > > > } > > > > > > 1. Should I consider that my cluster is broken? There is no answer! The > > > false-positive result is possible. > > > 2. What should be done here in order to check/resolve the issue? > > Perhaps, I > > > should restart a node (which one?), restart the whole cluster, put a > new > > > value... > > > 3. IgniteConsistencyViolationException is absolutely useless. It does > not > > > provide any information about the issue and possible way to fix it. > > > > > > It seems that transactional caches are covered much better. > > > > > > > Mostly agree with you, but > > > > - MVCC is not production ready, > > > > - not sure near support really required, > > > > - metrics are better for monitoring, but the Event is enough for my > > wish > > > to > > > > cover AI with consistency check, > > > > - do we really need Platforms and Thin Client support? > > > Well, near caches are widely used and fully transactional, so I think > it > > > makes sense to support the feature for near caches too. > > > .Net is already aware of 'ReadRepair'. It seems to me, that it can be > > > easily supported for C++. I don't see a reason why it should not be > done > > :) > > > > > > > Do you mean per partition check and recovery? That's a good idea, > but I > > > found it's not easy to imagine API to for such tool. > > > Yep, perhaps it can be done on the idle cluster via `idle-verify` > command > > > with additional flag. Agreed, that this approach is not the best one > > > definitely. > > > > > > Thanks, > > > S. > > > > > > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <a...@apache.org>: > > > > > > > Slava, > > > > > > > > Thanks for your review first! > > > > > > > > >> Anyway, I left some comments in your pull-request at github. > Please > > > take > > > > a > > > > >> look. The most annoying thing is poorly documented code :( > > > > Since your comments mostly about Javadoc (does this mean that my > > solution > > > > is so great that you ask me only to fix Javadocs :) ?), > > > > I'd like to propose you to provide fixes as a PR since you have a > > vision > > > of > > > > how it should be made. I'll review them and merge shortly. > > > > > > > > >> By the way, is it required to add test related to fail-over > > scenarios? > > > > The best check is to use RR at real code. > > > > For example, I'm injecting RR now to the test with concurrent > > > modifications > > > > and restarts [1]. > > > > > > > > >> I just checked, the IEP page and I still cannot find Jira tickets > > that > > > > >> should cover existing limitations/improvements. > > > > >> I would suggest creating the following tasks, at least: > > > > Mostly agree with you, but > > > > - MVCC is not production ready, > > > > - not sure near support really required, > > > > - metrics are better for monitoring, but the Event is enough for my > > wish > > > to > > > > cover AI with consistency check, > > > > - do we really need Platforms and Thin Client support? > > > > Also, we should not produce stillborn issue. > > > > All limitations listed at proxy creation method and they definitely > are > > > not > > > > showstoppers and may be fixed later if someone interested. > > > > Сoming back to AI 3.0 discussion, we have A LOT of features and it's > > > almost > > > > impossible (require much more time that feature's cost) to support > them > > > > all. > > > > I will be pretty happy in case someone will do this and provide help > if > > > > necessary! > > > > > > > > >> Perhaps, it would be a good idea to think about the recovery too > > > > Do you mean per partition check and recovery? > > > > That's a good idea, but I found it's not easy to imagine API to for > > such > > > > tool. > > > > In case you ready to assist with proper API/design this will > definitely > > > > help. > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973 > > > > > > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин < > > > > slava.kopti...@gmail.com> > > > > wrote: > > > > > > > > > Perhaps, it would be a good idea to think about the recovery tool/ > > > > > control-utility command that will allow achieving the same goal. > > > > > If I am not mistaken it was already proposed in the email thread. > > > > > > > > > > Thanks, > > > > > S. > > > > > > > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин < > > > slava.kopti...@gmail.com > > > > >: > > > > > > > > > > > Hi Anton, > > > > > > > > > > > > Well, the ReadRepair feature is finally merged and that is good > > news > > > :) > > > > > > > > > > > > Unfortunately, I cannot find a consensus about the whole > > > functionality > > > > in > > > > > > any of these topics: > > > > > > - > > > > > > > > > > > > > > > > > > > > > http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html > > > > > > - > > > > > > > > > > > > > > > > > > > > > http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html > > > > > > Also, there are no comments/discussion in JIRA. That makes me sad > > :( > > > > > > especially when a feature is huge, not obvious and involves > > changing > > > > > public > > > > > > API (and that is the case, I think). > > > > > > > > > > > > Anyway, I left some comments in your pull-request at github. > Please > > > > take > > > > > a > > > > > > look. The most annoying thing is poorly documented code :( > > > > > > By the way, is it required to add test related to fail-over > > > scenarios? > > > > > > > > > > > > I just checked, the IEP page and I still cannot find Jira tickets > > > that > > > > > > should cover existing limitations/improvements. > > > > > > I would suggest creating the following tasks, at least: > > > > > > - MVCC support > > > > > > - Near caches > > > > > > - Additional metrics (number of violations, number of repaired > > > entries > > > > > > etc) > > > > > > - Ignite C++ (It looks like, .Net is already aware of that > > feature) > > > > > > - Thin clients support > > > > > > - Perhaps, it would be useful to support different strategies to > > > > resolve > > > > > > inconsistencies > > > > > > > > > > > > What do you think? > > > > > > > > > > > > Thanks, > > > > > > S. > > > > > > > > > > > > > > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <a...@apache.org>: > > > > > > > > > > > >> Folks, > > > > > >> > > > > > >> Thanks to everyone for tips and reviews. > > > > > >> Yardstick checked, no performance drop found. > > > > > >> Additional measurement: RR get() is just up to 7% slower than > > > regular > > > > > get > > > > > >> on real benchmarks (8 clients, 4 servers, 3 backups) > > > > > >> Code merged to the master. > > > > > >> "Must have" tasks created and attached to the IEP. > > > > > >> > > > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <a...@apache.org > > > > > > wrote: > > > > > >> > > > > > >> > Folks, > > > > > >> > > > > > > >> > Just a minor update. > > > > > >> > > > > > > >> > RunAll [1] with enabled ReadRepair proxy is almost green now > > (~10 > > > > > tests > > > > > >> > left, started with 6k :)). > > > > > >> > During the analisys, I've found some tests with > > > > > >> > - unexpected repairs at tx caches > > > > > >> > - inconsistent state after the test finished (different > entries > > > > across > > > > > >> the > > > > > >> > topology) > > > > > >> > For example, > > > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates > > > > obsolete > > > > > >> > versions on backups in case of retry, fixed [2] > > > > > >> > - initial cache load generates not equal versions on backups, > > > fixed > > > > > [3] > > > > > >> > - testAccountTxNodeRestart causes unexpected repairs (entries > > have > > > > > >> > different versions), to be investigated. > > > > > >> > > > > > > >> > What's next? > > > > > >> > > > > > > >> > 1) Going to merge the solution once "RunAll with ReadRepair > > > enabled" > > > > > >> > becomes fully green. > > > > > >> > 2) Going to add special check after each test which will > ensure > > > > caches > > > > > >> > content after the test is consistent. > > > > > >> > 2.1) The Same check can (should?) be injected to > > > > > >> > awaitPartitionMapExchange() and similar methods. > > > > > >> > 3) Update Jepsen tests with RR checks. > > > > > >> > 4) Introduce per partition RR. > > > > > >> > > > > > > >> > So, the final goal is to be sure that Ignite produces only > > > > consistent > > > > > >> data > > > > > >> > and to have a feature to solve consistency in case we gain > > > > > inconsistent > > > > > >> > state somehow. > > > > > >> > > > > > > >> > Limitations? > > > > > >> > > > > > > >> > Currently, RR has some limitations, but they are not related > to > > > real > > > > > >> > production cases. > > > > > >> > In case someone interested to support, for example, MVCC or > near > > > > > caches, > > > > > >> > please, feel free to contribute. > > > > > >> > > > > > > >> > [1] > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest > > > > > >> > [2] > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c > > > > > >> > [3] > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13 > > > > > >> > > > > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov < > a...@apache.org > > > > > > > > wrote: > > > > > >> > > > > > > >> >> Slava, > > > > > >> >> > > > > > >> >> >> I will take a look at your pull request if you don't mind. > > > > > >> >> Great news! > > > > > >> >> > > > > > >> >> >> In any way, could you please update the IEP page with the > > list > > > > of > > > > > >> >> >> constraints/limitations of the proposed approach, TODOs, > > etc? > > > > > >> >> Not sure we should keep this at IEP until list (#4 from > > original > > > > > >> letter) > > > > > >> >> is not confirmed. > > > > > >> >> > > > > > >> >> Currently, I'm checking RunAll with RR enabled to almost each > > get > > > > > >> request. > > > > > >> >> "Almost" means: readRepair = !ctx.readThrough() && > > > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() && > > > > !ctx.mvccEnabled() > > > > > >> >> For now I have 60 failed tests and amount decreasing. > > > > > >> >> > > > > > >> >> >> For instance, I would like to see all these limitations on > > the > > > > IEP > > > > > >> >> page as > > > > > >> >> >> JIRA tickets. Perhaps, it would be good to create an > > > > epic/umbrella > > > > > >> >> ticket > > > > > >> >> >> in order to track all activities related to `Read Repair` > > > > feature. > > > > > >> >> Let's do this at merge day to be sure useless issues will not > > be > > > > > >> created. > > > > > >> >> > > > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин < > > > > > >> >> slava.kopti...@gmail.com> wrote: > > > > > >> >> > > > > > >> >>> Hi Anton, > > > > > >> >>> > > > > > >> >>> I will take a look at your pull request if you don't mind. > > > > > >> >>> > > > > > >> >>> In any way, could you please update the IEP page with the > list > > > of > > > > > >> >>> constraints/limitations of the proposed approach, TODOs, > etc? > > > > > >> >>> For instance, I would like to see all these limitations on > the > > > IEP > > > > > >> page > > > > > >> >>> as > > > > > >> >>> JIRA tickets. Perhaps, it would be good to create an > > > epic/umbrella > > > > > >> ticket > > > > > >> >>> in order to track all activities related to `Read Repair` > > > feature. > > > > > >> >>> > > > > > >> >>> Thanks, > > > > > >> >>> S. > > > > > >> >>> > > > > > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov < > a...@apache.org > > >: > > > > > >> >>> > > > > > >> >>> > Igniters, > > > > > >> >>> > I'm glad to introduce Read Repair feature [0] provides > > > > additional > > > > > >> >>> > consistency guarantee for Ignite. > > > > > >> >>> > > > > > > >> >>> > 1) Why we need it? > > > > > >> >>> > The detailed explanation can be found at IEP-31 [1]. > > > > > >> >>> > In short, because of bugs, it's possible to gain an > > > inconsistent > > > > > >> state. > > > > > >> >>> > We need additional features to handle this case. > > > > > >> >>> > > > > > > >> >>> > Currently we able to check cluster using Idle_verify [2] > > > > feature, > > > > > >> but > > > > > >> >>> it > > > > > >> >>> > will not fix the data, will not even tell which entries > are > > > > > broken. > > > > > >> >>> > Read Repair is a feature to understand which entries are > > > broken > > > > > and > > > > > >> to > > > > > >> >>> fix > > > > > >> >>> > them. > > > > > >> >>> > > > > > > >> >>> > 1) How it works? > > > > > >> >>> > IgniteCache now able to provide special proxy [3] > > > > > withReadRepair(). > > > > > >> >>> > This proxy guarantee that data will be gained from all > > owners > > > > and > > > > > >> >>> compared. > > > > > >> >>> > In the case of consistency violation situation, data will > be > > > > > >> recovered > > > > > >> >>> and > > > > > >> >>> > a special event recorded. > > > > > >> >>> > > > > > > >> >>> > 3) Naming? > > > > > >> >>> > Feature name based on Cassandra's Read Repair feature [4], > > > which > > > > > is > > > > > >> >>> pretty > > > > > >> >>> > similar. > > > > > >> >>> > > > > > > >> >>> > 4) Limitations which can be fixed in the future? > > > > > >> >>> > * MVCC and Near caches are not supported. > > > > > >> >>> > * Atomic caches can be checked (false positive case is > > > > possible > > > > > on > > > > > >> >>> this > > > > > >> >>> > check), but can't be recovered. > > > > > >> >>> > * Partial entry removal can't be recovered. > > > > > >> >>> > * Entries streamed using data streamer (using not a > > > > "cache.put" > > > > > >> based > > > > > >> >>> > updater) and loaded by cache.load > > > > > >> >>> > are perceived as inconsistent since they may have > > different > > > > > >> versions > > > > > >> >>> for > > > > > >> >>> > same keys. > > > > > >> >>> > * Only explicit get operations are supported > > (getAndReplace, > > > > > >> >>> getAndPut, > > > > > >> >>> > etc can be supported in future). > > > > > >> >>> > > > > > > >> >>> > 5) What's left? > > > > > >> >>> > * SQL/ThinClient/etc support. > > > > > >> >>> > * Metrics (found/repaired). > > > > > >> >>> > * Simple per-partition recovery feature able to work in > > the > > > > > >> >>> background in > > > > > >> >>> > addition to per-entry recovery feature. > > > > > >> >>> > > > > > > >> >>> > 6) Is code checked? > > > > > >> >>> > * Pull Request #5656 [5] (feature) - has green TC. > > > > > >> >>> > * Pull Request #6575 [6] (RunAll with the feature > enabled > > > for > > > > > >> every > > > > > >> >>> get() > > > > > >> >>> > request) - has a limited amount of failures (because of > data > > > > > >> streamer, > > > > > >> >>> > cache.load, etc). > > > > > >> >>> > > > > > > >> >>> > Thoughts? > > > > > >> >>> > > > > > > >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663 > > > > > >> >>> > [1] > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > >> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix > > > > > >> >>> > [2] > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > >> > > > > > > > > > > > > > > > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums > > > > > >> >>> > [3] > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > >> > > > > > > > > > > > > > > > https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140 > > > > > >> >>> > [4] > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > >> > > > > > > > > > > > > > > > https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html > > > > > >> >>> > [5] https://github.com/apache/ignite/pull/5656 > > > > > >> >>> > [6] https://github.com/apache/ignite/pull/6575 > > > > > >> >>> > > > > > > >> >>> > > > > > >> >> > > > > > >> > > > > > > > > > > > > > > > > > > > > >