Folks,

Thanks to everyone for tips and reviews.
Yardstick checked, no performance drop found.
Additional measurement: RR get() is just up to 7% slower than regular get
on real benchmarks (8 clients, 4 servers, 3 backups)
Code merged to the master.
"Must have" tasks created and attached to the IEP.

On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <a...@apache.org> wrote:

> Folks,
>
> Just a minor update.
>
> RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests
> left, started with 6k :)).
> During the analisys, I've found some tests with
> - unexpected repairs at tx caches
> - inconsistent state after the test finished (different entries across the
> topology)
> For example,
> - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
> versions on backups in case of retry, fixed [2]
> - initial cache load generates not equal versions on backups, fixed [3]
> - testAccountTxNodeRestart causes unexpected repairs (entries have
> different versions), to be investigated.
>
> What's next?
>
> 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> becomes fully green.
> 2) Going to add special check after each test which will ensure caches
> content after the test is consistent.
> 2.1) The Same check can (should?) be injected to
> awaitPartitionMapExchange() and similar methods.
> 3) Update Jepsen tests with RR checks.
> 4) Introduce per partition RR.
>
> So, the final goal is to be sure that Ignite produces only consistent data
> and to have a feature to solve consistency in case we gain inconsistent
> state somehow.
>
> Limitations?
>
> Currently, RR has some limitations, but they are not related to real
> production cases.
> In case someone interested to support, for example, MVCC or near caches,
> please, feel free to contribute.
>
> [1]
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> [2]
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> [3]
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
>
> On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <a...@apache.org> wrote:
>
>> Slava,
>>
>> >> I will take a look at your pull request if you don't mind.
>> Great news!
>>
>> >> In any way, could you please update the IEP page with the list of
>> >> constraints/limitations of the proposed approach, TODOs, etc?
>> Not sure we should keep this at IEP until list (#4 from original letter)
>> is not confirmed.
>>
>> Currently, I'm checking RunAll with RR enabled to almost each get request.
>> "Almost" means: readRepair = !ctx.readThrough() &&
>> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
>> For now I have 60 failed tests and amount decreasing.
>>
>> >> For instance, I would like to see all these limitations on the IEP
>> page as
>> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
>> ticket
>> >> in order to track all activities related to `Read Repair` feature.
>> Let's do this at merge day to be sure useless issues will not be created.
>>
>> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
>> slava.kopti...@gmail.com> wrote:
>>
>>> Hi Anton,
>>>
>>> I will take a look at your pull request if you don't mind.
>>>
>>> In any way, could you please update the IEP page with the list of
>>> constraints/limitations of the proposed approach, TODOs, etc?
>>> For instance, I would like to see all these limitations on the IEP page
>>> as
>>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
>>> in order to track all activities related to `Read Repair` feature.
>>>
>>> Thanks,
>>> S.
>>>
>>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <a...@apache.org>:
>>>
>>> > Igniters,
>>> > I'm glad to introduce Read Repair feature [0] provides additional
>>> > consistency guarantee for Ignite.
>>> >
>>> > 1) Why we need it?
>>> > The detailed explanation can be found at IEP-31 [1].
>>> > In short, because of bugs, it's possible to gain an inconsistent state.
>>> > We need additional features to handle this case.
>>> >
>>> > Currently we able to check cluster using Idle_verify [2] feature, but
>>> it
>>> > will not fix the data, will not even tell which entries are broken.
>>> > Read Repair is a feature to understand which entries are broken and to
>>> fix
>>> > them.
>>> >
>>> > 1) How it works?
>>> > IgniteCache now able to provide special proxy [3] withReadRepair().
>>> > This proxy guarantee that data will be gained from all owners and
>>> compared.
>>> > In the case of consistency violation situation, data will be recovered
>>> and
>>> > a special event recorded.
>>> >
>>> > 3) Naming?
>>> > Feature name based on Cassandra's Read Repair feature [4], which is
>>> pretty
>>> > similar.
>>> >
>>> > 4) Limitations which can be fixed in the future?
>>> >   * MVCC and Near caches are not supported.
>>> >   * Atomic caches can be checked (false positive case is possible on
>>> this
>>> > check), but can't be recovered.
>>> >   * Partial entry removal can't be recovered.
>>> >   * Entries streamed using data streamer (using not a "cache.put" based
>>> > updater) and loaded by cache.load
>>> >   are perceived as inconsistent since they may have different versions
>>> for
>>> > same keys.
>>> >   * Only explicit get operations are supported (getAndReplace,
>>> getAndPut,
>>> > etc can be supported in future).
>>> >
>>> > 5) What's left?
>>> >   * SQL/ThinClient/etc support.
>>> >   * Metrics (found/repaired).
>>> >   * Simple per-partition recovery feature able to work in the
>>> background in
>>> > addition to per-entry recovery feature.
>>> >
>>> > 6) Is code checked?
>>> >   * Pull Request #5656 [5] (feature) - has green TC.
>>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
>>> get()
>>> > request) - has a limited amount of failures (because of data streamer,
>>> > cache.load, etc).
>>> >
>>> > Thoughts?
>>> >
>>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
>>> > [1]
>>> >
>>> >
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
>>> > [2]
>>> >
>>> >
>>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>>> > [3]
>>> >
>>> >
>>> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
>>> > [4]
>>> >
>>> >
>>> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
>>> > [5] https://github.com/apache/ignite/pull/5656
>>> > [6] https://github.com/apache/ignite/pull/6575
>>> >
>>>
>>

Reply via email to