Hi Mukul,

Do you think it makes sense to bring the idea of MiniOzoneChaosCluster
to Ratis, so that we get more test coverage for nodes going down at random?


-Rui

On Wed, Sep 9, 2020 at 12:55 PM Rui Wang <[email protected]> wrote:

> Thanks, Mukul, for your insights! I will explore a) and b) further!
>
>
> -Rui
>
> On Wed, Sep 9, 2020 at 12:40 PM Mukul Kumar Singh <
> [email protected]> wrote:
>
>> Hi Rui,
>>
>> This was tested for the Ozone data pipeline in two ways:
>>
>> a) Blockade tests. Blockade is a Docker-based framework in which the
>> network of one datanode (DN) can be isolated from the others.
>>
>> b) MiniOzoneChaosCluster. This is a unit-test-based test in which a
>> random datanode is killed; it helped uncover consistency issues.
>>
>> Thanks,
>> Mukul
>>
>> On 09/09/20 11:52 pm, Rui Wang wrote:
>> > Hi community,
>> >
>> > The Ozone SCM HA [1] work is underway. Ozone SCM HA uses Ratis to reach
>> > consensus on its state. While working on it, one of the hard problems I
>> > found is split-brain, in which two leaders co-exist, so SCM HA needs to
>> > deal with stale commands from the old leader.
>> >
>> > One of the challenges is how to simulate network partitioning so we can
>> > write meaningful tests to verify the implementation of dealing with
>> > stale commands. This will probably require:
>> >
>> > 1. A config that keeps the old leader from ever turning into a candidate
>> > (e.g. by increasing the re-election timeout).
>> > 2. A way to block the leader's inbound and outbound communication,
>> > thereby creating a network partition.
>> >
>> > Item 1 should be easy to do. Do you know how to tackle item 2?
>> >
>> >
>> > [1]: https://issues.apache.org/jira/browse/HDDS-2823
>> >
>> >
>> > -Rui
>> >
>>
>
