Thanks, Mukul, for your insights! I will explore a) and b) further.

-Rui

On Wed, Sep 9, 2020 at 12:40 PM Mukul Kumar Singh <[email protected]>
wrote:

> Hi Rui,
>
> One of the ways this was tested for the Ozone Data Pipeline was using:
>
> a) Blockade tests. Blockade is a Docker-based framework in which the
> network for one DN can be isolated from the others.
>
> b) MiniOzoneChaosCluster. This is a unit-test-based approach in which a
> random datanode is killed; it helped in finding consistency issues.
>
> Thanks,
> Mukul
>
> On 09/09/20 11:52 pm, Rui Wang wrote:
> > Hi community,
> >
> > The Ozone SCM HA [1] work is in progress. Ozone SCM HA uses Ratis to
> > reach consensus on its state. While working on it, one of the hard
> > problems I found is split-brain, in which two leaders co-exist, so SCM HA
> > needs to deal with stale commands from the old leader.
> >
> > One of the challenges is how to simulate network partitioning so we can
> > write meaningful tests to verify the implementation of dealing with stale
> > commands. This probably will require:
> >
> > 1. Having a config that makes the old leader never turn into a candidate
> > (e.g. by increasing the re-election timeout).
> > 2. Having a way to block the leader's inbound and outbound communication,
> > thereby creating a network partition.
> >
> > Item 1 should be easy to do. Do you know how to tackle item 2?
> >
> >
> > [1]: https://issues.apache.org/jira/browse/HDDS-2823
> >
> >
> > -Rui
> >
>
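
For illustration only, here is a minimal sketch of the scenario discussed in this thread: an old leader is cut off from the network, a new leader takes over with a higher term, and the old leader's commands are rejected as stale once the partition heals. This is a toy in-memory model, not Ratis or Ozone code. Every name in it (PartitionableNetwork, Follower, Command, run_scenario, the node names, and the payload strings) is hypothetical, and the term comparison is an assumed way of detecting stale commands, not the SCM HA design.

```python
from dataclasses import dataclass, field


@dataclass
class Command:
    term: int      # leader term the command was issued in (assumed field)
    payload: str   # hypothetical command body


@dataclass
class Follower:
    """Toy receiver that rejects commands from an outdated term."""
    name: str
    current_term: int = 0
    applied: list = field(default_factory=list)

    def handle(self, command):
        # A command stamped with an older term than the latest one seen
        # is treated as stale and ignored.
        if command.term < self.current_term:
            return False
        self.current_term = command.term
        self.applied.append(command.payload)
        return True


class PartitionableNetwork:
    """In-memory message bus that can isolate a node in both directions."""

    def __init__(self):
        self.nodes = {}
        self.isolated = set()

    def register(self, node):
        self.nodes[node.name] = node

    def isolate(self, name):
        self.isolated.add(name)

    def heal(self, name):
        self.isolated.discard(name)

    def send(self, src, dst, command):
        # Drop the message if either endpoint is partitioned away.
        if src in self.isolated or dst in self.isolated:
            return None
        return self.nodes[dst].handle(command)


def run_scenario():
    net = PartitionableNetwork()
    dn = Follower("dn1")
    net.register(dn)
    results = []
    # The old leader (term 1) is still connected: command is accepted.
    results.append(net.send("scm1", "dn1", Command(1, "create-pipeline")))
    # Partition the old leader: its traffic is dropped.
    net.isolate("scm1")
    results.append(net.send("scm1", "dn1", Command(1, "close-container")))
    # A new leader (term 2) is elected on the majority side.
    results.append(net.send("scm2", "dn1", Command(2, "close-pipeline")))
    # The partition heals; the old leader still believes it leads term 1.
    net.heal("scm1")
    results.append(net.send("scm1", "dn1", Command(1, "delete-block")))
    return results, dn.applied


if __name__ == "__main__":
    print(run_scenario())
```

In a real test, the isolate/heal calls would be replaced by whatever mechanism actually cuts the network (for example, the Blockade approach mentioned above), and the assertion of interest is that nothing issued by the old leader after the new term is observed gets applied.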
