Thanks Mukul for your insights! I will explore more on a) and b)!
-Rui

On Wed, Sep 9, 2020 at 12:40 PM Mukul Kumar Singh <[email protected]> wrote:

> Hi Rui,
>
> One way this was tested for the Ozone data pipeline was with:
>
> a) Blockade tests. Blockade is a Docker-based framework in which the
> network for one DN can be isolated from the others.
>
> b) MiniOzoneChaosCluster. This is a unit-test-based test in which a
> random datanode was killed, which helped uncover consistency issues.
>
> Thanks,
> Mukul
>
> On 09/09/20 11:52 pm, Rui Wang wrote:
> > Hi community,
> >
> > The Ozone SCM HA [1] is happening. Ozone SCM HA utilizes Ratis to build
> > consensus on its state. While working on it, one of the hard problems I
> > found is split-brain, in which two leaders coexist, so SCM HA needs to
> > deal with stale commands from the old leader.
> >
> > One of the challenges is how to simulate network partitioning so we can
> > write meaningful tests to verify the handling of stale commands. This
> > will probably require:
> >
> > 1. Having a config that keeps the old leader from ever turning into a
> > candidate (e.g. by increasing the re-election timeout).
> > 2. Having a way to block the leader's inbound and outbound
> > communication, creating a network partition.
> >
> > Item 1 should work easily. Do you know how to tackle item 2?
> >
> > [1]: https://issues.apache.org/jira/browse/HDDS-2823
> >
> > -Rui
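
For the stale-command problem described in the original message, one common technique is term-based fencing: the receiver remembers the highest leader term it has seen and drops any command stamped with a lower term. Below is a minimal sketch in Java, assuming each command carries the Raft term of the leader that issued it. `TermFence` and `accept` are hypothetical names for illustration only; they are not Ozone or Ratis APIs, and this is not necessarily how SCM HA handles it.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical illustration of term-based fencing.
 * Not an Ozone or Ratis class.
 */
final class TermFence {
  // Highest leader term observed so far by this receiver.
  private final AtomicLong highestTerm = new AtomicLong(0);

  /**
   * Returns true if a command issued in commandTerm should be executed,
   * false if it comes from an older (stale) leader term.
   */
  boolean accept(long commandTerm) {
    // Atomically raise the stored term to max(stored, commandTerm).
    long newest = highestTerm.accumulateAndGet(commandTerm, Math::max);
    // The command is current only if its term is the newest one known.
    return newest == commandTerm;
  }
}
```

With a fence like this, a receiver that has already accepted a command from term 2 would reject a late command from the term 1 leader, which is the behavior a partition test would want to verify.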
