Hi Mukul,

Do you think it makes sense to bring the idea of MiniOzoneChaosCluster to Ratis, so that we get more testing of how it behaves when random nodes go down?
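
To make the idea concrete, here is a rough, self-contained sketch of the kind of chaos loop I have in mind. It is only an illustration: ToyNode, ToyCluster and runChaos are hypothetical names made up for this mail, not the MiniOzoneChaosCluster or Ratis API, and the "replication" is a toy majority write rather than real Raft.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy stand-in for a chaos test; none of these types exist in Ratis or Ozone. */
public class ToyChaosSketch {

  /** Hypothetical replica that only stores an append-only log. */
  static final class ToyNode {
    final List<Integer> log = new ArrayList<>();
    boolean alive = true;
  }

  /** Hypothetical cluster that commits an entry once a majority is alive to store it. */
  static final class ToyCluster {
    final List<ToyNode> nodes = new ArrayList<>();
    final List<Integer> committed = new ArrayList<>();

    ToyCluster(int size) {
      for (int i = 0; i < size; i++) {
        nodes.add(new ToyNode());
      }
    }

    /** Returns true if a strict majority was alive and the entry was committed. */
    boolean write(int entry) {
      int alive = 0;
      for (ToyNode n : nodes) {
        if (n.alive) alive++;
      }
      if (alive * 2 <= nodes.size()) {
        return false; // no quorum: reject the write
      }
      for (ToyNode n : nodes) {
        if (n.alive) n.log.add(entry);
      }
      committed.add(entry);
      return true;
    }

    void kill(int i) {
      nodes.get(i).alive = false;
    }

    /** Restart: the node catches up by copying the committed log. */
    void restart(int i) {
      ToyNode n = nodes.get(i);
      n.log.clear();
      n.log.addAll(committed);
      n.alive = true;
    }
  }

  /** Chaos loop: kill one random node per round, write, restart, then compare logs. */
  public static boolean runChaos(int clusterSize, int rounds, long seed) {
    Random random = new Random(seed); // fixed seed keeps a failing run reproducible
    ToyCluster cluster = new ToyCluster(clusterSize);
    for (int round = 0; round < rounds; round++) {
      int victim = random.nextInt(clusterSize);
      cluster.kill(victim);
      if (!cluster.write(round)) {
        return false; // a single failure should not cost us the quorum
      }
      cluster.restart(victim);
      for (ToyNode n : cluster.nodes) {
        if (!n.log.equals(cluster.committed)) {
          return false; // replicas diverged
        }
      }
    }
    return cluster.committed.size() == rounds;
  }

  public static void main(String[] args) {
    System.out.println("consistent = " + runChaos(3, 100, 42L));
  }
}
```

In a real test the kill/restart calls would presumably go through whatever the Ratis test cluster offers, and the consistency check would compare the actual state machines instead of in-memory lists.
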
-Rui

On Wed, Sep 9, 2020 at 12:55 PM Rui Wang <[email protected]> wrote:

> Thanks Mukul for your insights! I will explore more on a) and b)!
>
> -Rui
>
> On Wed, Sep 9, 2020 at 12:40 PM Mukul Kumar Singh <[email protected]> wrote:
>
>> Hi Rui,
>>
>> One of the ways this was tested for the Ozone Data Pipeline was using:
>>
>> a) Blockade tests. Blockade is a Docker-based framework where the
>> network for one DN can be isolated from the others.
>>
>> b) MiniOzoneChaosCluster. This is a unit-test-based test where a
>> random datanode was killed, and this helped in finding issues with
>> consistency.
>>
>> Thanks,
>> Mukul
>>
>> On 09/09/20 11:52 pm, Rui Wang wrote:
>> > Hi community,
>> >
>> > The Ozone SCM HA [1] is happening. Ozone SCM HA utilizes Ratis to build
>> > its consensus on states. When working on it, one of the hard problems I
>> > found is split-brain, in which two leaders co-exist, so SCM HA needs to
>> > deal with stale commands from the old leader.
>> >
>> > One of the challenges is how to simulate network partitioning so we can
>> > write meaningful tests to verify the implementation of dealing with
>> > stale commands. This probably will require:
>> >
>> > 1. Have a config to make the old leader never turn to candidate (e.g.
>> > increase the timeout of re-election)
>> > 2. Have a way to block the in/out communication of the leader, thereby
>> > creating a network partitioning case.
>> >
>> > The 1 should easily work. Do you know how to tackle the 2?
>> >
>> > [1]: https://issues.apache.org/jira/browse/HDDS-2823
>> >
>> > -Rui
