Hi Rui,

One way this was tested for the Ozone data pipeline was with:

a) Blockade tests. Blockade is a Docker-based framework in which the network of one DN (datanode) can be isolated from the others.

b) MiniOzoneChaosCluster. This is a unit-test-based test in which a random datanode is killed; it helped uncover consistency issues.
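
To illustrate the idea behind (b), here is a rough sketch in Python. It is a toy model only, not the MiniOzoneChaosCluster API: ToyNode, ToyCluster and run_chaos are made-up names, and the majority-ack write and catch-up-on-restart behaviour are assumptions standing in for the real pipeline.

```python
import random


class ToyNode:
    """Hypothetical stand-in for a datanode: a log plus an 'up' flag."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.log = []


class ToyCluster:
    """Hypothetical cluster that commits a write once a majority acks it."""

    def __init__(self, size):
        self.nodes = [ToyNode(f"dn{i}") for i in range(size)]
        self.committed = []  # entries acknowledged by a majority

    def write(self, entry):
        live = [n for n in self.nodes if n.up]
        # Refuse the write unless a strict majority of nodes is reachable.
        if len(live) * 2 <= len(self.nodes):
            return False
        for node in live:
            node.log.append(entry)
        self.committed.append(entry)
        return True

    def kill(self, node):
        node.up = False

    def restart(self, node):
        # Assumption: a restarted node catches up from the committed log.
        node.log = list(self.committed)
        node.up = True


def run_chaos(rounds, seed=0):
    """Kill a random node each round, write, restart it, check consistency."""
    rng = random.Random(seed)  # fixed seed so a failure can be replayed
    cluster = ToyCluster(3)
    for i in range(rounds):
        victim = rng.choice(cluster.nodes)
        cluster.kill(victim)
        assert cluster.write(f"entry-{i}"), "majority should still be up"
        cluster.restart(victim)
        # Invariant: every node agrees with the committed log.
        for node in cluster.nodes:
            assert node.log == cluster.committed, node.name
    return cluster
```

Calling run_chaos(20) runs twenty kill/write/restart rounds. A real chaos test would check the same kind of invariant against actual datanodes instead of in-memory lists.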

Thanks,
Mukul

On 09/09/20 11:52 pm, Rui Wang wrote:
Hi community,

The Ozone SCM HA [1] is happening. Ozone SCM HA utilizes Ratis to build its
consensus on states. When working on it, one of the hard problems I found
is split-brain, in which two leaders co-exist, so SCM HA needs to deal with
stale commands from the old leader.

One of the challenges is how to simulate network partitioning so we can
write meaningful tests to verify the implementation of dealing with stale
commands. This probably will require:

1. Have a config to make the old leader never turn into a candidate (e.g.
increase the timeout of re-election)
2. Have a way to block the in/out communication of the leader, thereby
creating a network partition.

Item 1 should be easy to do. Do you know how to tackle item 2?


[1]: https://issues.apache.org/jira/browse/HDDS-2823


-Rui
