[
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799133#comment-13799133
]
chendihao commented on HBASE-9802:
----------------------------------
Thanks for paying attention to our work. We are now trying to separate the
HBase-specific parts from this framework so that it can be reused for HDFS,
ZooKeeper and other HA services. As [[email protected]] said, we want to make
it more generic and provide just an extensible framework, so that everyone can
implement their own actions to inject failures into their systems.
Thanks, [~elserj]; we will learn more about Accumulo. Currently we use
tc (traffic control) to simulate network delay, dd to fill the disk, and other
tools to simulate network/disk/CPU/memory failures. It would be helpful if our
test servers provided these interfaces. I think we can implement this
generically and share it with the community.
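
As a rough illustration of what a pluggable failure-injection action could
look like, here is a minimal Java sketch. All names in it
(FaultInjectionSketch, FaultAction, NetworkDelayAction, DiskFillAction) are
hypothetical and are not taken from the framework; the device name, file path
and sizes are placeholders. It only builds the tc and dd command lines and
does not execute them.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these types are assumptions, not the framework's API. */
class FaultInjectionSketch {

    /** One injectable failure: a command to cause it and a command to undo it. */
    interface FaultAction {
        List<String> injectCommand();
        List<String> recoverCommand();
    }

    /** Adds a fixed delay to a network interface using tc with the netem qdisc. */
    static final class NetworkDelayAction implements FaultAction {
        private final String device;
        private final int delayMillis;

        NetworkDelayAction(String device, int delayMillis) {
            this.device = device;
            this.delayMillis = delayMillis;
        }

        @Override
        public List<String> injectCommand() {
            return Arrays.asList("tc", "qdisc", "add", "dev", device,
                    "root", "netem", "delay", delayMillis + "ms");
        }

        @Override
        public List<String> recoverCommand() {
            return Arrays.asList("tc", "qdisc", "del", "dev", device, "root", "netem");
        }
    }

    /** Writes a file of zeros with dd to consume disk space. */
    static final class DiskFillAction implements FaultAction {
        private final String fillFile;
        private final long megabytes;

        DiskFillAction(String fillFile, long megabytes) {
            this.fillFile = fillFile;
            this.megabytes = megabytes;
        }

        @Override
        public List<String> injectCommand() {
            return Arrays.asList("dd", "if=/dev/zero", "of=" + fillFile,
                    "bs=1M", "count=" + megabytes);
        }

        @Override
        public List<String> recoverCommand() {
            return Arrays.asList("rm", "-f", fillFile);
        }
    }

    public static void main(String[] args) {
        // Placeholder arguments; a real test would read them from configuration.
        List<FaultAction> actions = Arrays.asList(
                new NetworkDelayAction("eth0", 100),
                new DiskFillAction("/tmp/fill.bin", 1024));
        for (FaultAction action : actions) {
            System.out.println(String.join(" ", action.injectCommand()));
            System.out.println(String.join(" ", action.recoverCommand()));
        }
    }
}
```

The command lists could be handed to java.lang.ProcessBuilder on the target
host (root privileges are needed for tc). Keeping the inject and recover
commands as plain data makes them easy to log, which matters if a failed run
has to be reproduced later.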
> A new failover test framework for HBase
> ---------------------------------------
>
> Key: HBASE-9802
> URL: https://issues.apache.org/jira/browse/HBASE-9802
> Project: HBase
> Issue Type: Improvement
> Components: test
> Affects Versions: 0.94.3
> Reporter: chendihao
> Priority: Minor
>
> Currently HBase uses ChaosMonkey for integration tests and fault injection.
> It restarts regionservers, forces the balancer to run and performs other
> actions randomly and periodically. However, we need a more extensible and
> full-featured framework for our failover tests, and we find that ChaosMonkey
> can't suit our needs because it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level,
> hardware-level and network-level actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs,
> such as those that cause data inconsistency, may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem and it is hard to
> figure out the cause.
> Therefore, we have developed a new framework to satisfy the needs of failover
> testing. We extended ChaosMonkey and implemented functions to validate data
> and to replay failed actions. Here are the features we added.
> 1) Policy/Task/Action abstraction: separating Task from Policy and Action
> makes it easier to manage and replay a set of actions.
> 2) Configurable actions: we have implemented some actions that cause
> machine failures, defined with the same interface as the original actions.
> 3) We validate data consistency before and after the failover test to
> ensure availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the
> table.
> 5) The set of actions that caused a test failure can be replayed, and the
> reproducibility of actions helps in fixing the exposed bugs.
> Our team has developed this framework and run it for a while. Some bugs were
> exposed and fixed by running this test framework. Moreover, we have a monitor
> program which shows the progress of the failover test and makes sure our
> cluster is as stable as we want. Now we are trying to make it more general
> and will open-source it later.
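
The Policy/Task/Action split and the replay feature described in the quoted
issue could be sketched roughly as follows. This is only an illustration
under assumptions: ReplaySketch, Action, Task, RandomPolicy and named() are
hypothetical names, and the real framework may organize these pieces
differently. Here a policy picks actions at random, a task fixes the chosen
order, and replaying means running the same task again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch; these types are assumptions, not the framework's API. */
class ReplaySketch {

    /** A single failure-injection step. */
    interface Action {
        String name();
        void perform();
    }

    /** An ordered, immutable list of actions that can be run again as-is. */
    static final class Task {
        private final List<Action> actions;

        Task(List<Action> actions) {
            this.actions = Collections.unmodifiableList(new ArrayList<>(actions));
        }

        /** Performs every action in order and returns the names in that order. */
        List<String> run() {
            List<String> performed = new ArrayList<>();
            for (Action action : actions) {
                action.perform();
                performed.add(action.name());
            }
            return performed;
        }
    }

    /** Chooses actions at random from a pool; the choice is frozen into a Task. */
    static final class RandomPolicy {
        private final List<Action> pool;

        RandomPolicy(List<Action> pool) {
            this.pool = new ArrayList<>(pool);
        }

        Task nextTask(Random random, int length) {
            List<Action> chosen = new ArrayList<>();
            for (int i = 0; i < length; i++) {
                chosen.add(pool.get(random.nextInt(pool.size())));
            }
            return new Task(chosen);
        }
    }

    /** Helper that builds a stand-in action which only records its own name. */
    static Action named(String name, List<String> sink) {
        return new Action() {
            @Override public String name() { return name; }
            @Override public void perform() { sink.add(name); }
        };
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        List<Action> pool = Arrays.asList(
                named("restart-regionserver", sink),
                named("force-balancer", sink),
                named("network-delay", sink));
        Task task = new RandomPolicy(pool).nextTask(new Random(), 4);
        System.out.println("first run: " + task.run());
        // Replaying the task repeats the same actions in the same order.
        System.out.println("replay:    " + task.run());
    }
}
```

Because the random choice happens in the policy and the task only stores the
result, a task that led to a test failure can be kept and replayed without
depending on the random number generator.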
--
This message was sent by Atlassian JIRA
(v6.1#6144)