[ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160635#comment-13160635 ]

Thomas Pan commented on HBASE-4925:
-----------------------------------

Here is the list of fault injection test cases that we've collected (a scripted 
sketch for the kill-based cases follows the list):
1. Kill -9 one region server and kill -9 the region server that serves the 
.META. table. 
2. While BES is writing data to an HBase table, kill -9 the region server that 
holds the .META. table. 
3. Kill -9 the region server that serves the .META. table. Then, kill -9 the 
region server that serves the -ROOT- table. [Thomas: Is that the case in our 
environment?] 
4. A large number of region servers get killed. After restoration, there is no 
data loss. 
5. No job impact while shifting from the primary HBase master to the secondary 
HBase master. 
6. Shift from the primary HBase master to the secondary HBase master after 
multiple region servers fail. 
7. Shift from the primary HBase master to the secondary HBase master after new 
region servers are added. 
8. Repeatedly stop and restart the primary HBase master. There should be no 
major impact as the secondary HBase master kicks in automatically. 
9. Shift from the primary HBase master to the secondary HBase master while a 
table with 3600 regions is being created. 
10. Disable network access for the node running the region server that serves 
the .META. table. 
11. Disable network access for the node running the primary HBase master. 
12. Disable network access for the node running the secondary HBase master. 
13. Trigger a short-lived network interruption for the node running the region 
server that serves the .META. table. 
14. Trigger a short-lived network interruption for the node running the 
primary HBase master. 
15. Trigger a short-lived network interruption for the node running the 
secondary HBase master. 
16. BES writes to a table heavily while CPU usage in the cluster is high. 
17. Restart one RS while CPU usage in the cluster is high. 
18. Offline data nodes while CPU usage in the cluster is high. 
19. BES writes to a table heavily while memory usage in the cluster is high. 
20. Restart one RS while memory usage in the cluster is high. 
21. Offline data nodes while memory usage in the cluster is high. 
22. With no load in the cluster, test failover from the primary NN to the 
secondary NN. 
23. With jobs running in the cluster, test failover from the primary NN to the 
secondary NN. 
24. Repeatedly stop and restart the primary NN to verify that NN failover 
works correctly. 
25. Kill -9 the primary ZooKeeper. The failover to the secondary NN should 
happen in time, with no job failures. 
26. Kill -9 the primary ZooKeeper and the primary NN. The cluster should 
quickly fail over to the secondary ZK and NN. 
27. Restart the node that holds the primary NN. 
28. Disable network access for the node running the primary NN. 
29. Trigger a short-lived network interruption for the node running the 
primary NN. 
30. Disable network access for the node running the primary ZK. 
31. Trigger a short-lived network interruption for the node running the 
primary ZK. 
32. Disable network access for the node running ZK in follower state. 
33. Trigger a short-lived network interruption for the node running ZK in 
follower state. 
34. Offline multiple data nodes at once. Keep them offline for a while. 
35. Offline multiple data nodes at once. Keep them offline for a while. Put 
them all back at once. 
37. Offline multiple data nodes at once. Put them all back immediately. 
38. Offline a single data node. Keep it offline for a while. 
39. Offline a single data node. Keep it offline for a while, then put it back. 
40. Offline a single data node. Put it back immediately. 
41. Hard disk failure in the primary NN triggers NN failover. 
42. The dfs.data.dir directory on a data node gets corrupted. 
43. Corrupted dfs.name.dir on the primary NN gets detected and triggers NN 
failover. 
44. Corrupted dfs.name.dir on the secondary NN gets detected. 
45. A data node runs out of disk space. 
46. Under heavy IO on data nodes, BES writes to a table heavily. 
47. Under heavy IO on data nodes, offline multiple data nodes.
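
Many of the kill-based cases above (for example 1-3 and 25-26) lend themselves 
to scripting. Below is a minimal Python sketch, not a finished harness: the 
passwordless SSH access, the -ROOT- scan through the hbase shell, and the 
output parsing are all assumptions that vary by cluster layout and HBase 
version.

#!/usr/bin/env python
# Sketch for cases 1-3: kill -9 the RegionServer that serves .META..
# Assumes passwordless SSH to cluster nodes and `hbase` on the PATH.
import subprocess

def meta_server_host():
    # In 0.90-era HBase, -ROOT- stores the location of .META. under
    # info:server; the parsing below is version-dependent (an assumption).
    out = subprocess.check_output(
        'echo "scan \'-ROOT-\'" | hbase shell',
        shell=True, universal_newlines=True)
    for line in out.splitlines():
        if 'info:server' in line and 'value=' in line:
            # e.g. ... column=info:server, ..., value=host.example.com:60020
            return line.split('value=')[1].split(':')[0].strip()
    raise RuntimeError('could not locate the .META. region server')

def kill_region_server(host):
    # kill -9 the HRegionServer JVM on the target host; the [H] in the
    # grep pattern keeps grep from matching its own process.
    subprocess.check_call(
        ['ssh', host,
         "ps aux | grep '[H]RegionServer' | awk '{print $2}' | xargs kill -9"])

if __name__ == '__main__':
    host = meta_server_host()
    print('kill -9 HRegionServer on %s' % host)
    kill_region_server(host)

The network cases (10-15 and 28-33) could be driven the same way by swapping 
the kill command for an iptables DROP rule on the target node and removing 
the rule after the desired interruption window.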
                
> Collect test cases for hadoop/hbase cluster
> -------------------------------------------
>
>                 Key: HBASE-4925
>                 URL: https://issues.apache.org/jira/browse/HBASE-4925
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: test
>            Reporter: Thomas Pan
>
> This entry is used to collect all the useful test cases for verifying a 
> hadoop/hbase cluster. This is a follow-up to yesterday's hack day at 
> Salesforce. Hopefully the information will be useful for the whole 
> community.
