[jira] Updated: (HBASE-2414) Enhance test suite to be able to specify distributed scenarios

stack (JIRA) Sun, 25 Apr 2010 20:54:01 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-2414:
-------------------------

    Attachment: testmaster-v8.patch

This patch is not for review.  It's a mess that still has two testing 
techniques wrapped up in it and is in need of cleanup.

My pursuit of a 'direct'/no-network hbase (technique 'one') ran into the 
weeds.  All of the client logic down in HConnectionManager#TableServers 
would need to be redone so it's 'direct'.  Subclassing 
HConnectionManager#TableServers helped in that I could leverage what was 
already there, but there is still a lot to be done. So, I tried reproducing 
cluster bad-cases using old-school minihbasecluster machinations (technique 
'two').

The pursuit of technique 'one' opened up our minihbasecluster code making it so 
I was able to write a simple test to repro what is seen here in the stack trace 
at the head of this issue (patch includes the code in the TestClusterTransition 
junit test).  

Here is what my test is showing...

{code}
2010-04-25 20:31:07,946 INFO  [RegionServer:1] regionserver.HRegionServer(649): aborting server at: 192.168.1.106:63335
2010-04-25 20:31:07,956 INFO  [main-EventThread] master.ServerManager$ServerExpirer(831): 192.168.1.106,63335,1272252645013 znode expired
2010-04-25 20:31:07,957 INFO  [main-EventThread] master.RegionManager(795): META region removed from onlineMetaRegions
2010-04-25 20:31:07,958 DEBUG [RegionServer:1] zookeeper.ZooKeeperWrapper(682): Closed connection with ZooKeeper
2010-04-25 20:31:07,958 INFO  [RegionServer:1] regionserver.HRegionServer(696): RegionServer:1 exiting
2010-04-25 20:31:07,960 INFO  [main] regionserver.HRegionServer(261): My address is 192.168.1.106:0
2010-04-25 20:31:07,961 INFO  [main] ipc.HBaseRpcMetrics(52): Initializing RPC Metrics with hostName=HRegionServer, port=63412
2010-04-25 20:31:07,962 INFO  [main] regionserver.MemStoreFlusher(102): globalMemStoreLimit=32.6m, globalMemStoreLimitLowMark=20.4m, maxHeap=81.4m
2010-04-25 20:31:07,963 INFO  [main] regionserver.HRegionServer$MajorCompactionChecker(984): Runs every 1000000ms
2010-04-25 20:31:07,968 DEBUG [HMaster] master.HMaster(506): Processing todo: ProcessServerShutdown of 192.168.1.106,63335,1272252645013
...
# While the server shutdown was being processed, in came the new RS instance 
# w/ same port and load balancing kicked in....
...
2010-04-25 20:31:08,187 INFO  [RegionServer:1] regionserver.HRegionServer(1202): HRegionServer started at: 192.168.1.106:63412
2010-04-25 20:31:08,188 DEBUG [RegionServer:1] zookeeper.ZooKeeperWrapper(398): Read ZNode /hbase/root-region-server got 192.168.1.106:63333
2010-04-25 20:31:08,218 DEBUG [pool-1-thread-1] regionserver.HLog$1(1278): Thread got 53 to process
2010-04-25 20:31:08,222 DEBUG [IPC Server handler 4 on 60000] master.RegionManager$LoadBalancer(1447): Server is overloaded: load=15, avg=7.5, slop=0.3
...
# Then it fell into this while processing a region close

2010-04-25 20:31:08,360 DEBUG [HMaster] master.HMaster(506): Processing todo: ProcessRegionClose of 2428,fff,1272252656267, false, reassign: true
2010-04-25 20:31:08,362 DEBUG [HMaster] master.RetryableMetaOperation(95): Exception in RetryableMetaOperation: 
java.lang.NullPointerException
        at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:65)
        at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:89)
        at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:510)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:445)
2010-04-25 20:31:08,367 WARN  [HMaster] master.HMaster(546): Processing pending operations: ProcessRegionClose of 2428,fff,1272252656267, false, reassign: true
java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:96)
        at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:89)
        at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:510)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:445)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:65)
        ... 3 more
..

and so on...
{code}

It required some knowledge of minihbasecluster internals but it's not too bad 
methinks.  I've added a bunch of doc so others can follow.  

Let me clean up and repro more of the recent cluster failings in unit test 
scenario using minihbasecluster.
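
For anyone trying to follow the ordering in the log above, here is a toy model 
of one plausible reading of it: the expiry handler drops the META entry from an 
"online" map, and an operation queued behind it then dereferences a lookup 
that now returns null.  This is a sketch under assumptions, not the master 
code.  The trace does not confirm that this is what line 65 of 
RetryableMetaOperation trips on, and every name below (MetaLookupRaceSketch, 
expireServer, unguardedLookup, guardedLookup) is hypothetical.  It needs 
Java 8 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model only; none of these names are HBase classes. */
public class MetaLookupRaceSketch {

    /** Stand-in for the master's map of meta region name to hosting server. */
    static final Map<String, String> onlineMetaRegions = new ConcurrentHashMap<>();

    /** Mimics expiry handling: every region hosted by the dead server leaves the map. */
    static void expireServer(String serverAddress) {
        onlineMetaRegions.values().removeIf(serverAddress::equals);
    }

    /** Unguarded: uses the lookup result directly, as a queued operation might. */
    static int unguardedLookup(String metaRegion) {
        String server = onlineMetaRegions.get(metaRegion);
        return server.length(); // throws NullPointerException if the entry is gone
    }

    /** Guarded: reports whether the meta region is online so the caller can retry later. */
    static boolean guardedLookup(String metaRegion) {
        return onlineMetaRegions.get(metaRegion) != null;
    }

    public static void main(String[] args) {
        // Assumed region name and address, chosen to echo the log above.
        onlineMetaRegions.put(".META.,,1", "192.168.1.106:63335");

        // Step 1: the region server's znode expires, so its META entry is removed.
        expireServer("192.168.1.106:63335");

        // Step 2: an operation queued before the expiry now runs against the map.
        try {
            unguardedLookup(".META.,,1");
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup hit a NullPointerException");
        }
        System.out.println("guarded lookup says online: " + guardedLookup(".META.,,1"));
    }
}
```

The guarded variant is only there to show the contrast; whether the right 
answer in the master is a null check, a requeue, or something else is not 
settled by this sketch.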

> Enhance test suite to be able to specify distributed scenarios
> --------------------------------------------------------------
>
>                 Key: HBASE-2414
>                 URL: https://issues.apache.org/jira/browse/HBASE-2414
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.20.3
>            Reporter: Karthik Ranganathan
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.20.5, 0.21.0
>
>         Attachments: directcluster.txt, master2.txt, testmaster-v4.patch, 
> testmaster-v5.patch, testmaster-v7.patch, testmaster-v8.patch
>
>
> We keep finding good cases that are reasonably hard to test, yet the test 
> suite does not encode these. 
> For example: 
> HBASE-2413 Master does not respect generation stamps, may result in meta 
> getting permanently offlined
> HBASE-2312 Possible data loss when RS goes into GC pause while rolling HLog
> I am sure there are many more such "scenarios" we should put into the unit 
> tests. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
