[ https://issues.apache.org/jira/browse/HBASE-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-2414:
-------------------------
Attachment: testmaster-v8.patch
This patch is not for review. It's a mess that still has two testing techniques
wrapped up in it and is in need of cleanup.
My pursuit of a 'direct'/no-network hbase (technique 'one') ran into the weeds.
All of the client logic down in HConnectionManager#TableServers would need to
be redone so it's 'direct'. Subclassing HConnectionManager#TableServers helped
in that I could leverage what was already there, but there is still a lot to be
done. So I tried reproducing cluster bad-cases using old-school minihbasecluster
machinations (technique 'two').
The pursuit of technique 'one' opened up our minihbasecluster code, so I was
able to write a simple test that repros the stack trace at the head of this
issue (the patch includes the code, in the TestClusterTransition junit test).
Here is what my test is showing...
{code}
2010-04-25 20:31:07,946 INFO [RegionServer:1] regionserver.HRegionServer(649):
aborting server at: 192.168.1.106:63335
2010-04-25 20:31:07,956 INFO [main-EventThread]
master.ServerManager$ServerExpirer(831): 192.168.1.106,63335,1272252645013
znode expired
2010-04-25 20:31:07,957 INFO [main-EventThread] master.RegionManager(795):
META region removed from onlineMetaRegions
2010-04-25 20:31:07,958 DEBUG [RegionServer:1] zookeeper.ZooKeeperWrapper(682):
Closed connection with ZooKeeper
2010-04-25 20:31:07,958 INFO [RegionServer:1] regionserver.HRegionServer(696):
RegionServer:1 exiting
2010-04-25 20:31:07,960 INFO [main] regionserver.HRegionServer(261): My
address is 192.168.1.106:0
2010-04-25 20:31:07,961 INFO [main] ipc.HBaseRpcMetrics(52): Initializing RPC
Metrics with hostName=HRegionServer, port=63412
2010-04-25 20:31:07,962 INFO [main] regionserver.MemStoreFlusher(102):
globalMemStoreLimit=32.6m, globalMemStoreLimitLowMark=20.4m, maxHeap=81.4m
2010-04-25 20:31:07,963 INFO [main]
regionserver.HRegionServer$MajorCompactionChecker(984): Runs every 1000000ms
2010-04-25 20:31:07,968 DEBUG [HMaster] master.HMaster(506): Processing todo:
ProcessServerShutdown of 192.168.1.106,63335,1272252645013
...
# While the master was processing the server shutdown, in came the new RS
# instance w/ same port and load balancing kicked in....
...
2010-04-25 20:31:08,187 INFO [RegionServer:1]
regionserver.HRegionServer(1202): HRegionServer started at: 192.168.1.106:63412
2010-04-25 20:31:08,188 DEBUG [RegionServer:1] zookeeper.ZooKeeperWrapper(398):
Read ZNode /hbase/root-region-server got 192.168.1.106:63333
2010-04-25 20:31:08,218 DEBUG [pool-1-thread-1] regionserver.HLog$1(1278):
Thread got 53 to process
2010-04-25 20:31:08,222 DEBUG [IPC Server handler 4 on 60000]
master.RegionManager$LoadBalancer(1447): Server is overloaded: load=15,
avg=7.5, slop=0.3
...
# Then fell into this while processing a region close
2010-04-25 20:31:08,360 DEBUG [HMaster] master.HMaster(506): Processing todo:
ProcessRegionClose of 2428,fff,1272252656267, false, reassign: true
2010-04-25 20:31:08,362 DEBUG [HMaster] master.RetryableMetaOperation(95):
Exception in RetryableMetaOperation:
java.lang.NullPointerException
at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:65)
at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:89)
at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:510)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:445)
2010-04-25 20:31:08,367 WARN [HMaster] master.HMaster(546): Processing pending
operations: ProcessRegionClose of 2428,fff,1272252656267, false, reassign: true
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:96)
at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:89)
at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:510)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:445)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:65)
... 3 more
..
and so on...
{code}
It required some knowledge of minihbasecluster internals, but it's not too bad
methinks. I've added a bunch of doc so others can follow.
Let me clean up and repro more of the recent cluster failings as unit test
scenarios using minihbasecluster.
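To make the sequence in the log above concrete, here is a toy Java model of one
plausible reading of it: the expired server was carrying META, META gets
dropped from onlineMetaRegions, and a region close processed afterwards
dereferences a META lookup that now comes back null. That reading is an
assumption; the log does not say what was null at RetryableMetaOperation.java:65.
ToyMaster, replay and the guarded flag are made up for illustration and are not
HBase classes or the code in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are HBase classes.
class MetaOfflineRaceSketch {

    /** Hypothetical stand-in for the master's view of online META regions. */
    static final class ToyMaster {
        // META region name -> server currently hosting it.
        final Map<String, String> onlineMetaRegions = new HashMap<>();

        /** Mirrors the "znode expired" step: drop any META hosted by the dead server. */
        void expireServer(String server) {
            onlineMetaRegions.values().removeIf(server::equals);
        }

        /** Mirrors a region close being processed after META went offline. */
        String processRegionClose(String region, String metaRegion, boolean guarded) {
            String metaServer = onlineMetaRegions.get(metaRegion);
            if (guarded && metaServer == null) {
                // Assumed shape of a fix: put the operation back until META is reassigned.
                return "requeued";
            }
            // Unguarded path dereferences the lookup result directly.
            return "row for " + region + " updated via " + metaServer.toString();
        }
    }

    /** Replays expire-then-close and reports what the close operation did. */
    static String replay(boolean guarded) {
        ToyMaster master = new ToyMaster();
        String deadServer = "192.168.1.106,63335,1272252645013";
        String metaRegion = ".META.,,1";
        master.onlineMetaRegions.put(metaRegion, deadServer);
        master.expireServer(deadServer);
        try {
            return master.processRegionClose("2428,fff,1272252656267", metaRegion, guarded);
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // prints "NullPointerException"
        System.out.println(replay(true));  // prints "requeued"
    }
}
```

The guarded branch is only there to show the shape of a possible fix; it is not
what the patch does.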
> Enhance test suite to be able to specify distributed scenarios
> --------------------------------------------------------------
>
> Key: HBASE-2414
> URL: https://issues.apache.org/jira/browse/HBASE-2414
> Project: Hadoop HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 0.20.3
> Reporter: Karthik Ranganathan
> Assignee: stack
> Priority: Blocker
> Fix For: 0.20.5, 0.21.0
>
> Attachments: directcluster.txt, master2.txt, testmaster-v4.patch,
> testmaster-v5.patch, testmaster-v7.patch, testmaster-v8.patch
>
>
> We keep finding good cases that are reasonably hard to test, yet the test
> suite does not encode these.
> For example:
> HBASE-2413 Master does not respect generation stamps, may result in meta
> getting permanently offlined
> HBASE-2312 Possible data loss when RS goes into GC pause while rolling HLog
> I am sure there are many more such "scenarios" we should put into the unit
> tests.