[
https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858392#comment-13858392
]
Jean-Marc Spaggiari commented on HBASE-8912:
--------------------------------------------
I tried the patch, and I think it just moved the issue further along :(
First, I restored the default balancer to get normal behaviour.
{code}
2013-12-29 13:20:24,408 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server node1.domain.com,60020,1388341141398: Exception refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/unassigned/87dc596f763bd1b43a63c4afd93e4f00
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:811)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:747)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:674)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2013-12-29 13:20:24,413 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2013-12-29 13:20:24,420 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
2013-12-29 13:20:24,421 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node for 404a7ac95dc8ce89826206453c501e2a from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed and was in the expected state but then when setting data we got a version mismatch
2013-12-29 13:20:24,423 INFO org.mortbay.log: Stopped [email protected]:60030
2013-12-29 13:20:24,434 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node for 87dc596f763bd1b43a63c4afd93e4f00 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341141398
2013-12-29 13:20:24,435 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark region {NAME => 'page,moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull,1379303806726.87dc596f763bd1b43a63c4afd93e4f00.', STARTKEY => 'moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull', ENDKEY => 'moc.nuhc9.iahgnahs\x1Fhttp\x1F-1\x1F/travels/23865/\x1Fnull', ENCODED => 87dc596f763bd1b43a63c4afd93e4f00,} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
2013-12-29 13:20:24,435 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION
java.io.IOException: Aborting flush because server is abortted...
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1556)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1539)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1034)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:982)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:947)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.cleanupFailedOpen(OpenRegionHandler.java:365)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:115)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{code}
It crashed one region server.
I stopped the cluster and restarted it, and then one region stayed pending in
transition for more than 5 minutes.
{code}
2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING failed, the node existed but was version 7 not the expected version 6
2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=75c96fb5c15793e04fb71d553a51619b, context=post_region_open
2013-12-29 13:22:37,749 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341328265
2013-12-29 13:22:37,751 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark region {NAME => 'page,ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011,1384385444837.75c96fb5c15793e04fb71d553a51619b.', STARTKEY => 'ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011', ENDKEY => 'ac.efilthgin\x1Fhttp\x1F-1\x1F/directory/all/all/all-virtuelle+four-bois+sport+piano+ecrans-geants+europeen+sandwichs+bar-etudiant+desserts+bluegrass+open-bar+jam\x1Fnull', ENCODED => 75c96fb5c15793e04fb71d553a51619b,} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
{code}
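For anyone reading the "was version 7 not the expected version 6" line above: ZooKeeper's setData takes an expected version and throws BadVersionException when the znode has been modified since that version was read. The sketch below is only an in-memory illustration of that check, not HBase or ZooKeeper code. The class VersionedNode and its methods are hypothetical names, and the sequence in main is my assumed reading of the log (some other writer updated the node between the region server's read and its write).

```java
/** Hypothetical in-memory stand-in for one znode; this is NOT the ZooKeeper API. */
final class VersionedNode {

    /** Raised when the caller's expected version is stale (models BadVersion). */
    static final class BadVersionException extends Exception {
        BadVersionException(int expected, int actual) {
            super("node was version " + actual + " not the expected version " + expected);
        }
    }

    private String data;
    private int version;

    VersionedNode(String data, int version) {
        this.data = data;
        this.version = version;
    }

    synchronized String getData() {
        return data;
    }

    synchronized int getVersion() {
        return version;
    }

    /** Conditional write: applies only if expectedVersion is current, then bumps the version. */
    synchronized int setData(String newData, int expectedVersion) throws BadVersionException {
        if (expectedVersion != version) {
            throw new BadVersionException(expectedVersion, version);
        }
        data = newData;
        version++;
        return version;
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode("RS_ZK_REGION_OPENING", 6);

        // The region server remembers the version it last saw.
        int seenByRegionServer = node.getVersion();

        try {
            // Assumed interleaving: another writer changes the node first (6 -> 7).
            node.setData("M_ZK_REGION_OFFLINE", seenByRegionServer);

            // The region server now refreshes OPENING with its stale version.
            node.setData("RS_ZK_REGION_OPENING", seenByRegionServer);
        } catch (BadVersionException e) {
            System.out.println("refresh failed: " + e.getMessage());
        }
    }
}
```

In this sketch the second write is rejected and the node keeps the other writer's state, which would be consistent with the follow-up warning that the node "was in the state M_ZK_REGION_OFFLINE".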
Then I stopped the master again, and this time it went well.
Then, just to test with the default balancer, I ran the balancer again and
again, about every 3 minutes to give it some breathing room between runs, and
I again got a region stuck in transition.
> [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to
> OFFLINE
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8912
> URL: https://issues.apache.org/jira/browse/HBASE-8912
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Priority: Critical
> Fix For: 0.94.16
>
> Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, HBase-0.94 #1036 test
> - testRetrying [Jenkins].html, log.txt,
> org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
>
>
> AM throws this exception which subsequently causes the master to abort:
> {code}
> java.lang.IllegalStateException: Unexpected state : testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b. state=PENDING_OPEN, ts=1372891751912, server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE.
>     at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
>     at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
> This exception trace is from the failing test TestMetaReaderEditor which is
> failing pretty frequently, but looking at the test code, I think this is not
> a test-only issue, but affects the main code path.
> https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/