[
https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859724#comment-13859724
]
Lars Hofhansl commented on HBASE-8912:
--------------------------------------
There layers of races here it seems.
With my "fix" not to abort the master a RegionServer can now get a new request
to open a region while is it attempting to transition the current attempt in
FAILE_OPEN in ZK. It then ignores that concurrent open request, and thus the
region will forever be stuck in PENDING_OPEN (unless the timeout manager kicks
in of course). The only reason that did not happen before was that the master
would just abort before it can reassign the region a second time.
When I fix that (by taking the region out of RITs before the znode is
transitioned) there is an NPE when trying to remove the HRI from the RITs
(which was hidden by missing the concurrent assign request before).
Terrible. I hope this is better in trunk after the AM rewrite. But since this
issue stems from backported code, I doubt it.
> [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to
> OFFLINE
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8912
> URL: https://issues.apache.org/jira/browse/HBASE-8912
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Priority: Critical
> Fix For: 0.94.16
>
> Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, HBase-0.94 #1036 test
> - testRetrying [Jenkins].html, log.txt,
> org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
>
>
> AM throws this exception which subsequently causes the master to abort:
> {code}
> java.lang.IllegalStateException: Unexpected state :
> testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b.
> state=PENDING_OPEN, ts=1372891751912,
> server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE.
> at
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
> at
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> This exception trace is from the failing test TestMetaReaderEditor which is
> failing pretty frequently, but looking at the test code, I think this is not
> a test-only issue, but affects the main code path.
> https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)