[
https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858510#comment-13858510
]
Lars Hofhansl commented on HBASE-8912:
--------------------------------------
Let me summarize what I found:
# When a region server attempts to open a region and fails, it takes the
respective znode to PENDING_OPEN followed by FAILED_OPEN in quick succession.
# The HMaster thus gets two notifications from ZK.
# If the znode transitioned to FAILED_OPEN before the HMaster could react to
PENDING_OPEN, it will have two concurrent threads that read FAILED_OPEN from the
znode. (Note that AssignmentManager.handleRegion spawns a new thread to run
the ClosedRegionHandler upon FAILED_OPEN.)
# Now HMaster tries to concurrently assign the same region twice.
Hence the task is to (a) allow only one outstanding operation per region, (b)
get out of the way in the 2nd attempt, or (c) not react twice to the same
version of the same znode.
a) and c) are too risky a change in 0.94, I think, although either would be
the cleaner avenue.
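To make (c) concrete, here is a minimal sketch, assuming the handler has the znode version at hand (for example from ZooKeeper's Stat.getVersion()). ZNodeVersionGuard and shouldHandle are hypothetical names for illustration only, not HBase code and not part of either patch, and the sketch uses Java 8+ syntax for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical guard: only the first handler of a given znode version proceeds. */
final class ZNodeVersionGuard {
    // Highest znode version already handled, keyed by region name.
    private final ConcurrentMap<String, Integer> lastHandled = new ConcurrentHashMap<>();

    /**
     * Returns true for the first caller that presents a version newer than any
     * version handled so far for this region. A caller that presents the same
     * (or an older) version gets false and should do nothing.
     */
    boolean shouldHandle(String regionName, int znodeVersion) {
        boolean[] won = {false}; // per-call flag, set inside the atomic compute
        lastHandled.compute(regionName, (name, previous) -> {
            if (previous == null || znodeVersion > previous) {
                won[0] = true;
                return znodeVersion;
            }
            return previous;
        });
        return won[0];
    }
}
```

With a guard like this, the two threads that both read FAILED_OPEN at the same znode version would race on shouldHandle, and only one of them would go on to run the assignment.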
b) is what both of my suggested patches do: let the first assignment attempt
continue and stop the 2nd.
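The idea behind (b) can be sketched roughly as follows. This is a simplified illustration and not the actual patch: SecondAttemptBackoff, its State enum, and setOfflineOrBackOff are hypothetical stand-ins for the master's region state bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the master's per-region state tracking. */
final class SecondAttemptBackoff {
    enum State { CLOSED, OFFLINE, PENDING_OPEN }

    private final Map<String, State> states = new HashMap<>();

    synchronized void setState(String region, State state) {
        states.put(region, state);
    }

    synchronized State getState(String region) {
        return states.get(region);
    }

    /**
     * Moves the region to OFFLINE if it is unknown, CLOSED or already OFFLINE.
     * If another attempt has taken it further (here: PENDING_OPEN), this
     * returns false and leaves the state untouched, so the caller stops
     * instead of throwing IllegalStateException and aborting.
     */
    synchronized boolean setOfflineOrBackOff(String region) {
        State current = states.get(region);
        if (current != null && current != State.CLOSED && current != State.OFFLINE) {
            return false; // first attempt is in flight; get out of its way
        }
        states.put(region, State.OFFLINE);
        return true;
    }
}
```

A caller would continue with the assignment only when setOfflineOrBackOff returns true. On false it neither aborts nor touches the znode, which leaves the first attempt free to finish.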
[~jmspaggi],
In RecoverableZooKeeper.setData(...) I see a specific check for BADVERSION. Are
you absolutely sure you're running a consistent version of HBase (and the
latest 0.94)?
This looks like a different (also bad) issue to me. Maybe masked by the master
dying earlier, but I doubt it.
[~stack],
bq. So, with this patch, is it the timeout monitor thread that effects the
repair and gets us going again? That would be better than a crash.
The scenario I found is that we tried to assign the same region twice
concurrently. Not aborting and not continuing with the assignment when we
detect this condition lets the first assignment go through.
Without a rewrite this is as good as we can do in 0.94. We can't just take the
node to OFFLINE, as that would interfere with the first assignment attempt
which is still in progress. As I said above, we should really only allow one
operation at a time for a given region, but that would be a larger rewrite of a
hairy piece of the code... Not for 0.94.
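For reference, "one operation at a time for a given region" could look something like the sketch below. This is only an illustration of the idea under simple assumptions (operations are plain Runnables keyed by region name). PerRegionGate and runExclusively are hypothetical names, nothing like this exists in 0.94, and the sketch uses Java 8+ syntax.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gate that admits at most one in-flight operation per region. */
final class PerRegionGate {
    // Names of regions that currently have an operation in flight.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /**
     * Runs op only if no other operation on the same region is in flight.
     * Returns true if op ran, false if the caller was turned away.
     */
    boolean runExclusively(String regionName, Runnable op) {
        if (!inFlight.add(regionName)) {
            return false; // someone else is already working on this region
        }
        try {
            op.run();
            return true;
        } finally {
            inFlight.remove(regionName); // always release, even if op throws
        }
    }
}
```

The second ClosedRegionHandler would then be rejected at the gate rather than discovering the conflict deep inside assign(). The hard part in the real code is routing every region operation through such a gate, which is why it would be a rewrite.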
I would like to commit the alt2 patch here and continue to investigate.
> [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to
> OFFLINE
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8912
> URL: https://issues.apache.org/jira/browse/HBASE-8912
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Assignee: Lars Hofhansl
> Priority: Critical
> Fix For: 0.94.16
>
> Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, HBase-0.94 #1036 test
> - testRetrying [Jenkins].html, log.txt,
> org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
>
>
> AM throws this exception which subsequently causes the master to abort:
> {code}
> java.lang.IllegalStateException: Unexpected state : testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b. state=PENDING_OPEN, ts=1372891751912, server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE.
>     at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
>     at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
> This exception trace is from the failing test TestMetaReaderEditor which is
> failing pretty frequently, but looking at the test code, I think this is not
> a test-only issue, but affects the main code path.
> https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/