[ 
https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858510#comment-13858510
 ] 

Lars Hofhansl commented on HBASE-8912:
--------------------------------------

Let me summarize what I found:
# When a region server attempts to open a region and fails, it takes the 
respective znode to PENDING_OPEN followed by FAILED_OPEN in quick succession.
# The HMaster thus gets two notifications from ZK.
# If the znode transitioned to FAILED_OPEN before the HMaster could react to 
PENDING_OPEN, it will have two concurrent threads that both read FAILED_OPEN 
from the znode. (Note that AssignmentManager.handleRegion spawns a new thread 
to run the ClosedRegionHandler upon FAILED_OPEN.)
# Now HMaster tries to concurrently assign the same region twice.
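
The sequence above can be illustrated with a toy model. This is only a sketch: FakeZnode, simulate and the "state@version" string format are hypothetical names made up for illustration, not HBase or ZooKeeper classes. It assumes that a notification only says "the data changed" and that the handler re-reads the znode when it runs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the race; none of these types are HBase or ZooKeeper classes. */
final class ZnodeRaceSketch {

    /** Hypothetical stand-in for a znode: a state string plus a version counter. */
    static final class FakeZnode {
        private String state = "OFFLINE";
        private int version = 0;

        synchronized void setState(String newState) {
            state = newState;
            version++;
        }

        synchronized String read() {
            return state + "@v" + version;
        }
    }

    /** Returns what each of the two notification handlers reads from the znode. */
    static List<String> simulate() {
        FakeZnode znode = new FakeZnode();
        Queue<String> pendingNotifications = new ArrayDeque<>();

        // Region server: the open fails, two transitions in quick succession.
        znode.setState("PENDING_OPEN");
        pendingNotifications.add("data changed");
        znode.setState("FAILED_OPEN");
        pendingNotifications.add("data changed");

        // Master: in this model the notification carries no data, so each
        // handler re-reads the znode and sees whatever is there by then.
        List<String> observed = new ArrayList<>();
        while (!pendingNotifications.isEmpty()) {
            pendingNotifications.remove();
            observed.add(znode.read());
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

Both handlers observe the same state at the same version, which is why each of them goes on to start an assignment of the region.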

Hence the task is to (a) allow only one outstanding operation per region, (b) 
get out of the way in the 2nd attempt, or (c) not react twice to the same 
version of the same znode.
a) and c) are too risky a change in 0.94, I think, but they would be the 
cleaner avenue.
b) is what both of my suggested patches do: let the first assignment attempt 
continue and stop the 2nd.
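
A minimal sketch of the idea behind (b), under stated assumptions: this is not the code from either attached patch, and the class, the State enum, and the trySetOffline and assign methods are all hypothetical. Region state lives in an in-memory map here instead of ZooKeeper. The point is only that the second attempt returns quietly instead of throwing IllegalStateException.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration of option (b); not the attached patch. */
final class SecondAttemptStandsDown {

    enum State { CLOSED, OFFLINE, PENDING_OPEN }

    private final Map<String, State> regionStates = new ConcurrentHashMap<>();

    SecondAttemptStandsDown(String region) {
        regionStates.put(region, State.CLOSED);
    }

    /**
     * Atomically moves the region from CLOSED to OFFLINE.
     * Returns false (rather than throwing) if the region is in any other
     * state, which in this model means another attempt got there first.
     */
    boolean trySetOffline(String region) {
        return regionStates.replace(region, State.CLOSED, State.OFFLINE);
    }

    /** Returns true if this caller performed the assignment, false if it stood down. */
    boolean assign(String region) {
        if (!trySetOffline(region)) {
            // Do not abort and do not continue: the first attempt is still in progress.
            return false;
        }
        regionStates.put(region, State.PENDING_OPEN);
        return true;
    }

    State stateOf(String region) {
        return regionStates.get(region);
    }
}
```

With one region in state CLOSED, the first call to assign returns true and leaves the region in PENDING_OPEN; a second call returns false and changes nothing.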

[~jmspaggi],
In RecoverableZooKeeper.setData(...) I see a specific check for BADVERSION. Are 
you absolutely sure you're running a consistent version of HBase (and the 
latest 0.94)?

This looks like a different (also bad) issue to me. Maybe masked by the master 
dying earlier, but I doubt it.

[~stack],
bq. So, with this patch, is it the timeout monitor thread that effects the 
repair and gets us going again? That would be better than a crash.
The scenario that I found is that we tried to assign the same region twice 
concurrently. Not aborting and not continuing with the assignment when we 
detect this condition lets the first assignment go through.

Without a rewrite this is as good as we can do in 0.94. We can't just take the 
node to OFFLINE, as that would interfere with the first assignment attempt 
which is still in progress. As I said above we should really only allow one 
operation at a time for a given region, but that would be a larger rewrite of a 
hairy piece of the code... Not for 0.94.
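
For illustration only, "one operation at a time for a given region" could look roughly like the sketch below. Everything in it is an assumption: PerRegionSerializer, runExclusively and the demo helper are hypothetical names, the locks are plain in-process monitors, and nothing here reflects how AssignmentManager is actually structured.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: serialize operations per region with one lock per region name. */
final class PerRegionSerializer {

    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();

    /** Runs the operation while holding the lock that belongs to this region. */
    void runExclusively(String region, Runnable operation) {
        Object lock = locks.computeIfAbsent(region, r -> new Object());
        synchronized (lock) {
            operation.run();
        }
    }

    /** Demo helper: highest number of operations seen running at once on one region. */
    static int maxConcurrentOperations(int threads) {
        PerRegionSerializer serializer = new PerRegionSerializer();
        AtomicInteger running = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        Runnable operation = () -> {
            max.accumulateAndGet(running.incrementAndGet(), Math::max);
            try {
                Thread.sleep(10); // pretend to do some assignment work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            running.decrementAndGet();
        };
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> serializer.runExclusively("region-1", operation));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return max.get();
    }
}
```

Note that a queued second operation here still runs once the first one finishes, so it would need to re-check the region's state before acting. The sketch also never removes lock entries, which a real implementation would have to handle.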

I would like to commit the alt2 patch here and continue to investigate.


> [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to 
> OFFLINE
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8912
>                 URL: https://issues.apache.org/jira/browse/HBASE-8912
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Lars Hofhansl
>            Priority: Critical
>             Fix For: 0.94.16
>
>         Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, HBase-0.94 #1036 test 
> - testRetrying [Jenkins].html, log.txt, 
> org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
>
>
> AM throws this exception which subsequently causes the master to abort: 
> {code}
> java.lang.IllegalStateException: Unexpected state : 
> testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b. 
> state=PENDING_OPEN, ts=1372891751912, 
> server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE.
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
>       at 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>       at java.lang.Thread.run(Thread.java:662)
> {code}
> This exception trace is from the failing test TestMetaReaderEditor which is 
> failing pretty frequently, but looking at the test code, I think this is not 
> a test-only issue, but affects the main code path. 
> https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
