[jira] [Resolved] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Jean-Daniel Cryans (JIRA) Wed, 29 Jun 2011 15:29:56 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jean-Daniel Cryans resolved HBASE-3984.
---------------------------------------

      Resolution: Fixed
    Release Note: 
In trunk:
All HRegionInterface methods will now throw a RegionServerStoppedException if 
the region server is stopped, whereas we used to check this for only a few 
methods.
SingleServerBulkAssigner will not kill the Master anymore when getting IOEs; 
instead it will just log an error and the TimeoutMonitor will take care of 
picking up the pieces.

In 0.90:
Only a couple of checkOpen calls were added in order to change as little code 
as possible while still fixing the issue.
    Hadoop Flags: [Reviewed]
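
The guard described in the release note can be sketched roughly as follows. 
This is a minimal, self-contained illustration and not the committed patch: 
SketchRegionServer, its stopped flag, the exception message, and the local 
RegionServerStoppedException stand-in are all assumptions made for the example. 
Only the idea of calling checkOpen at the top of each RPC-facing method comes 
from the note above.

```java
import java.io.IOException;

// Hypothetical stand-in for the exception named in the release note.
class RegionServerStoppedException extends IOException {
    RegionServerStoppedException(String msg) {
        super(msg);
    }
}

// Hypothetical, heavily simplified region server (not the real HBase class).
class SketchRegionServer {
    private volatile boolean stopped = false;

    void stop() {
        stopped = true;
    }

    // Guard invoked first by every RPC-facing method in this sketch.
    private void checkOpen() throws IOException {
        if (stopped) {
            throw new RegionServerStoppedException("server is stopped or stopping");
        }
    }

    // Without the checkOpen() call, a stopping server would still answer here.
    String getRegionInfo(String regionName) throws IOException {
        checkOpen();
        return "info:" + regionName;
    }
}
```

With the guard in place, a caller probing a stopping server gets an exception 
instead of a normal-looking answer.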

Committed the 0.90 patch to branch and the other patch to trunk, including the 
fix that Ted pointed to. Thanks guys for the reviews.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster 
> recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, 
> HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of 
> "java.io.IOException: Server not running, 
> aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region 
> servers weren't able to talk to the new .META. location because the old one 
> was still alive but on its way down after an OOME.
> It translates into exceptions like "Server not running" coming from trying to 
> edit .META. and digging in the code I see that 
> CT.waitForMetaServerConnectionDefault -> waitForMeta -> 
> getMetaServerConnection(true) calls verifyRegionLocation since we force the 
> refresh. In this method we check if the RS is good by calling getRegionInfo 
> which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover from a .META.-serving RS 
> failure until that RS has fully shut down, since every time an RS tries to 
> open a region (like right after the log splitting) or split one, it fails 
> editing .META.
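
The failure mode in the description above can be illustrated with a small 
sketch. It is not the CatalogTracker code: ProbeTarget, LocationVerifier, and 
the guarded flag are hypothetical names chosen for the example. It only shows 
why a liveness probe that ignores the "stopping" state keeps reporting a dying 
server as a valid location.

```java
import java.io.IOException;

// Hypothetical server handle; 'guarded' toggles whether the probe method
// checks the stopping state before answering.
class ProbeTarget {
    private final boolean guarded;
    private volatile boolean stopping = false;

    ProbeTarget(boolean guarded) {
        this.guarded = guarded;
    }

    void beginStop() {
        stopping = true;
    }

    String getRegionInfo(String region) throws IOException {
        if (guarded && stopping) {
            throw new IOException("server is stopping");
        }
        return region;
    }
}

// Hypothetical verifier: a location counts as good if the probe call succeeds.
class LocationVerifier {
    static boolean verify(ProbeTarget server, String region) {
        try {
            return server.getRegionInfo(region) != null;
        } catch (IOException e) {
            return false; // caller should look up a fresh location
        }
    }
}
```

In this sketch, an unguarded server that has begun stopping still passes 
verification, so callers keep using the stale location; a guarded one fails 
verification and lets them move on.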

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
