[jira] [Updated] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Jean-Daniel Cryans (JIRA) Tue, 28 Jun 2011 15:29:52 -0700

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jean-Daniel Cryans updated HBASE-3984:
--------------------------------------

    Attachment: HBASE-3984-trunk-v2.patch

New patch for trunk that introduces a new exception for when a RS is asked to 
shut down. It had a few ripple effects, since some methods were accessed in the 
unit tests even after a RS was asked to shut down (that's why I needed to change 
the way we wait on RSs to die in one test).
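
For illustration only, here is a minimal sketch of the idea behind such an 
exception. It is not taken from the patch: ServerStoppingException, 
ToyRegionServer and checkOpen are hypothetical names, and the real HBase 
classes and signatures may differ.

```java
import java.io.IOException;

/** Hypothetical exception, named here for illustration only. */
class ServerStoppingException extends IOException {
    ServerStoppingException(String message) {
        super(message);
    }
}

/** Toy stand-in for a region server; this is not the HBase class. */
class ToyRegionServer {
    private volatile boolean stopping = false;

    /** Marks the server as asked to shut down. */
    void requestStop() {
        stopping = true;
    }

    /** Guard meant to run at the top of every RPC-style method. */
    private void checkOpen() throws ServerStoppingException {
        if (stopping) {
            throw new ServerStoppingException("Server asked to shut down");
        }
    }

    /** Answers only while the server has not been asked to stop. */
    String getRegionInfo(String regionName) throws IOException {
        checkOpen();
        return "info:" + regionName;
    }
}
```

With a guard like this, a caller that touches the server after the stop 
request gets an exception, which would explain why tests that kept calling 
into a stopping RS needed adjusting.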

I'm also modifying the behavior of SingleServerBulkAssigner: it will no longer 
kill the master when it gets an exception trying to talk to a RS (unless 
it's interrupted). Instead it will just log the problem and the RIT will get 
timed out by the TimeoutMonitor. It currently works for 
TestSplitTransactionOnCluster.
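
A rough sketch of that log-and-continue behavior, under stated assumptions: 
RegionOpener, ToyBulkAssigner, leftInTransition and masterAborted are 
hypothetical names used only for this example, and the interrupt handling 
shown is one possible reading of "unless it's interrupted", not the patch 
itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical RPC surface of a region server. */
interface RegionOpener {
    void openRegion(String region) throws IOException;
}

/** Toy assigner; names are illustrative, not the HBase class. */
class ToyBulkAssigner {
    /** Regions whose open request failed and that stay in transition. */
    final List<String> leftInTransition = new ArrayList<>();
    /** Stands in for aborting the master. */
    boolean masterAborted = false;

    void assign(RegionOpener server, List<String> regions) {
        for (String region : regions) {
            if (Thread.currentThread().isInterrupted()) {
                // Assumption: interruption is still treated as fatal.
                masterAborted = true;
                return;
            }
            try {
                server.openRegion(region);
            } catch (IOException e) {
                // Log and move on; a timeout monitor is expected to retry later.
                System.err.println("Failed assigning " + region + ": " + e.getMessage());
                leftInTransition.add(region);
            }
        }
    }
}
```

The design choice is that a single unreachable RS only delays the affected 
regions until the timeout fires, rather than taking the master down.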

Most of the unit tests pass, but with so many others already failing it's hard 
to tell whether any of the failures are caused by this patch.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster 
> recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, 
> HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of 
> "java.io.IOException: Server not running, 
> aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region 
> servers weren't able to talk to the new .META. location because the old one 
> was still alive but on its way down after an OOME.
> It translates into exceptions like "Server not running" coming from trying to 
> edit .META. and digging in the code I see that 
> CT.waitForMetaServerConnectionDefault -> waitForMeta -> 
> getMetaServerConnection(true) calls verifyRegionLocation since we force the 
> refresh. In this method we check if the RS is good by calling getRegionInfo 
> which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover from a .META.-serving RS 
> failure until that RS has fully shut down, since every time a RS tries to 
> open a region (like right after the log splitting) or split, it fails 
> editing .META.
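
The failure mode described in the issue can be reproduced with a toy model. 
This is only a sketch: DyingServer, LocationVerifier and editMeta are 
hypothetical stand-ins, and only the names getRegionInfo and 
verifyRegionLocation come from the description above.

```java
/** Toy server that is still alive but already shutting down. */
class DyingServer {
    final boolean stopping = true;

    /** Mirrors the described problem: answers without looking at 'stopping'. */
    String getRegionInfo(String region) {
        return "info:" + region;
    }

    /** Edits are rejected once the server is stopping. */
    void editMeta(String row) {
        if (stopping) {
            throw new IllegalStateException("Server not running");
        }
    }
}

class LocationVerifier {
    /** Returns true if the server answers getRegionInfo at all. */
    static boolean verifyRegionLocation(DyingServer server, String region) {
        try {
            return server.getRegionInfo(region) != null;
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

In this model verifyRegionLocation keeps reporting the old location as good 
while every editMeta call fails, which matches the "Server not running" 
symptom seen during recovery.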

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
