[
https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans updated HBASE-3984:
--------------------------------------
Attachment: HBASE-3984-trunk-v2.patch
New patch for trunk that introduces a new exception for when a RS is asked to
shut down. It had a few ripple effects, since some methods were still being
called in the unit tests after a RS was asked to shut down (that's why I needed
to change the way we wait on RSs to die in one test).
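To illustrate the idea, here is a minimal, self-contained sketch, not the patch itself. The names ServerStoppingException, FakeRegionServer, and checkOpen are hypothetical stand-ins chosen for this example; only getRegionInfo and verifyRegionLocation echo names from the issue, and their signatures here are simplified assumptions. It shows how a guard that throws once a stop has been requested lets a location check reject a server that is on its way down.

```java
import java.io.IOException;

public class StoppingServerSketch {

    /** Hypothetical exception type; the name used by the patch is not stated here. */
    static class ServerStoppingException extends IOException {
        ServerStoppingException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for a region server's RPC surface. */
    static class FakeRegionServer {
        private volatile boolean stopRequested = false;

        void requestStop() {
            stopRequested = true;
        }

        /** Guard meant to run at the top of each RPC method. */
        private void checkOpen() throws ServerStoppingException {
            if (stopRequested) {
                throw new ServerStoppingException("Server was asked to shut down");
            }
        }

        String getRegionInfo(String regionName) throws IOException {
            checkOpen(); // without this guard, a dying server still looks healthy
            return "info:" + regionName;
        }
    }

    /** Returns true only if the server answers and is not shutting down. */
    static boolean verifyRegionLocation(FakeRegionServer server, String regionName) {
        try {
            return server.getRegionInfo(regionName) != null;
        } catch (IOException e) {
            return false; // a stopping server is treated as an invalid location
        }
    }

    public static void main(String[] args) {
        FakeRegionServer rs = new FakeRegionServer();
        System.out.println(verifyRegionLocation(rs, ".META.")); // true
        rs.requestStop();
        System.out.println(verifyRegionLocation(rs, ".META.")); // false
    }
}
```

Using a dedicated exception subtype, rather than a generic IOException with a message, lets callers tell "server is stopping" apart from other I/O failures if they need to.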
I'm also modifying the behavior of SingleServerBulkAssigner: it will no longer
kill the master when it gets an exception trying to talk to a RS (unless it's
interrupted). Instead, it will just log the problem, and the RIT will get
timed out by the TimeoutMonitor. It currently works for
TestSplitTransactionOnCluster.
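The change in error handling can be sketched roughly as follows. This is an assumption-laden illustration, not the actual SingleServerBulkAssigner code: RegionOpener, Outcome, assign, and the in-memory LOG list are all hypothetical names invented for the example. An I/O failure is only recorded, leaving the region for a timeout monitor to retry, while an interrupt is still treated as fatal.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BulkAssignSketch {

    /** Hypothetical RPC interface to a region server. */
    interface RegionOpener {
        void openRegion(String region) throws IOException, InterruptedException;
    }

    /** Hypothetical result of one assignment attempt. */
    enum Outcome { ASSIGNED, LEFT_FOR_TIMEOUT, ABORT_MASTER }

    /** In-memory stand-in for a real logger. */
    static final List<String> LOG = new ArrayList<>();

    static Outcome assign(RegionOpener server, String region) {
        try {
            server.openRegion(region);
            return Outcome.ASSIGNED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Outcome.ABORT_MASTER;        // an interrupt is still fatal
        } catch (IOException e) {
            // Log only; the region stays in transition until a monitor times it out.
            LOG.add("Failed assigning " + region + ": " + e.getMessage());
            return Outcome.LEFT_FOR_TIMEOUT;
        }
    }

    public static void main(String[] args) {
        System.out.println(assign(r -> { }, "region-a"));
        System.out.println(assign(r -> { throw new IOException("Server not running"); }, "region-b"));
        System.out.println(assign(r -> { throw new InterruptedException(); }, "region-c"));
        Thread.interrupted(); // clear the flag set by the interrupted case
        System.out.println(LOG);
    }
}
```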
Most of the unit tests pass, but with all the others that are already failing
it's hard to tell whether any of the failures are caused by this patch.
> CT.verifyRegionLocation isn't doing a very good check, can delay cluster
> recovery
> ---------------------------------------------------------------------------------
>
> Key: HBASE-3984
> URL: https://issues.apache.org/jira/browse/HBASE-3984
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.90.4
>
> Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch,
> HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of
> "java.io.IOException: Server not running,
> aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region
> servers weren't able to talk to the new .META. location because the old one
> was still alive but on its way down after an OOME.
> This translates into exceptions like "Server not running" when trying to
> edit .META. Digging in the code, I see that
> CT.waitForMetaServerConnectionDefault -> waitForMeta ->
> getMetaServerConnection(true) calls verifyRegionLocation since we force the
> refresh. In this method we check if the RS is good by calling getRegionInfo,
> which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover from a .META.-serving RS
> failure until that RS has fully shut down, since every time a RS tries to open
> a region (like right after the log splitting) or split one, it fails editing
> .META.