[
https://issues.apache.org/jira/browse/HBASE-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092941#comment-13092941
]
ramkrishna.s.vasudevan commented on HBASE-4245:
-----------------------------------------------
I followed the steps and tried to reproduce the scenario described in this issue on trunk.
If RS01, which is hosting ROOT, goes down, the master finds that ROOT is no longer
available and will try to assign ROOT to some other RS.
{code}
connection = waitForRootServerConnection(timeout);
{code}
At this point the RS01 address is obtained.
Next, we check whether this connection is still valid:
{code}
return (connection == null) ? false :
    verifyRegionLocation(connection,
        this.rootRegionTracker.getRootRegionLocation(),
        HRegionInfo.ROOT_REGIONINFO.getRegionName());
{code}
When we try to connect to RS01, we get a failure:
{noformat}
Failed verification of -ROOT-,,0 at address=linux146,60020,1314633552492;
java.net.ConnectException: Connection refused
{noformat}
So the master will assign ROOT to a new RS.
Then, once the timeout detects that META has not been opened yet, the master will
try to assign META as well.
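
To make the sequence above easier to follow, here is a toy model of it in Java. This is only a sketch and not HBase code: the class RootRecoverySketch, its methods, and the "pick the first live server" policy are all hypothetical and only mirror the order of events I observed (verification of the old ROOT location fails, ROOT is reassigned, then META is reassigned).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy model of the observed flow; none of these names are HBase APIs. */
class RootRecoverySketch {
    private final Set<String> liveServers = new HashSet<>();
    private String rootLocation;   // server currently recorded as hosting ROOT
    private String metaLocation;   // null while META is unassigned
    final List<String> log = new ArrayList<>();

    RootRecoverySketch(List<String> servers, String rootHost) {
        liveServers.addAll(servers);
        rootLocation = rootHost;
    }

    /** Simulates a region server going down. */
    void crash(String server) {
        liveServers.remove(server);
    }

    /** Stand-in for the verification step: true only if the recorded host is still alive. */
    boolean verifyRootLocation() {
        boolean ok = rootLocation != null && liveServers.contains(rootLocation);
        if (!ok) {
            log.add("Failed verification of ROOT at " + rootLocation);
        }
        return ok;
    }

    /** Assumed policy for this sketch: choose the alphabetically first live server. */
    private String pickLiveServer() {
        return liveServers.stream().sorted().findFirst()
            .orElseThrow(() -> new IllegalStateException("no live region servers"));
    }

    /** Reassigns ROOT if verification fails, then META if it is not on a live server. */
    void recoverCatalog() {
        if (!verifyRootLocation()) {
            rootLocation = pickLiveServer();
            log.add("ROOT reassigned to " + rootLocation);
        }
        // Models the timeout noticing that META was never opened.
        if (metaLocation == null || !liveServers.contains(metaLocation)) {
            metaLocation = pickLiveServer();
            log.add("META reassigned to " + metaLocation);
        }
    }

    String rootLocation() { return rootLocation; }
    String metaLocation() { return metaLocation; }
}
```

With servers rs01, rs02 and rs03, ROOT on rs01, and rs01 crashed, calling recoverCatalog() in this model moves both ROOT and META to a live server. The real master of course does this through ZooKeeper and the assignment code, which the sketch does not attempt to reproduce.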
These are my observations on trunk.
I may be missing something. Kindly correct me if I am wrong or have missed a step.
I am interested in this JIRA because the TimeoutMonitor gets involved here to identify
that META was not assigned.
> Cluster hangs if RS serving root fails during startup sequence
> --------------------------------------------------------------
>
> Key: HBASE-4245
> URL: https://issues.apache.org/jira/browse/HBASE-4245
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.3
> Reporter: Todd Lipcon
> Assignee: ramkrishna.s.vasudevan
>
> On a large-ish cluster, the following sequence of events was seen to happen:
> - master started, ROOT and META were both unassigned
> - ROOT is assigned to rs01
> - META is assigned to rs02
> - Upon open of META, it writes its location into ROOT on rs01
> - rs01 crashes while appending to its HLog due to some other bug
> - rs02 fails the region open sequence
> - master notices that rs01 has crashed, and enqueues a ServerShutdownHandler
> - ServerShutdownHandler blocks on CatalogTracker.waitForMeta() since ROOT and
> META are not assigned yet
> - master times out assignment of META, but never succeeds because ROOT
> location is still marked as rs01
> This causes the cluster to never start up.