[
https://issues.apache.org/jira/browse/HBASE-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104495#comment-13104495
]
ramkrishna.s.vasudevan commented on HBASE-4400:
-----------------------------------------------
The root cause of the problem here is in HMaster:
{code}
// Make sure root and meta assigned before proceeding.
assignRootAndMeta();
// Is this fresh start with no regions assigned or are we a master joining
// an already-running cluster? If regionsCount == 0, then for sure a
// fresh start. TOOD: Be fancier. If regionsCount == 2, perhaps the
// 2 are .META. and -ROOT- and we should fall into the fresh startup
// branch below. For now, do processFailover.
if (regionCount == 0) {
  LOG.info("Master startup proceeding: cluster startup");
  this.assignmentManager.cleanoutUnassigned();
  this.assignmentManager.assignAllUserRegions();
} else {
  LOG.info("Master startup proceeding: master failover");
  this.assignmentManager.processFailover();
}
{code}
Assigning root and meta is done first, and only then is processFailover()
called, which is where we deal with the dead servers and online servers. So
when the master sees META in RIT, the znode is in the OPENED state, and we are
not able to bring META out of transition, even through the timeout monitor.
Correct me if my analysis is wrong.
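To make the sequence concrete, here is a minimal, self-contained sketch of the
behaviour as I understand it. It is not HBase code: the class MetaStuckSketch
and its methods (processOpenedZnode, onTimeout, isInTransition) are
hypothetical names, and the model keeps only two things from the real code,
the regions-in-transition map and the check against the online servers. The
server names and the region name are taken from the logs in the description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the scenario; not the real AssignmentManager.
class MetaStuckSketch {
    enum State { OPEN }

    // Region name -> state, for regions in transition (RIT).
    private final Map<String, State> regionsInTransition = new HashMap<>();

    // Models processing a znode found in RS_ZK_REGION_OPENED at master startup.
    // Returns true if the open was registered and the region left RIT.
    boolean processOpenedZnode(String region, String server, Set<String> onlineServers) {
        regionsInTransition.put(region, State.OPEN);
        if (!onlineServers.contains(server)) {
            // Corresponds to "Failed to find ... in list of online servers;
            // skipping registration of open": the region stays in RIT.
            return false;
        }
        regionsInTransition.remove(region);
        return true;
    }

    // Models the timeout monitor. Returns true only if the region is not
    // stuck in RIT; for a region still in state OPEN it can do nothing.
    boolean onTimeout(String region) {
        if (regionsInTransition.get(region) == State.OPEN) {
            // "we don't know where region was opened so can't do anything"
            return false;
        }
        return true;
    }

    boolean isInTransition(String region) {
        return regionsInTransition.containsKey(region);
    }

    public static void main(String[] args) {
        MetaStuckSketch am = new MetaStuckSketch();
        Set<String> online = new HashSet<>();
        online.add("linux76,60020,1315998828523"); // only RS1 has checked in

        String meta = ".META.,,1.1028785192";
        // The znode names the old RS2 instance, which is not online.
        boolean registered =
            am.processOpenedZnode(meta, "linux146,60020,1315998414623", online);

        System.out.println("registered=" + registered);
        System.out.println("resolvedByTimeout=" + am.onTimeout(meta));
        System.out.println("stillInTransition=" + am.isInTransition(meta));
    }
}
```

In this model the region enters RIT as OPEN before the server check, and
neither the startup path nor the timeout path ever removes it, which is the
stuck state seen in the logs.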
> .META. getting stuck if RS hosting it is dead and znode state is in
> RS_ZK_REGION_OPENED
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-4400
> URL: https://issues.apache.org/jira/browse/HBASE-4400
> Project: HBase
> Issue Type: Bug
> Reporter: ramkrishna.s.vasudevan
> Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.0, 0.90.5
>
>
> Start 2 RS.
> .META. is being hosted by RS2, but while processing it goes down.
> Now restart the master and RS1. The master gets the RS name from the znode,
> which is in RS_ZK_REGION_OPENED. But as RS2 is still not online, the master
> is not able to process META at all. Please find the logs below:
> {noformat}
> 2011-09-14 16:43:51,949 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_OPENING, server=linux76,60020,1315998828523,
> region=70236052/-ROOT-
> 2011-09-14 16:43:51,968 INFO org.apache.hadoop.hbase.master.HMaster: -ROOT-
> assigned=1, rit=false, location=linux76:60020
> 2011-09-14 16:43:51,970 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Processing region
> .META.,,1.1028785192 in state RS_ZK_REGION_OPENED
> 2011-09-14 16:43:51,970 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Failed to find
> linux146,60020,1315998414623 in list of online servers; skipping registration
> of open of .META.,,1.1028785192
> 2011-09-14 16:43:51,971 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Waiting on 1028785192/.META.
> 2011-09-14 16:43:51,983 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_OPENED, server=linux76,60020,1315998828523,
> region=70236052/-ROOT-
> 2011-09-14 16:43:51,986 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED
> event for 70236052; deleting unassigned node
> 2011-09-14 16:43:51,986 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x13267854032001d Deleting existing unassigned node for 70236052
> that is in expected state RS_ZK_REGION_OPENED
> 2011-09-14 16:43:51,998 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x13267854032001d Successfully deleted unassigned node for
> region 70236052 in expected state RS_ZK_REGION_OPENED
> 2011-09-14 16:43:51,999 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
> -ROOT-,,0.70236052 on linux76,60020,1315998828523
> 2011-09-14 16:44:00,945 INFO org.apache.hadoop.hbase.master.ServerManager:
> Registering server=linux146,60020,1315998839724, regionCount=0, userLoad=false
> 2011-09-14 16:46:20,003 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: .META.,,1.1028785192 state=OPEN, ts=0
> 2011-09-14 16:46:20,004 ERROR
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for
> too long, we don't know where region was opened so can't do anything
> {noformat}
> {code}
> regionsInTransition.put(encodedRegionName, new RegionState(
>     regionInfo, RegionState.State.OPEN, data.getStamp()));
> ................
> } else {
>   HServerInfo hsi = this.serverManager.getServerInfo(sn);
>   if (hsi == null) {
>     LOG.info("Failed to find " + sn +
>       " in list of online servers; skipping registration of open of " +
>       regionInfo.getRegionNameAsString());
>   } else {
>     new OpenedRegionHandler(master, this, regionInfo, hsi).process();
>   }
> }
> {code}
> So the timeout monitor is not able to do anything here:
> {code}
> LOG.error("Region has been OPEN for too long, " +
>   "we don't know where region was opened so can't do anything");
> synchronized(regionState) {
>   regionState.update(regionState.getState());
> }
> {code}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira