[
https://issues.apache.org/jira/browse/HBASE-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106674#comment-13106674
]
ramkrishna.s.vasudevan commented on HBASE-4400:
-----------------------------------------------
I am attaching the patch. The test cases are still running. The patch includes
a new test case; run without the fix, it reproduces the problem in both trunk
and 0.90.x.
The problem is very clear in 0.90.x, but trunk behaves slightly differently
when the master finds a region in transition in the OPENED state.
In trunk:
{code}
} else if (isOnDeadServer(regionInfo, deadServers) &&
!serverManager.isServerOnline(sn)) {
// If was on a dead server, then its not open any more; needs
// handling.
forceOffline(regionInfo, data);
} else {
new OpenedRegionHandler(master, this, regionInfo, sn).process();
}
{code}
Here the dead server list is not yet populated while the META region is being
processed, so the condition above is false and the else branch executes. The
catalog tracker is notified that the META region is on the dead server
(although no META region is actually open there). The IPC call to this dead
server cannot be established, so assignmentManager.assignMeta() gets called and
tries to assign the region. However, the callback does not fire when the META
node then transitions from OFFLINE to OPENING and OPENED. (Why does the
callback not fire? Possibly because the MetaNodeTracker has already executed
nodeDeleted(); I am not sure.)
In 0.90.x, as I mentioned in my previous comments:
{code}
forceOffline(regionInfo, data);
} else {
HServerInfo hsi = this.serverManager.getServerInfo(sn);
if (hsi == null) {
LOG.info("Failed to find " + sn +
" in list of online servers; skipping registration of open of " +
regionInfo.getRegionNameAsString());
} else {
new OpenedRegionHandler(master, this, regionInfo, hsi).process();
}
{code}
The additional null check makes things worse: the RIT processing cannot handle
the META region in OPENED state, and the system hangs forever.
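The 0.90.x path described above can be sketched as a small stand-alone model.
This is only an illustration, not HBase code: the class OpenedRitModel, the
Outcome enum and the boolean parameters are hypothetical, and it assumes that
serverManager.getServerInfo(sn) returns null exactly when the server is not
online.

```java
/** Hypothetical model of the 0.90.x branch quoted above; not HBase code. */
public class OpenedRitModel {

    public enum Outcome { FORCED_OFFLINE, OPEN_REGISTERED, SKIPPED }

    /**
     * @param onDeadServer true if the region's server is in the dead server list
     * @param serverOnline true if the server is registered with the master
     */
    public static Outcome process(boolean onDeadServer, boolean serverOnline) {
        if (onDeadServer && !serverOnline) {
            // Corresponds to forceOffline(regionInfo, data): region is reassigned.
            return Outcome.FORCED_OFFLINE;
        }
        // Assumption: stands in for getServerInfo(sn) == null.
        if (!serverOnline) {
            // Only a log line is written; nothing handles the region afterwards.
            return Outcome.SKIPPED;
        }
        // Corresponds to running OpenedRegionHandler.
        return Outcome.OPEN_REGISTERED;
    }

    public static void main(String[] args) {
        // Scenario from this issue: RS2 is not online and the dead server
        // list has not been populated yet.
        System.out.println(process(false, false));
    }
}
```

In this model the scenario from the issue ends in SKIPPED, which matches the
"skipping registration of open" log line and leaves the region in transition.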
So for both versions my idea is to change the condition from
{code}
- } else if (isOnDeadServer(regionInfo, deadServers) &&
- !serverManager.isServerOnline(sn)) {
{code}
to
{code}
+ } else if (!serverManager.isServerOnline(sn)
+ && (isOnDeadServer(regionInfo, deadServers)
+ || regionInfo.isMetaRegion() || regionInfo.isRootRegion())) {
{code}
With this change, in both versions the META node can be forced to OFFLINE and a
fresh assignment can be done.
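To make the difference between the two conditions concrete, here is a minimal
sketch that evaluates both as plain boolean functions. The class
ForceOfflineCheck and its parameter names are hypothetical stand-ins for
isOnDeadServer(...), serverManager.isServerOnline(sn),
regionInfo.isMetaRegion() and regionInfo.isRootRegion(); it is not the patch
itself.

```java
/** Hypothetical comparison of the old and proposed conditions; not HBase code. */
public class ForceOfflineCheck {

    /** Condition before the patch. */
    public static boolean oldCondition(boolean onDeadServer, boolean serverOnline) {
        return onDeadServer && !serverOnline;
    }

    /** Condition proposed in the patch. */
    public static boolean newCondition(boolean onDeadServer, boolean serverOnline,
                                       boolean isMeta, boolean isRoot) {
        return !serverOnline && (onDeadServer || isMeta || isRoot);
    }

    public static void main(String[] args) {
        // META region, server not online, dead server list not yet populated.
        System.out.println("old: " + oldCondition(false, false));
        System.out.println("new: " + newCondition(false, false, true, false));
        // A user region in the same situation is not affected by the change.
        System.out.println("user region: " + newCondition(false, false, false, false));
    }
}
```

Under these assumptions the old condition is false for the scenario in this
issue while the new one is true, and a region whose server is online is never
forced offline by either condition.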
The test cases are still running. I would like to know whether this approach
looks fine.
> .META. getting stuck if RS hosting it is dead and znode state is in
> RS_ZK_REGION_OPENED
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-4400
> URL: https://issues.apache.org/jira/browse/HBASE-4400
> Project: HBase
> Issue Type: Bug
> Reporter: ramkrishna.s.vasudevan
> Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4400_trunk.patch
>
>
> Start 2 RS.
> The .META. is being hosted by RS2, but RS2 goes down while processing.
> Now restart the master and RS1. The master gets the RS name from the znode,
> which is in RS_ZK_REGION_OPENED. But because RS2 is still not online, the
> master is not able to process the META at all. The logs follow:
> {noformat}
> 2011-09-14 16:43:51,949 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_OPENING, server=linux76,60020,1315998828523,
> region=70236052/-ROOT-
> 2011-09-14 16:43:51,968 INFO org.apache.hadoop.hbase.master.HMaster: -ROOT-
> assigned=1, rit=false, location=linux76:60020
> 2011-09-14 16:43:51,970 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Processing region
> .META.,,1.1028785192 in state RS_ZK_REGION_OPENED
> 2011-09-14 16:43:51,970 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Failed to find
> linux146,60020,1315998414623 in list of online servers; skipping registration
> of open of .META.,,1.1028785192
> 2011-09-14 16:43:51,971 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Waiting on 1028785192/.META.
> 2011-09-14 16:43:51,983 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_OPENED, server=linux76,60020,1315998828523,
> region=70236052/-ROOT-
> 2011-09-14 16:43:51,986 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED
> event for 70236052; deleting unassigned node
> 2011-09-14 16:43:51,986 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x13267854032001d Deleting existing unassigned node for 70236052
> that is in expected state RS_ZK_REGION_OPENED
> 2011-09-14 16:43:51,998 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x13267854032001d Successfully deleted unassigned node for
> region 70236052 in expected state RS_ZK_REGION_OPENED
> 2011-09-14 16:43:51,999 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
> -ROOT-,,0.70236052 on linux76,60020,1315998828523
> 2011-09-14 16:44:00,945 INFO org.apache.hadoop.hbase.master.ServerManager:
> Registering server=linux146,60020,1315998839724, regionCount=0, userLoad=false
> 2011-09-14 16:46:20,003 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: .META.,,1.1028785192 state=OPEN, ts=0
> 2011-09-14 16:46:20,004 ERROR
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for
> too long, we don't know where region was opened so can't do anything
> {noformat}
> {code}
> regionsInTransition.put(encodedRegionName, new RegionState(
> regionInfo, RegionState.State.OPEN, data.getStamp()));
> ................
> } else {
> HServerInfo hsi = this.serverManager.getServerInfo(sn);
> if (hsi == null) {
> LOG.info("Failed to find " + sn +
> " in list of online servers; skipping registration of open of "
> +
> regionInfo.getRegionNameAsString());
> } else {
> new OpenedRegionHandler(master, this, regionInfo, hsi).process();
> }
> }
> {code}
> So the timeout monitor is not able to do anything here:
> {code}
> LOG.error("Region has been OPEN for too long, " +
> "we don't know where region was opened so can't do anything");
> synchronized(regionState) {
> regionState.update(regionState.getState());
> }
> {code}