[ https://issues.apache.org/jira/browse/HBASE-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058874#comment-13058874 ]
stack commented on HBASE-4033:
------------------------------
Jieshan: Patch looks great. This piece looks a little odd though:
{code}
+      } else if (!this.assignmentManager.isServerOnline(serverName)) {
+        LOG.debug("The server is not in online servers, then close the region, "
+            + "ServerName=" + serverName + ", region=" +
+            this.regionInfo.getEncodedName());
+        assignmentManager.unassign(regionInfo);
{code}
If we got an open region from a server that is not in the list of online
servers, it has likely crashed? So calling unassign doesn't seem right. If you
look at unassign, the first thing it does is check whether the region is
currently assigned; it'll probably just return early here:
{code}
    synchronized (this.regions) {
      // Check if this region is currently assigned
      if (!regions.containsKey(region)) {
        LOG.debug("Attempted to unassign region " +
          region.getRegionNameAsString() + " but it is not " +
          "currently assigned anywhere");
        return;
      }
    }
{code}
Do you think we should be calling unassign here inside OpenRegionHandler?
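For illustration, a tiny self-contained model of the early return quoted above (the class and map names are made up for the sketch and are not the actual AssignmentManager code): unassigning a region that is not tracked as assigned just logs and returns, so for a crashed server the patch's branch would end up doing nothing.
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not HBase code. Models why calling unassign for
// a region whose hosting server has crashed is a no-op: the region is not in
// the regions map, so the method returns before doing any work.
public class UnassignEarlyReturnSketch {
  // region encoded name -> server name recorded as hosting it
  static final Map<String, String> regions = new HashMap<String, String>();

  static void unassign(String encodedRegionName) {
    synchronized (regions) {
      if (!regions.containsKey(encodedRegionName)) {
        System.out.println("Attempted to unassign region " + encodedRegionName +
            " but it is not currently assigned anywhere");
        return; // early return: no close, no reassignment
      }
    }
    System.out.println("would proceed to close and reassign " + encodedRegionName);
  }

  public static void main(String[] args) {
    // The server crashed before the open was recorded, so the map is empty
    // and unassign does nothing useful.
    unassign("612342de1fe4733f72299d70addb6d11");
  }
}
{code}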
> The shutdown RegionServer could be added to AssignmentManager.servers again
> ---------------------------------------------------------------------------
>
> Key: HBASE-4033
> URL: https://issues.apache.org/jira/browse/HBASE-4033
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.3
> Reporter: Jieshan Bean
> Fix For: 0.90.4
>
> Attachments: A_hbase-root-master-167-6-1-11.rar,
> HBASE-4033-90-V1.patch, HBASE-4033-trunk-V1.patch, analysis.gif,
> test-report.txt
>
>
> The following steps can easily reproduce the problem:
> 1. There are thousands of regions in the cluster.
> 2. Stop the cluster.
> 3. Start the cluster. Kill one regionserver while the regions are opening,
> then restart it after 10 seconds.
> The shut-down regionserver will appear in the AssignmentManager.servers list
> again.
> For example:
> Issue 1:
> 2011-06-23 14:14:30,775 DEBUG org.apache.hadoop.hbase.master.LoadBalancer:
> Server information: 167-6-1-12,20020,1308803390123=2220,
> 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2205,
> 167-6-1-13,20020,1308803514394=2183
> Two regionservers (one of which had aborted) had the same hostname but
> different startcodes:
> 167-6-1-13,20020,1308803391742=2374
> 167-6-1-13,20020,1308803514394=2183
> Issue 2:
> (1). The RS 167-6-1-11,20020,1308105402003 finished shutdown at "10:46:37,774":
> 10:46:37,774 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished
> processing of shutdown of 167-6-1-11,20020,1308105402003
> (2). Overwriting happened; it seems the RS still existed in the set of
> AssignmentManager#regions:
> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager:
> Overwriting 612342de1fe4733f72299d70addb6d11 on
> serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0,
> usedHeap=0, maxHeap=0)
> (3). The region was assigned to this dead RS again at "10:50:20,671":
> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
> Assigning region
> Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to
> 167-6-1-11,20020,1308105402003
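A minimal, self-contained model of the sequence above (class, method, and field names are illustrative assumptions, not the actual AssignmentManager source): once shutdown processing has dropped the server entry but the region-to-server mapping is still present, a late region-open report both triggers the "Overwriting" warning and re-adds the dead server to the servers map.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only -- not HBase code. Models how a stale open report
// can put an already-dead regionserver back into an AssignmentManager-style
// "servers" map.
public class DeadServerReappearsSketch {
  // region encoded name -> server recorded as hosting it (analogue of "regions")
  static final Map<String, String> regions = new HashMap<String, String>();
  // server name -> regions it hosts (analogue of "servers")
  static final Map<String, List<String>> servers = new HashMap<String, List<String>>();

  static void regionOnline(String region, String serverName) {
    String old = regions.get(region);
    if (old != null) {
      System.out.println("Overwriting " + region + " on serverName=" + old);
    }
    regions.put(region, serverName);
    // Re-creates the server entry even if shutdown handling already removed it.
    List<String> hosted = servers.get(serverName);
    if (hosted == null) {
      hosted = new ArrayList<String>();
      servers.put(serverName, hosted);
    }
    hosted.add(region);
  }

  public static void main(String[] args) {
    String deadServer = "167-6-1-11,20020,1308105402003";
    String region = "612342de1fe4733f72299d70addb6d11";
    regionOnline(region, deadServer);     // region opened before the server died
    servers.remove(deadServer);           // shutdown handling drops the server entry...
    // ...but the region mapping is still there, and a stale open report for the
    // same server is processed afterwards:
    regionOnline(region, deadServer);     // prints "Overwriting ..." and re-adds the server
    System.out.println(servers.keySet()); // [167-6-1-11,20020,1308105402003]
  }
}
{code}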