Devaraj Das created HBASE-13330:
-----------------------------------

             Summary: Region left unassigned due to AM & SSH each thinking the 
assignment would be done by the other
                 Key: HBASE-13330
                 URL: https://issues.apache.org/jira/browse/HBASE-13330
             Project: HBase
          Issue Type: Bug
          Components: master, Region Assignment
            Reporter: Devaraj Das
             Fix For: 1.1.0


Here is what I found while analyzing an issue. I am raising this JIRA now, and 
a fix will follow.
The TL;DR is that the AssignmentManager (AM) assumes the 
ServerShutdownHandler (SSH) will assign the region, while the 
ServerShutdownHandler assumes the AssignmentManager will assign it. As a 
result, the region (0d6cf37c18c54c6f4744750c6a7be837) never gets assigned. 
Below is an analysis from the logs that captures the flow of events.

1. The AssignmentManager had initially assigned this region to 
dnj1-bcpc-r3n8.example.com,60020,1425598187703
2. When the master restarted, it scanned meta to learn about the regions in 
the cluster. The meta record showed this region assigned to 
dnj1-bcpc-r3n8.example.com,60020,1425598187703.
3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was 
no longer alive, so the AssignmentManager queued a ServerShutdownHandling 
task for it (the task executes asynchronously):
{noformat}
2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers,
 submitted shutdown handler to be executed meta=false
{noformat}

4. The AssignmentManager proceeded to read the RIT nodes from ZK. It found this 
region as well:
{noformat}
2015-03-06 14:09:31,527 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Processing 0d6cf37c18c54c6f4744750c6a7be837
in state: RS_ZK_REGION_FAILED_OPEN
{noformat}

5. The region was moved to CLOSED state:
{noformat}
2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates: 
0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on
dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected 
dnj1-bcpc-r3n8.example.com,60020,1425598187703
{noformat}
Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This 
means that the region had been assigned to 
dnj1-bcpc-r3n2.example.com,60020,1425603618259, but that regionserver could 
not open the region for some reason and set the state to 
RS_ZK_REGION_FAILED_OPEN in the RIT znode in ZK.

6. After that, the AssignmentManager tried to assign the region again. 
However, the assignment did not happen because the ServerShutdownHandling 
task queued earlier had not yet executed:
{noformat}
2015-03-06 14:09:31,527 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Skip assigning 
phMonthlyVersion,\x89\x80\x00\x00,1423149098980.0d6cf37c18c54c6f4744750c6a7be837.,
 it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not 
processed yet
{noformat}

7. Eventually the ServerShutdownHandling task executed.
{noformat}
2015-03-06 14:09:35,188 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment.
2015-03-06 14:09:35,209 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19 
region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was
 carrying (and 0 regions(s) that were opening on this server)
2015-03-06 14:09:35,211 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703
{noformat}

8. However, the ServerShutdownHandling task skipped the region in question. 
This was because the region was in RIT, and the ServerShutdownHandling task 
assumed that the AssignmentManager would assign it as part of handling the 
RIT nodes:
{noformat}
2015-03-06 14:09:35,210 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning 
region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837
state=CLOSED, ts=1425668971527, 
server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
{noformat}

9. Later, when the server dnj1-bcpc-r3n2.example.com,60020,1425603618259 
died, the ServerShutdownHandling task for it was queued (from the log 
hbase-hbase-master-dnj1-bcpc-r3n1.log):
{noformat}
2015-03-09 11:35:10,607 INFO 
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral 
node deleted,
processing expiration [dnj1-bcpc-r3n2.example.com,60020,1425603618259]
{noformat}

10. In RegionStates.java:serverOffline, there is a check on the region's 
state. Since the region was still in CLOSED state, the following warning was 
logged:
{noformat}
2015-03-09 11:35:15,711 WARN org.apache.hadoop.hbase.master.RegionStates: THIS 
SHOULD NOT HAPPEN: unexpected {0d6cf37c18c54c6f4744750c6a7be837 state=CLOSED, 
ts=1425668971527, server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
{noformat}
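
The two skip decisions in steps 6 and 8 can be modeled with a small, self-contained sketch. This is not HBase code: the class, method, and field names below are hypothetical, the server names "r3n8" and "r3n2" abbreviate the hostnames in the logs, and the two conditions are simplified paraphrases of what the "Skip assigning" log messages suggest. The sketch only illustrates how each side can defer to the other so that neither assigns the region.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the skip decisions in steps 6 and 8. Not HBase code. */
public class MutualDeferralSketch {

    enum State { OFFLINE, CLOSED, OPEN }

    /** Simplified stand-in for the master's in-memory view of one region. */
    static final class Region {
        final String hostInMeta;      // server recorded in meta (r3n8 in the report)
        String transitionServer;      // server named in the RIT znode (r3n2 in the report)
        State state = State.OFFLINE;
        boolean inTransition;
        boolean assigned;

        Region(String hostInMeta) {
            this.hostInMeta = hostInMeta;
        }
    }

    /** Servers known to be dead whose shutdown handling has not run yet. */
    private final Set<String> deadNotProcessed = new HashSet<>();

    /** Step 3: the dead server is recorded and its shutdown handling is deferred. */
    void markDead(String server) {
        deadNotProcessed.add(server);
    }

    /** Steps 4 to 6: the AM handles the RIT znode, then tries to assign the region. */
    boolean amProcessRit(Region r, String ritServer) {
        r.state = State.CLOSED;
        r.transitionServer = ritServer;
        r.inTransition = true;
        if (deadNotProcessed.contains(r.hostInMeta)) {
            return false; // skipped: AM expects the pending shutdown handling to assign it
        }
        assign(r);
        return true;
    }

    /** Steps 7 and 8: the shutdown handling for deadServer finally runs. */
    boolean sshProcess(Region r, String deadServer) {
        deadNotProcessed.remove(deadServer);
        if (r.inTransition && !deadServer.equals(r.transitionServer)) {
            return false; // skipped: SSH expects the AM's RIT handling to assign it
        }
        assign(r);
        return true;
    }

    private static void assign(Region r) {
        r.state = State.OPEN;
        r.inTransition = false;
        r.assigned = true;
    }

    /** Replays the order of events from the report and returns the final region view. */
    static Region replay() {
        MutualDeferralSketch master = new MutualDeferralSketch();
        Region region = new Region("r3n8");
        master.markDead("r3n8");              // step 3
        master.amProcessRit(region, "r3n2");  // steps 4 to 6: AM skips
        master.sshProcess(region, "r3n8");    // steps 7 and 8: SSH skips
        return region;
    }

    public static void main(String[] args) {
        Region r = replay();
        System.out.println("state=" + r.state + ", assigned=" + r.assigned);
    }
}
```

Running main prints state=CLOSED, assigned=false: in this model the region stays CLOSED and in transition after both code paths have run, which matches the state reported in the step 10 warning.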



