[jira] [Updated] (HBASE-13330) Region left unassigned due to AM & SSH each thinking the assignment would be done by the other

Sean Busbey (JIRA) Thu, 02 Jul 2015 10:01:16 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean Busbey updated HBASE-13330:
--------------------------------
    Fix Version/s:     (was: 1.2.0)
                   1.2.1
           Status: Patch Available  (was: Open)

> Region left unassigned due to AM & SSH each thinking the assignment would be 
> done by the other
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13330
>                 URL: https://issues.apache.org/jira/browse/HBASE-13330
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 1.0.0
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 2.0.0, 1.0.2, 1.1.2, 1.3.0, 1.2.1
>
>         Attachments: 13330-branch-1.txt
>
>
> Here is what I found while analyzing an issue. I am raising this jira now; a 
> fix will follow.
> TL;DR: the AssignmentManager assumes the ServerShutdownHandler will assign 
> the region, and the ServerShutdownHandler assumes the AssignmentManager will 
> assign it. As a result, the region (0d6cf37c18c54c6f4744750c6a7be837) never 
> gets assigned. Below is an analysis from the logs that captures the flow of 
> events.
> 1. The AssignmentManager had initially assigned this region to 
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> 2. When the master restarted, it scanned meta to learn about the regions in 
> the cluster. According to the meta record, this region was assigned to 
> dnj1-bcpc-r3n8.example.com,60020,1425598187703.
> 3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was 
> no longer alive. So, the AssignmentManager queued up a 
> ServerShutdownHandling task for it (which executes asynchronously):
> {noformat}
> 2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers,
>  submitted shutdown handler to be executed meta=false
> {noformat}
> 4. The AssignmentManager proceeded to read the RIT nodes from ZK. It found 
> this region as well:
> {noformat}
> 2015-03-06 14:09:31,527 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Processing 
> 0d6cf37c18c54c6f4744750c6a7be837
> in state: RS_ZK_REGION_FAILED_OPEN
> {noformat}
> 5. The region was moved to CLOSED state:
> {noformat}
> 2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates: 
> 0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected 
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This 
> means that the region had been assigned to 
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, but that regionserver could 
> not open the region for some reason, so it changed the state to 
> RS_ZK_REGION_FAILED_OPEN in the RIT znode in ZK.
> 6. After that, the AssignmentManager tried to assign it again. However, the 
> assignment didn't happen because the ServerShutdownHandling task queued 
> earlier had not yet executed:
> {noformat}
> 2015-03-06 14:09:31,527 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning 
> phMonthlyVersion,\x89\x80\x00\x00,1423149098980.0d6cf37c18c54c6f4744750c6a7be837.,
>  it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not 
> processed yet
> {noformat}
> 7. Eventually the ServerShutdownHandling task executed.
> {noformat}
> 2015-03-06 14:09:35,188 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
> for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment.
> 2015-03-06 14:09:35,209 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19 
> region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was
>  carrying (and 0 regions(s) that were opening on this server)
> 2015-03-06 14:09:35,211 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
> processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> 8. However, the ServerShutdownHandling task skipped the region in question. 
> This was because the region was in RIT, and the ServerShutdownHandling task 
> assumes that the AssignmentManager will assign it as part of handling the 
> RIT nodes:
> {noformat}
> 2015-03-06 14:09:35,210 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning 
> region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527, 
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
> 9. At some point in the future, when the server 
> dnj1-bcpc-r3n2.example.com,60020,1425603618259 dies, the 
> ServerShutdownHandling for it gets queued up (from the log 
> hbase-hbase-master-dnj1-bcpc-r3n1.log):
> {noformat}
> 2015-03-09 11:35:10,607 INFO 
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral 
> node deleted,
> processing expiration [dnj1-bcpc-r3n2.example.com,60020,1425603618259]
> {noformat}
> 10. RegionStates.java:serverOffline checks the state of each region on the 
> dead server. Since this region is in CLOSED state, the following warning is 
> logged:
> {noformat}
> 2015-03-09 11:35:15,711 WARN org.apache.hadoop.hbase.master.RegionStates: 
> THIS SHOULD NOT HAPPEN: unexpected {0d6cf37c18c54c6f4744750c6a7be837 
> state=CLOSED, ts=1425668971527, 
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
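
The mutual skip in steps 6 and 8 can be illustrated with a toy model. This is a
minimal sketch, not HBase code and not the attached patch: the class
AssignmentRaceSketch, the Region holder, and the methods amTryAssign and
sshTryAssign are hypothetical names, and the short server names "r3n8" and
"r3n2" stand in for the full server names from the logs. It only reproduces
the two skip conditions as described above.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the two skip decisions from the analysis; not HBase code. */
class AssignmentRaceSketch {

    /** Hypothetical stand-in for the per-region state the master tracks. */
    static final class Region {
        final String name;
        final String hostInMeta;  // server recorded in meta (step 2)
        final String ritServer;   // server recorded in the RIT state, or null
        boolean assigned = false;

        Region(String name, String hostInMeta, String ritServer) {
            this.name = name;
            this.hostInMeta = hostInMeta;
            this.ritServer = ritServer;
        }
    }

    /** Step 6: AM skips if the meta host is dead but not yet processed. */
    static boolean amTryAssign(Region r, Set<String> deadNotProcessed) {
        if (deadNotProcessed.contains(r.hostInMeta)) {
            return false; // AM expects the shutdown handler to assign it
        }
        r.assigned = true;
        return true;
    }

    /** Step 8: SSH skips if the region is in transition on another server. */
    static boolean sshTryAssign(Region r, String deadServer) {
        if (r.ritServer != null && !r.ritServer.equals(deadServer)) {
            return false; // SSH expects the AM to assign it via RIT handling
        }
        r.assigned = true;
        return true;
    }

    /** Replays steps 3 through 8 and reports whether the region got assigned. */
    static boolean simulate() {
        String deadHost = "r3n8";        // host from meta, already dead
        String failedOpenHost = "r3n2";  // host that reported FAILED_OPEN
        Region region = new Region(
            "0d6cf37c18c54c6f4744750c6a7be837", deadHost, failedOpenHost);

        Set<String> deadNotProcessed = new HashSet<>();
        deadNotProcessed.add(deadHost);          // step 3: SSH task queued

        amTryAssign(region, deadNotProcessed);   // step 6: skipped by AM
        sshTryAssign(region, deadHost);          // step 8: skipped by SSH
        deadNotProcessed.remove(deadHost);       // step 7: SSH task finished

        return region.assigned;
    }

    public static void main(String[] args) {
        System.out.println("assigned = " + simulate()); // prints "assigned = false"
    }
}
```

In this model neither component retries after the other has finished, so the
region stays unassigned, which matches the outcome seen in the logs.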



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
