[ https://issues.apache.org/jira/browse/HBASE-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Busbey updated HBASE-13330:
--------------------------------
    Fix Version/s:     (was: 1.2.0)
                       1.2.1
           Status: Patch Available  (was: Open)

> Region left unassigned due to AM & SSH each thinking the assignment would be
> done by the other
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13330
>                 URL: https://issues.apache.org/jira/browse/HBASE-13330
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 1.0.0
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 2.0.0, 1.0.2, 1.1.2, 1.3.0, 1.2.1
>
>         Attachments: 13330-branch-1.txt
>
>
> Here is what I found during analysis of an issue. Raising this jira and a fix
> will follow.
> The TL;DR of this is that the AssignmentManager thinks the
> ServerShutdownHandler will assign the region and the ServerShutdownHandler
> thinks that the AssignmentManager will assign the region. The region
> (0d6cf37c18c54c6f4744750c6a7be837) ultimately never gets assigned. Below is
> an analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> 2. When the master restarted it did a scan of the meta to learn about the
> regions in the cluster. It found this region recorded as assigned to
> dnj1-bcpc-r3n8.example.com,60020,1425598187703 in the meta record.
> 3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was
> not alive anymore. So, the AssignmentManager queued up a
> ServerShutdownHandling task for it (which executes asynchronously):
> {noformat}
> 2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager:
> Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers,
> submitted shutdown handler to be executed meta=false
> {noformat}
> 4. The AssignmentManager proceeded to read the RIT nodes from ZK. It found
> this region as well:
> {noformat}
> 2015-03-06 14:09:31,527 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Processing
> 0d6cf37c18c54c6f4744750c6a7be837
> in state: RS_ZK_REGION_FAILED_OPEN
> {noformat}
> 5. The region was moved to CLOSED state:
> {noformat}
> 2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates:
> 0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This
> means that the region was assigned to
> dnj1-bcpc-r3n2.example.com,60020,1425603618259 but that regionserver couldn't
> open the region for some reason, and it changed the state to
> RS_ZK_REGION_FAILED_OPEN in the RIT znode on ZK.
> 6. After that the AssignmentManager tried to assign it again. However, the
> assignment didn't happen because the ServerShutdownHandling task queued
> earlier had not yet executed:
> {noformat}
> 2015-03-06 14:09:31,527 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning
> phMonthlyVersion,\x89\x80\x00\x00,1423149098980.0d6cf37c18c54c6f4744750c6a7be837.,
> it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not
> processed yet
> {noformat}
> 7. Eventually the ServerShutdownHandling task executed.
> {noformat}
> 2015-03-06 14:09:35,188 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
> for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment.
> 2015-03-06 14:09:35,209 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19
> region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was
> carrying (and 0 regions(s) that were opening on this server)
> 2015-03-06 14:09:35,211 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished
> processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> 8. However, the ServerShutdownHandling task skipped the region in question.
> This was because this region was in RIT, and the ServerShutdownHandling task
> assumed that the AssignmentManager would assign it as part of handling the RIT
> nodes:
> {noformat}
> 2015-03-06 14:09:35,210 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning
> region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527,
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
> 9. At some point in the future, when the server
> dnj1-bcpc-r3n2.example.com,60020,1425603618259 dies, the
> ServerShutdownHandling for it gets queued up (from the log
> hbase-hbase-master-dnj1-bcpc-r3n1.log):
> {noformat}
> 2015-03-09 11:35:10,607 INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral
> node deleted,
> processing expiration [dnj1-bcpc-r3n2.example.com,60020,1425603618259]
> {noformat}
> 10. In RegionStates.java:serverOffline, there is a check on the region's
> state. Since the region is in CLOSED state, the following warning is logged:
> {noformat}
> 2015-03-09 11:35:15,711 WARN org.apache.hadoop.hbase.master.RegionStates:
> THIS SHOULD NOT HAPPEN: unexpected {0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527,
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
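
The mutual deferral described in steps 5 through 8 of the analysis can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: the `Region` class, the functions `am_process_rit` and `ssh_process`, the state strings, and the short server labels `r3n8` and `r3n2` are all hypothetical names invented for illustration. The two skip conditions are paraphrased from the log messages quoted above ("dead but not processed yet" and "region in transition on other server").

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Region:
    name: str
    meta_server: str                   # server recorded for the region in meta
    rit_server: Optional[str] = None   # server recorded in the RIT state, if any
    state: str = "FAILED_OPEN"
    assigned_to: Optional[str] = None


def am_process_rit(region: Region, dead_unprocessed: Set[str]) -> str:
    """Hypothetical AssignmentManager side: handle a FAILED_OPEN region in transition."""
    region.state = "CLOSED"  # step 5: the region is moved to CLOSED
    if region.meta_server in dead_unprocessed:
        # Step 6: the host from meta is dead but its shutdown handling has not
        # run yet, so the assignment is left to the shutdown handler.
        return "AM skipped: host is dead but not processed yet"
    region.state = "OPEN"
    region.assigned_to = "live-server"
    return "AM assigned"


def ssh_process(dead_server: str, region: Region, dead_unprocessed: Set[str]) -> str:
    """Hypothetical ServerShutdownHandler side: reassign regions of a dead server."""
    dead_unprocessed.discard(dead_server)  # the dead server is now processed
    if region.meta_server != dead_server:
        return "SSH ignored: region not carried by this server"
    if region.rit_server is not None and region.rit_server != dead_server:
        # Step 8: the region is in transition on another server, so the
        # assignment is left to the AssignmentManager.
        return "SSH skipped: region in transition on other server"
    region.state = "OPEN"
    region.assigned_to = "live-server"
    return "SSH assigned"


def simulate():
    """Replay the order of events from the analysis: AM first, then SSH."""
    region = Region(
        name="0d6cf37c18c54c6f4744750c6a7be837",
        meta_server="r3n8",  # stands in for the dead server found in meta
        rit_server="r3n2",   # stands in for the server that failed to open it
    )
    dead_unprocessed = {"r3n8"}
    am_result = am_process_rit(region, dead_unprocessed)
    ssh_result = ssh_process("r3n8", region, dead_unprocessed)
    return region, am_result, ssh_result


if __name__ == "__main__":
    region, am_result, ssh_result = simulate()
    print(am_result)
    print(ssh_result)
    print(region.state, region.assigned_to)
```

In this model each side declines to assign because it expects the other to do so, and the region is left in CLOSED state with no server, which matches the outcome reported in the description. The model does not attempt to show how the actual fix resolves the situation.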