[
https://issues.apache.org/jira/browse/HBASE-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Dimiduk updated HBASE-13330:
---------------------------------
Fix Version/s: 1.1.3 (was: 1.1.4)
Updating the fix version based on a release audit against what's committed.
> Region left unassigned due to AM & SSH each thinking the assignment would be
> done by the other
> ----------------------------------------------------------------------------------------------
>
> Key: HBASE-13330
> URL: https://issues.apache.org/jira/browse/HBASE-13330
> Project: HBase
> Issue Type: Bug
> Components: master, Region Assignment
> Affects Versions: 1.0.0
> Reporter: Devaraj Das
> Assignee: Devaraj Das
> Fix For: 1.2.0, 1.3.0, 1.0.3, 1.1.3, 0.98.16
>
> Attachments: 13330-branch-1.txt, 13330-v2-branch-1.txt,
> 13330-v3-branch-1.txt
>
>
> Here is what I found while analyzing an issue. I am raising this JIRA now,
> and a fix will follow.
> TL;DR: the AssignmentManager (AM) assumes that the ServerShutdownHandler
> (SSH) will assign the region, while the ServerShutdownHandler assumes that
> the AssignmentManager will. As a result, the region
> (0d6cf37c18c54c6f4744750c6a7be837) never gets assigned. Below is an
> analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> 2. When the master restarted, it scanned meta to learn about the regions in
> the cluster. The meta record showed this region as assigned to
> dnj1-bcpc-r3n8.example.com,60020,1425598187703.
> 3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was
> no longer alive, so the AssignmentManager queued up a ServerShutdownHandling
> task for it (which executes asynchronously):
> {noformat}
> 2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager:
> Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers,
> submitted shutdown handler to be executed meta=false
> {noformat}
> 4. The AssignmentManager proceeded to read the RIT nodes from ZK. It found
> this region as well:
> {noformat}
> 2015-03-06 14:09:31,527 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Processing
> 0d6cf37c18c54c6f4744750c6a7be837
> in state: RS_ZK_REGION_FAILED_OPEN
> {noformat}
> 5. The region was moved to CLOSED state:
> {noformat}
> 2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates:
> 0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This
> means that the region had been assigned to
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, but that regionserver could
> not open it for some reason and set the state to RS_ZK_REGION_FAILED_OPEN in
> the RIT znode on ZK.
> 6. After that, the AssignmentManager tried to assign it again. However, the
> assignment didn't happen because the ServerShutdownHandling task queued
> earlier had not yet executed:
> {noformat}
> 2015-03-06 14:09:31,527 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning
> phMonthlyVersion,\x89\x80\x00\x00,1423149098980.0d6cf37c18c54c6f4744750c6a7be837.,
> it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not
> processed yet
> {noformat}
> 7. Eventually the ServerShutdownHandling task executed.
> {noformat}
> 2015-03-06 14:09:35,188 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
> for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment.
> 2015-03-06 14:09:35,209 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19
> region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was
> carrying (and 0 regions(s) that were opening on this server)
> 2015-03-06 14:09:35,211 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished
> processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> 8. However, the ServerShutdownHandling task skipped the region in question.
> This was because the region was in RIT, and the ServerShutdownHandling task
> assumed that the AssignmentManager would assign it as part of handling the
> RIT nodes:
> {noformat}
> 2015-03-06 14:09:35,210 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning
> region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527,
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
> 9. Later, when the server dnj1-bcpc-r3n2.example.com,60020,1425603618259
> died, the ServerShutdownHandling for it was queued up (from the log
> hbase-hbase-master-dnj1-bcpc-r3n1.log):
> {noformat}
> 2015-03-09 11:35:10,607 INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral
> node deleted,
> processing expiration [dnj1-bcpc-r3n2.example.com,60020,1425603618259]
> {noformat}
> 10. In RegionStates.java:serverOffline, there is a check on the region's
> state. Since the region is in CLOSED state, the following warning is logged:
> {noformat}
> 2015-03-09 11:35:15,711 WARN org.apache.hadoop.hbase.master.RegionStates:
> THIS SHOULD NOT HAPPEN: unexpected {0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527,
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
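
The mutual deferral in steps 3 through 8 can be replayed with a small toy model. This is only a sketch for illustration, not HBase code: the class and method names (`ToyMaster`, `am_assign`, `ssh_process`), the event strings, and the `LIVE` target server are all hypothetical. Only the two server names and the region's encoded name come from the logs above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Server names come from the log excerpts; everything else is hypothetical.
R3N8 = "dnj1-bcpc-r3n8.example.com,60020,1425598187703"
R3N2 = "dnj1-bcpc-r3n2.example.com,60020,1425603618259"
LIVE = "some-live-server"  # hypothetical assignment target


@dataclass
class Region:
    name: str
    meta_server: str                  # host recorded in meta (step 2)
    rit_server: str                   # server named in the RIT state (step 5)
    state: str = "CLOSED"
    assigned_to: Optional[str] = None


@dataclass
class ToyMaster:
    dead_unprocessed: Set[str] = field(default_factory=set)
    events: List[str] = field(default_factory=list)

    def expire(self, server: str) -> None:
        """Step 3: mark the server dead; its shutdown task runs later."""
        self.dead_unprocessed.add(server)

    def am_assign(self, region: Region, target: str) -> bool:
        """Step 6: the AM defers while the region's host awaits processing."""
        if region.meta_server in self.dead_unprocessed:
            self.events.append("AM: skip, host is dead but not processed yet")
            return False
        region.state, region.assigned_to = "OPEN", target
        self.events.append("AM: assigned")
        return True

    def ssh_process(self, server: str, regions: List[Region], target: str) -> None:
        """Steps 7-8: the shutdown task defers for regions in RIT elsewhere."""
        for region in regions:
            if region.meta_server != server:
                continue  # region was not carried by the dead server
            if region.rit_server != server:
                self.events.append("SSH: skip, region in transition on other server")
                continue
            region.state, region.assigned_to = "OPEN", target
            self.events.append("SSH: reassigned")
        self.dead_unprocessed.discard(server)


def replay() -> Tuple[ToyMaster, Region]:
    """Replay the order of events seen in the master log."""
    master = ToyMaster()
    region = Region("0d6cf37c18c54c6f4744750c6a7be837",
                    meta_server=R3N8, rit_server=R3N2)
    master.expire(R3N8)                        # step 3
    master.am_assign(region, LIVE)             # step 6: AM skips
    master.ssh_process(R3N8, [region], LIVE)   # steps 7-8: SSH skips
    return master, region
```

After `replay()`, `region.assigned_to` is still `None` and `master.events` holds one skip from each side, which matches the "never gets assigned" outcome described above. In this toy model only, a second `am_assign` call after the shutdown task finishes would succeed; nothing in the logged flow triggers such a retry.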