[ https://issues.apache.org/jira/browse/HBASE-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gavin updated HBASE-17963:
--------------------------
Comment: was deleted

(was: A comment with security level 'jira-users' was removed.)

> RegionServers lose file locality on unplanned restart
> -----------------------------------------------------
>
> Key: HBASE-17963
> URL: https://issues.apache.org/jira/browse/HBASE-17963
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.1.2
> Environment: Evident with HDP 2.4.3 running HBase 1.1.2
> Reporter: Bjorn Olsen
> Priority: Major
>
> When an HBase cluster crashes, HFile locality is lost.
> Crashes can happen for a variety of reasons, and in this event having a quick
> time to recover (both data and database performance) is critical.
> On cluster restore, region servers do not load their previous set of regions,
> which means all HFiles must be moved around until locality is achieved again.
> Performance is poor while file locality is not close to 100%.
> A major compaction must be run to move the regions around, which further
> impacts performance and will take longer the more data was in HBase at the
> time of the crash.
> There is a graceful_stop script which is useful for planned outages - you can
> first unload the regions from the region server, restart it, and then reload
> the regions to the same server. No HFiles need to be moved and file locality
> is quickly restored.
> However, with an unplanned outage, there is no locality kept of where the
> regions were. On a crash HBase randomly assigns regions to region servers and
> HFile locality is very low. We then need to move all the HFiles around until
> file locality is restored.
> This is fine for a small number of regions and small HFiles but becomes
> problematic when you have a large number of region servers or large files.
> This JIRA is a request to improve this behavior for unplanned outages by
> trying to restore the regions assigned per server, after a cluster restart.
> For example, HBase could keep a list of the region locality at regular
> intervals, and use this as an initial guideline when regions are restarted.
> Locality might still not be 100% immediately - but presumably better than 0%.
> It would be necessary to first disable the load balancer (if enabled) while
> this restore is happening and enable it afterward.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
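
The proposal in the description (record where regions lived at regular intervals, then use that record as a first guess after a restart) can be illustrated with a small sketch. This is not HBase code and does not use any HBase API. The function names plan_assignments and retained_fraction, the plain-dict snapshot format, and the round-robin fallback for regions whose previous server did not come back are all assumptions made for illustration only.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def plan_assignments(
    snapshot: Dict[str, str],
    regions: Iterable[str],
    live_servers: List[str],
) -> Dict[str, str]:
    """Return a region -> server plan that prefers the snapshotted server.

    `snapshot` is a hypothetical region -> server mapping saved before the
    crash. Regions whose previous server is not live (or that are missing
    from the snapshot) fall back to round-robin over the live servers.
    """
    if not live_servers:
        raise ValueError("no live region servers to assign to")
    live = set(live_servers)
    fallback = cycle(sorted(live))  # deterministic order for the leftovers
    plan = {}
    for region in sorted(regions):
        previous = snapshot.get(region)
        plan[region] = previous if previous in live else next(fallback)
    return plan


def retained_fraction(snapshot: Dict[str, str], plan: Dict[str, str]) -> float:
    """Fraction of planned regions that stay on their snapshotted server."""
    if not plan:
        return 0.0
    kept = sum(1 for region, server in plan.items()
               if snapshot.get(region) == server)
    return kept / len(plan)


if __name__ == "__main__":
    # Snapshot taken before the crash; rs3 does not come back afterward.
    before = {"r1": "rs1", "r2": "rs2", "r3": "rs3"}
    plan = plan_assignments(before, ["r1", "r2", "r3", "r4"], ["rs1", "rs2"])
    print(plan)
    print(retained_fraction(before, plan))
```

In this sketch the retained fraction is only a rough stand-in for locality: a region that returns to its previous server finds its HFiles already local, while a region placed by the fallback has to read remotely until its files are rewritten. As the description notes, the load balancer would have to stay disabled while such a plan is applied, otherwise it could move the regions again.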