Bjorn Olsen created HBASE-17963:
-----------------------------------

             Summary: RegionServers lose file locality on unplanned restart
                 Key: HBASE-17963
                 URL: https://issues.apache.org/jira/browse/HBASE-17963
             Project: HBase
          Issue Type: Bug
          Components: hbase
    Affects Versions: 1.1.2
         Environment: Evident with HDP 2.4.3 running HBase 1.1.2
            Reporter: Bjorn Olsen


When an HBase cluster crashes, HFile locality is lost. 

Crashes can happen for a variety of reasons, and when they do, recovering 
quickly (both the data and database performance) is critical. 

On cluster restore, region servers do not load their previous set of regions, 
so most HFiles are no longer local to the server hosting their region. 
Performance is poor while file locality is not close to 100%. 
A major compaction must be run to rewrite the HFiles onto their regions' new 
servers, which further impacts performance and takes longer the more data was 
in HBase at the time of the crash.

There is a graceful_stop script which is useful for planned outages - you can 
first unload the regions from the region server, restart it, and then reload 
the regions to the same server. No HFiles need to be moved and file locality is 
quickly restored.

However, with an unplanned outage, no record is kept of where the regions 
were. After a crash, HBase randomly assigns regions to region servers, so 
HFile locality is very low. We then need to move all the HFiles around until 
file locality is restored.

This is fine for a small number of regions and small HFiles, but becomes 
problematic when you have a large number of regions or large files.

This JIRA is a request to improve this behavior for unplanned outages by trying 
to restore each server's previous region assignments after a cluster restart. 

For example, HBase could record the region-to-server assignments at regular 
intervals, and use the latest record as an initial guideline when regions are 
reassigned after a restart. 
Locality might still not be 100% immediately - but presumably better than 0%. 
It would be necessary to first disable the load balancer (if enabled) while 
this restore is happening and enable it afterward.
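
As a rough illustration of the idea above (not existing HBase code), the 
restore step could look something like the sketch below. All names here 
(plan_assignments, snapshot, the region and server labels) are hypothetical, 
and the sketch assumes the last recorded assignments are available as a simple 
region -> server mapping. Regions go back to their recorded server when it is 
alive; anything else falls back to round-robin over the live servers.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def plan_assignments(
    snapshot: Dict[str, str],
    regions: Iterable[str],
    live_servers: List[str],
) -> Dict[str, str]:
    """Map each region to a server, preferring its last recorded server.

    Hypothetical helper for illustration only; not part of the HBase API.
    """
    if not live_servers:
        raise ValueError("no live region servers to assign to")
    live = set(live_servers)
    # Fallback for regions whose recorded server is gone or unknown:
    # plain round-robin over the live servers, in sorted order.
    fallback = cycle(sorted(live))
    plan = {}
    for region in sorted(regions):
        previous = snapshot.get(region)
        plan[region] = previous if previous in live else next(fallback)
    return plan


# Assumed example data: rs-c did not come back after the crash,
# and r4 was created after the last snapshot was taken.
snapshot = {"r1": "rs-a", "r2": "rs-b", "r3": "rs-c"}
plan = plan_assignments(snapshot, ["r1", "r2", "r3", "r4"], ["rs-a", "rs-b"])
print(plan)
```

Here r1 and r2 return to the servers that already hold their HFiles, so only 
r3 and r4 start with poor locality. The load balancer would need to stay off 
until a plan like this has been applied, as noted above.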


