Jeongdae Kim created HBASE-22538:
------------------------------------

             Summary: Prevent graceful_stop.sh from shutting down RS too early 
before finishing unloading regions
                 Key: HBASE-22538
                 URL: https://issues.apache.org/jira/browse/HBASE-22538
             Project: HBase
          Issue Type: Bug
          Components: shell
    Affects Versions: 1.4.9
            Reporter: Jeongdae Kim
            Assignee: Jeongdae Kim


We can stop or restart region servers gracefully using the graceful_stop.sh 
command. This command should guarantee that all regions are moved off a region 
server before it is shut down.

However, I sometimes saw many requests fail while restarting a region server 
with this command in our production clusters (v1.2.5). Affected clients got 
many RegionServerStoppedExceptions and exhausted their retry count.

I found that moving a region took only 0.03 sec, which is too fast. Moving 
(unloading) the regions on the region server had not finished, and the regions 
had not even been closed yet, when the region server got the shutdown signal.
Because a region server that was still serving regions (not yet closed) was 
stopped, clients got many exceptions (RegionServerStoppedException).

However, region_mover is supposed to wait until a region is served by another 
region server (that is, until meta has changed):
https://github.com/apache/hbase/blob/branch-1.2/bin/region_mover.rb#L153
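
To make the expected behaviour concrete, here is a minimal sketch of that kind 
of wait loop. It is not the code from region_mover.rb: wait_until_moved, its 
parameters, and the block that looks up the current location are all 
hypothetical names used for illustration. Server names are assumed to be 
"host,port,startcode" strings.

```ruby
# Hypothetical sketch, not the actual region_mover.rb logic.
# Polls the caller-supplied block (which stands in for a meta lookup) until
# the region is reported on a server other than original_server, or until
# the timeout expires. Returns true if the region moved, false on timeout.
def wait_until_moved(original_server, timeout_sec: 60, interval_sec: 0.5)
  deadline = Time.now + timeout_sec
  loop do
    current = yield
    # Compare in lower case so "RS1,..." and "rs1,..." count as the same server.
    return true if current && current.downcase != original_server.downcase
    return false if Time.now >= deadline
    sleep interval_sec
  end
end
```

With a loop like this, the shutdown step would only start once every region 
reports a new location, instead of after a few hundredths of a second.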

I figured out why this early shutdown happened:
a) Our clusters use upper-case hostnames.
b) The region server builds its ServerName with a lower-case hostname, and 
that name is sent to the master.
https://github.com/apache/hbase/blob/branch-1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L542
c) When meta is updated, the server name keeps its original case.
https://github.com/apache/hbase/blob/branch-1.2/hbase-client/src/main/java/org/apache/hadoop/hbase/MetaTableAccessor.java#L1527
d) region_mover.rb simply compares b) and c), so the comparison is always false.
https://github.com/apache/hbase/blob/branch-1.2/bin/region_mover.rb#L91
https://github.com/apache/hbase/blob/branch-1.2/bin/region_mover.rb#L52

I think region_mover should compare the server names from the master and from 
meta in the same case (lower case).
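
As a rough illustration of the idea (not the actual patch), the comparison 
could normalize the hostname part before comparing. The helper names and the 
sample server names below are made up for this example; the only assumption 
about HBase is that a server name has the form "host,port,startcode".

```ruby
# Hypothetical helpers for illustration; they are not part of region_mover.rb.

# Lower-cases only the hostname part of a "host,port,startcode" string.
def normalize_server_name(server_name)
  host, rest = server_name.split(',', 2)
  rest ? "#{host.downcase},#{rest}" : host.to_s.downcase
end

# True when both names refer to the same server, ignoring hostname case.
def same_server?(name_a, name_b)
  normalize_server_name(name_a) == normalize_server_name(name_b)
end

# Made-up sample values: the master reports a lower-case hostname,
# while meta keeps the original upper-case one.
from_master = 'rs1.example.com,16020,1559000000000'
from_meta   = 'RS1.EXAMPLE.COM,16020,1559000000000'

puts from_master == from_meta              # plain string comparison
puts same_server?(from_master, from_meta)  # case-normalized comparison
```

The plain comparison treats the two names as different servers, which is the 
situation described in d); the normalized comparison treats them as the same 
server.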

With the patch, I confirmed that region_mover waited until all regions had been 
moved and only then triggered the region server shutdown. (I also observed only 
RegionMovedException before the shutdown log, and no exceptions after the 
shutdown started.)




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
