Josh Elser created HBASE-18955:
----------------------------------

             Summary: HBase client queries stale hbase:meta location with 
half-dead RegionServer
                 Key: HBASE-18955
                 URL: https://issues.apache.org/jira/browse/HBASE-18955
             Project: HBase
          Issue Type: Bug
          Components: Client
    Affects Versions: 1.1.12
            Reporter: Josh Elser
            Assignee: Josh Elser
            Priority: Critical
             Fix For: 1.1.13


I have been investigating a case with [~tedyu] where, when a RegionServer 
becomes "hung" (the cause is not the point here), the client gets stuck trying 
to talk to this RegionServer and never exits. This was eventually tracked down 
to HBASE-15645. However, while testing that fix, I found an additional problem 
which only affects branch-1.1.

When the RegionServer in the "half-dead" state is also hosting meta, the HBase 
client (both the one trying to read data and the client in the Master trying to 
read meta in SSH, the ServerShutdownHandler) gets stuck repeatedly trying to 
read meta from the old location after meta has been reassigned.

The general test outline goes like this:

* Start at least 2 regionservers
* Load some data into a table ({{hbase pe}} is great)
* Find a region that is hosted by the same RS that is hosting meta
* {{kill -SIGSTOP}} that RS hosting the user region and meta
* Issue a {{get}} in the hbase-shell trying to read from that user region
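
To make the "half-dead" state concrete, here is a minimal, POSIX-only sketch in 
Python. It does not touch HBase at all: a small child process (a hypothetical 
stand-in for the RegionServer) echoes requests over a pipe, and the parent 
plays the client. While the child is STOP'ed, the request is accepted by the 
kernel but never answered, and the process still looks alive. The names 
{{ask}} and {{demo_half_dead}} are made up for this illustration.

```python
import select
import signal
import subprocess
import sys
import time

# Source of the child process: echo every request line back to the caller.
ECHO_SERVER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)


def wait_for_reply(child, timeout):
    """Return the next reply line, or None if nothing arrives within timeout."""
    ready, _, _ = select.select([child.stdout], [], [], timeout)
    return child.stdout.readline() if ready else None


def ask(child, timeout):
    """Send one request and wait up to `timeout` seconds for the reply."""
    child.stdin.write(b"ping\n")
    return wait_for_reply(child, timeout)


def demo_half_dead():
    child = subprocess.Popen(
        [sys.executable, "-u", "-c", ECHO_SERVER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        bufsize=0,  # unbuffered pipes, so select() sees replies immediately
    )
    try:
        healthy = ask(child, 5.0)           # normal request/response
        child.send_signal(signal.SIGSTOP)   # same effect as `kill -SIGSTOP <pid>`
        time.sleep(0.2)                     # give the stop time to take effect
        hung = ask(child, 1.0)              # write succeeds, reply never comes
        alive = child.poll() is None        # the process has not exited
        child.send_signal(signal.SIGCONT)   # same effect as `kill -SIGCONT <pid>`
        late = wait_for_reply(child, 5.0)   # the queued request is answered now
        return healthy, hung, alive, late
    finally:
        child.kill()
        child.wait()


if __name__ == "__main__":
    print(demo_half_dead())
```

The point of the sketch is that the stopped process produces a timeout rather 
than a connection error, which is the condition the test outline above creates 
for the RS hosting meta.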

The expectation is that the ZK lock will expire for the STOP'ed RS, meta will 
be reassigned, then the user regions will be reassigned, then the client will 
get the result of the get without seeing an error (as long as this happens 
within the {{hbase.client.operation.timeout}} value, of course).

We see this expected behavior on HBase 1.2.4 and 1.3.2-SNAPSHOT. On 
1.1.13-SNAPSHOT, however, the Master gets as far as reassigning meta but then 
gets stuck trying to read meta from the STOP'ed RS instead of from the RS it 
was reassigned to. Because of this, the regions stay in transition until the 
Master is restarted or the STOP'ed RS is CONT'ed. My best guess is that when 
the RS sees the {{SIGCONT}}, it immediately begins shutting down, which is 
enough to kick the client into refreshing the region location cache.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
