[ 
https://issues.apache.org/jira/browse/HBASE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-18955:
-------------------------------
    Attachment: HBASE-18995.001.branch-1.1.patch

.001 seems to have done the trick locally, but it's by no means "fast": it took
roughly 250s, as opposed to the normal 60s when meta is not on the
half-dead server.

* Needs a test case
* Needs verification against the existing tests

I'm still not convinced this is the right place to handle it. It's been too 
long since I've read the code around the RPC handling (with replicas); I think 
we wait for an operation timeout to happen instead of the first RPC timeout. 
I'm also not sure if catching CallTimeoutException is semantically the right 
thing to do, either.
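
To make the concern above concrete, here is a toy model (plain Python, not
HBase code) of the two behaviours: keep retrying the cached meta location until
the operation timeout expires, or drop the cached location after the first RPC
timeout. Every name here (ToyCluster, CallTimeout, read_with_retries) and the
10s/60s values are assumptions made up for illustration; the real client code
paths and timeouts differ.

```python
class CallTimeout(Exception):
    """Hypothetical stand-in for an RPC-level timeout; not HBase's class."""


class ToyCluster:
    """Tracks where meta really lives and which servers are hung."""

    def __init__(self):
        self.meta_host = "rs1"   # authoritative meta location
        self.hung = set()        # servers that never answer (SIGSTOP'ed)

    def read_meta(self, host):
        # The model only distinguishes hung servers from responsive ones.
        if host in self.hung:
            raise CallTimeout(host)
        return "meta-row"


def read_with_retries(cluster, cache, rpc_timeout, op_timeout, refresh_on_timeout):
    """Return (result, elapsed), where elapsed is simulated seconds."""
    elapsed = 0
    while elapsed < op_timeout:
        host = cache.get("meta")
        if host is None:
            # Cache miss: look up the authoritative location again.
            host = cache["meta"] = cluster.meta_host
        try:
            return cluster.read_meta(host), elapsed
        except CallTimeout:
            elapsed += rpc_timeout           # each failed RPC costs one RPC timeout
            if refresh_on_timeout:
                cache.pop("meta", None)      # forget the stale location
    return None, elapsed                     # operation timeout exhausted


cluster = ToyCluster()
cache = {"meta": "rs1"}      # client cached meta on rs1
cluster.hung.add("rs1")      # rs1 is SIGSTOP'ed
cluster.meta_host = "rs2"    # meta has been reassigned to rs2

stale = read_with_retries(cluster, dict(cache), 10, 60, refresh_on_timeout=False)
fresh = read_with_retries(cluster, dict(cache), 10, 60, refresh_on_timeout=True)
print(stale)   # (None, 60): every attempt went to the old location
print(fresh)   # ('meta-row', 10): one RPC timeout, then the new location
```

In this model, clearing the cached location on the first RPC timeout lets the
read succeed within the operation budget, while never clearing it burns the
whole budget on the hung server. Whether an RPC timeout alone is a sound signal
for invalidating the cache is the open question raised above.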

> HBase client queries stale hbase:meta location with half-dead RegionServer
> --------------------------------------------------------------------------
>
>                 Key: HBASE-18955
>                 URL: https://issues.apache.org/jira/browse/HBASE-18955
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.1.12
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.1.13
>
>         Attachments: HBASE-18995.001.branch-1.1.patch
>
>
> Have been investigating a case with [~tedyu] where, when a RegionServer 
> becomes "hung" (for no specific reason -- not the point), the client becomes 
> stuck trying to talk to this RegionServer, never exiting. This was eventually 
> tracked down to HBASE-15645. However, in testing the fix, I found that there 
> is an additional problem which only affects branch-1.1.
> When the RegionServer in the "half-dead" state is also hosting meta, the
> HBase client (both the one trying to read data and the one in the
> Master trying to read meta in SSH) gets stuck repeatedly trying to read meta
> from the old location after meta has been reassigned.
> The general test outline goes like this:
> * Start at least 2 regionservers
> * Load some data into a table ({{hbase pe}} is great)
> * Find a region that is hosted by the same RS that is hosting meta
> * {{kill -SIGSTOP}} that RS hosting the user region and meta
> * Issue a {{get}} in the hbase-shell trying to read from that user region
> The expectation is that the ZK lock will expire for the STOP'ed RS, meta will 
> be reassigned, then the user regions will be reassigned, then the client will 
> get the result of the get without seeing an error (as long as this happens 
> within the hbase.client.operation.timeout value, of course).
> We see this happening on HBase 1.2.4 and 1.3.2-SNAPSHOT, but, on 
> 1.1.13-SNAPSHOT, the Master gets up to re-assigning meta, but then gets stuck 
> trying to read meta from the STOP'ed RS instead of where it re-assigned it. 
> Because of this, the regions stay in transition until the master is restarted 
> or the STOP'ed RS is CONT'ed. My best guess is that when the RS sees the
> {{SIGCONT}}, it immediately begins stopping, which is enough to kick the
> client into refreshing the region location cache.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
