[ 
https://issues.apache.org/jira/browse/HBASE-14458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Samir Ahmic updated HBASE-14458:
--------------------------------
    Fix Version/s: 2.0.0
           Status: Patch Available  (was: Open)

> AsyncRpcClient#createRpcChannel() should check and remove dead channel before 
> creating new one to same server
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14458
>                 URL: https://issues.apache.org/jira/browse/HBASE-14458
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.1.3
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-14458.patch
>
>
> I noticed this issue while testing the master branch in distributed mode. 
> Reproduction steps:
> 1. Write some data with hbase ltt 
> 2. While ltt is writing, execute $graceful_stop.sh --restart --reload [rs] 
> 3. Wait until the script starts reloading regions onto the restarted server. 
> At that moment ltt stops writing and eventually fails. 
> After some digging I noticed that while ltt is working correctly there is 
> a single connection per regionserver (lsof output for a single connection; 
> 27109 is the ltt PID):
> {code}
> java      27109   hbase  143u    210579579      0t0        TCP 
> hnode1:40423->hnode5:16020 (ESTABLISHED)
> {code}  
> and when, in this example, the hnode5 server is restarted and the script 
> starts to reload regions onto it, ltt starts creating thousands of new TCP 
> connections to this server:
> {code}
> java      27109   hbase *623u              210674415      0t0        TCP 
> hnode1:52948->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *624u               210674416      0t0        TCP 
> hnode1:52949->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *625u               210674417      0t0        TCP 
> hnode1:52950->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *627u               210674419      0t0        TCP 
> hnode1:52952->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *628u               210674420      0t0        TCP 
> hnode1:52953->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *633u               210674425      0t0        TCP 
> hnode1:52958->hnode5:16020 (ESTABLISHED)
> ...
> {code}
> So here is what happened, based on some additional logging and debugging:
> - AsyncRpcClient never detected that the regionserver was restarted, because 
> the regions had been moved away, there were no read/write requests to this 
> server, and no heartbeat mechanism is implemented
> - because of the above, the dead {code}AsyncRpcChannel{code} stayed in 
> {code}PoolMap<Integer, AsyncRpcChannel> connections{code}
> - when ltt detected that the regions were moved back to hnode5, it tried to 
> reconnect to hnode5, leading to this issue
> I was able to resolve this issue by adding the following to 
> AsyncRpcClient#createRpcChannel():
> {code}
> synchronized (connections) {
>       if (closed) {
>         throw new StoppedRpcClientException();
>       }
>       rpcChannel = connections.get(hashCode);
> +    if (rpcChannel != null && !rpcChannel.isAlive()) {
> +        LOG.debug("Removing dead channel from " + 
> rpcChannel.address.toString());
> +        connections.remove(hashCode);
> +      }      
>       if (rpcChannel == null || !rpcChannel.isAlive()) {
>         rpcChannel = new AsyncRpcChannel(this.bootstrap, this, ticket, 
> serviceName, location);
>         connections.put(hashCode, rpcChannel);
> {code}
>  I will attach a patch after some more testing.
>  
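
The check-and-remove step described above can be sketched in isolation. This is only an illustration under stated assumptions, not the attached patch: FakeChannel, ChannelPoolSketch, getOrCreate, markDead, size and createdCount are hypothetical names, and a plain HashMap stands in for HBase's PoolMap, so the sketch shows the eviction logic only and does not reproduce the connection growth seen with the real pool.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for AsyncRpcChannel; only liveness matters here. */
class FakeChannel {
    final String address;
    private boolean alive = true;

    FakeChannel(String address) {
        this.address = address;
    }

    boolean isAlive() {
        return alive;
    }

    /** Simulates the remote server going away without the client noticing. */
    void markDead() {
        alive = false;
    }
}

/** Hypothetical stand-in for the client-side connection pool. */
class ChannelPoolSketch {
    // Assumption: a plain HashMap replaces PoolMap<Integer, AsyncRpcChannel>.
    private final Map<Integer, FakeChannel> connections = new HashMap<>();
    private int created = 0;

    FakeChannel getOrCreate(int hashCode, String address) {
        synchronized (connections) {
            FakeChannel channel = connections.get(hashCode);
            if (channel != null && !channel.isAlive()) {
                // Evict the dead entry before creating a replacement.
                connections.remove(hashCode);
                channel = null;
            }
            if (channel == null) {
                channel = new FakeChannel(address);
                connections.put(hashCode, channel);
                created++;
            }
            return channel;
        }
    }

    int size() {
        synchronized (connections) {
            return connections.size();
        }
    }

    int createdCount() {
        synchronized (connections) {
            return created;
        }
    }
}
```

In this sketch, repeated calls to getOrCreate with the same key return the same channel while it is alive; once markDead() has been called on it, the next call evicts it and creates exactly one replacement.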



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
