[jira] [Commented] (HBASE-9268) Client doesn't recover from a stalled region server

Nicolas Liochon (JIRA) Tue, 20 Aug 2013 10:17:49 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745145#comment-13745145
 ]


Nicolas Liochon commented on HBASE-9268:
----------------------------------------

I tried this on a pseudo-distributed cluster with YCSB and got the following 
jstack output when I kill -STOP the regionserver:
{noformat}
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        - waiting to lock <0x00000007de6af410****** > (a 
java.io.BufferedOutputStream)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:232)
        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:248)
        at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.close(RpcClient.java:963)
        - locked <0x00000007de6ab808> (a 
org.apache.hadoop.hbase.ipc.RpcClient$Connection)
        at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:718)

"hbase-table-pool-1-thread-6" daemon prio=10 tid=0x00007f93000ce800 nid=0x649c 
runnable [0x00007f932aa85000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x00000007deb3ff60> (a sun.nio.ch.Util$2)
        - locked <0x00000007deb3ff50> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000007deb3fd48> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        - locked <0x00000007de6af410******> (a java.io.BufferedOutputStream)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        - locked <0x00000007de6af3f0> (a java.io.DataOutputStream)
        at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:230)
        at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:220)
        at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1039)
        - locked <0x00000007de6af3f0> (a java.io.DataOutputStream)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1407)
        at 
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1635)
{noformat}


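It's exactly as if the timeout on the socket were not set: the writer thread 
sits in select() while holding the BufferedOutputStream lock, so the thread 
trying to close the connection blocks on that same lock. Strange.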
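
For illustration, here is a minimal sketch of the locking pattern visible in the two stacks above. It is not HBase code: StalledWriteDemo and StalledStream are hypothetical names, and StalledStream only stands in for a socket stream whose write never times out. One thread stalls inside BufferedOutputStream.write() while a second thread calls close(), which has to flush first and so cannot proceed until the writer returns.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class StalledWriteDemo {

    /** Hypothetical stand-in for a socket whose peer stopped reading: write() parks until released. */
    static final class StalledStream extends OutputStream {
        final CountDownLatch entered = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        @Override
        public void write(int b) throws IOException {
            entered.countDown();          // signal that a write reached the "socket"
            try {
                release.await();          // no timeout here, like a write with no socket timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while stalled", e);
            }
        }
    }

    /** Returns true if close() stayed stuck while the writer was stalled, and both finished once released. */
    static boolean closerStuckWhileWriterStalled() {
        StalledStream sink = new StalledStream();
        BufferedOutputStream out = new BufferedOutputStream(sink, 8);

        // The payload is larger than the 8-byte buffer, so write() goes straight to the
        // underlying stream while still holding the BufferedOutputStream lock.
        Thread writer = new Thread(() -> {
            try { out.write(new byte[64]); } catch (IOException ignored) { }
        }, "writer");

        // close() flushes first, and flush() needs the same lock the writer holds.
        Thread closer = new Thread(() -> {
            try { out.close(); } catch (IOException ignored) { }
        }, "closer");

        try {
            writer.start();
            if (!sink.entered.await(5, TimeUnit.SECONDS)) {
                return false;             // writer never reached the stalled stream
            }
            closer.start();
            closer.join(300);             // give close() a chance to finish; it should not
            boolean stuck = closer.isAlive();

            sink.release.countDown();     // equivalent of kill -CONT on the server
            writer.join(5000);
            closer.join(5000);
            return stuck && !writer.isAlive() && !closer.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("closer stuck while writer stalled: " + closerStuckWhileWriterStalled());
    }
}
```

The sketch only reproduces the lock dependency. It says nothing about why the write timeout does not fire in RpcClient, which is the open question here.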
                
> Client doesn't recover from a stalled region server
> ---------------------------------------------------
>
>                 Key: HBASE-9268
>                 URL: https://issues.apache.org/jira/browse/HBASE-9268
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.95.2
>            Reporter: Jean-Daniel Cryans
>             Fix For: 0.98.0, 0.95.3
>
>
> Got this testing the 0.95.2 RC.
> I killed -STOP a region server and let it stay like that while running PE. 
> The clients didn't find the new region locations and in the jstack were stuck 
> doing RPC. Eventually I killed -CONT and the client printed these:
> bq. Exception in thread "TestClient-6" java.lang.RuntimeException: 
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 
> 128 actions: IOException: 90 times, SocketTimeoutException: 38 times,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
