[jira] [Commented] (HBASE-9268) Client doesn't recover from a stalled region server

Nicolas Liochon (JIRA) Thu, 22 Aug 2013 07:14:33 -0700

    [ https://issues.apache.org/jira/browse/HBASE-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747552#comment-13747552 ]


Nicolas Liochon commented on HBASE-9268:
----------------------------------------

I tried 0.94 on a pseudo cluster. It seems to work well 90% of the time 
(that is, I had a failure).
A possible explanation is that writes don't block until the server-side 
buffer is full (a side effect of kill -STOP: the socket handling is done by the 
OS, not the process), and that the 0.95 message size is bigger than in 0.94 
(why would it be?). It's not very satisfying. The patch seems to work, however, 
so there is a solution that makes sense even if I don't fully understand the 
0.94 scenario. I will spend more time on this.
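
To illustrate the buffering effect described above, here is a minimal sketch. 
It is not HBase or Hadoop code: the class and method names are made up for the 
example, it uses plain java.net.Socket instead of SocketIOWithTimeout, and a 
listener that never accepts or reads stands in for the kill -STOP'ed region 
server. It assumes the request is small enough to fit in the OS socket buffers.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of HBase.
public class StalledPeerDemo {

    /**
     * Connects to a local listener that never accepts or reads, writes one
     * request, then waits for a reply. Returns true if the write completed
     * and the read then hit the read timeout.
     */
    static boolean writeSucceedsThenReadTimesOut(int requestBytes, int readTimeoutMs)
            throws IOException {
        // The OS completes the TCP handshake and queues incoming bytes even
        // though the "server" process never calls accept() or read().
        try (ServerSocket stalled =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                     new Socket(stalled.getInetAddress(), stalled.getLocalPort())) {
            // SO_TIMEOUT bounds blocking reads only; it does not bound writes.
            client.setSoTimeout(readTimeoutMs);

            OutputStream out = client.getOutputStream();
            // A small request fits in the OS buffers, so this returns at once.
            out.write(new byte[requestBytes]);
            out.flush();

            try {
                client.getInputStream().read(); // nobody will ever answer
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the client notices the stall via the read timeout
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeSucceedsThenReadTimesOut(1024, 500));
    }
}
```

With a small request the client gets out through the read timeout. A request 
larger than the available buffer space would instead block inside write(), 
which the read timeout does not cover; that would be consistent with the 
explanation above, but it is only an assumption here.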

The stack trace when it works on 0.94 is:
{noformat}
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:42395 remote=sd-box/127.0.0.1:60020]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:373)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:646)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:580)
{noformat}
                
> Client doesn't recover from a stalled region server
> ---------------------------------------------------
>
>                 Key: HBASE-9268
>                 URL: https://issues.apache.org/jira/browse/HBASE-9268
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.95.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Nicolas Liochon
>             Fix For: 0.98.0, 0.95.3
>
>         Attachments: 9268-hack.patch
>
>
> Got this testing the 0.95.2 RC.
> I killed -STOP a region server and let it stay like that while running PE. 
> The clients didn't find the new region locations and in the jstack were stuck 
> doing RPC. Eventually I killed -CONT and the client printed these:
> bq. Exception in thread "TestClient-6" java.lang.RuntimeException: 
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 
> 128 actions: IOException: 90 times, SocketTimeoutException: 38 times,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
