[jira] [Commented] (KUDU-1868) Java client mishandles socket read timeouts for scans

Will Berkeley (JIRA) Fri, 25 Jan 2019 11:31:54 -0800


    [ 
https://issues.apache.org/jira/browse/KUDU-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752602#comment-16752602
 ]


Will Berkeley commented on KUDU-1868:
-------------------------------------

Sort of. The socket read timeout is a property of the connection, so it applies 
to all RPCs that go over the connection. Ergo, if a client is using different 
timeouts for different operations, we can't match the socket read timeout to 
all the different timeouts. But if the same timeout is set on all RPCs then 
setting the socket read timeout to that timeout is reasonable.

The Java client depends on having a socket read timeout to handle timeouts when 
the server hangs and doesn't respond. Without it, if a server just doesn't 
respond to an rpc, that rpc will hang forever. I have a test showing this. I'm 
not 100% sure what happens if there is parallel traffic on one connection: if 
one RPC is hanging, but others are passing on the connection and resetting the 
last recv time, I assume the hanging RPC will not time out. I should enhance my 
test to show this.
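
To make the concern concrete, here is a minimal sketch of a connection-wide 
read-idle timer that is reset by any received response. It is a toy model of 
the behavior I am assuming, not Netty or Kudu code; the class and method names 
are made up for illustration.

```java
// Toy model (hypothetical, not Kudu or Netty code): one read-idle timer per
// connection, reset whenever ANY response arrives on that connection.
class SharedReadTimeoutSketch {

  /**
   * Returns the time (ms) at which the connection-wide idle timer would fire,
   * or -1 if it never fires within the observed window.
   *
   * @param responseArrivalsMs arrival times of responses to OTHER RPCs on the
   *                           same connection, sorted ascending, all within
   *                           [0, horizonMs]
   * @param readTimeoutMs      the socket read timeout for the connection
   * @param horizonMs          how long we observe the connection
   */
  static long idleTimerFiresAt(long[] responseArrivalsMs,
                               long readTimeoutMs,
                               long horizonMs) {
    long lastRecv = 0; // connection considered freshly active at t=0
    for (long arrival : responseArrivalsMs) {
      if (arrival - lastRecv >= readTimeoutMs) {
        // The gap was long enough: the timer fired before this arrival.
        return lastRecv + readTimeoutMs;
      }
      lastRecv = arrival; // any traffic resets the last recv time
    }
    long fire = lastRecv + readTimeoutMs;
    return fire <= horizonMs ? fire : -1;
  }
}
```

In this model, with a 10 second read timeout and other responses arriving every 
5 seconds, the timer never fires during the window, so an RPC hanging on the 
same connection would never be timed out by it.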

We should instead have a mechanism to track each call's timeout and actively 
fail the call once its deadline passes, rather than relying on an event in the 
Netty Channel. This might be tricky in practice because one operation that 
looks like one RPC and has one timeout may actually be a series of RPCs of 
unknown length going to different servers.
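
A rough sketch of what per-call tracking could look like, using only the JDK. 
This is not the Kudu client's implementation: CallTimeoutTracker and track() 
are hypothetical names, and the sketch covers only the simple case of a single 
RPC, not an operation that fans out into several RPCs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Kudu client API.
class CallTimeoutTracker implements AutoCloseable {
  // Single daemon thread that fires per-call deadlines.
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "call-timeout-tracker");
        t.setDaemon(true);
        return t;
      });

  /**
   * Wraps the future for one in-flight call. The returned future completes
   * with the response if it arrives in time, and fails with a
   * TimeoutException otherwise, regardless of other traffic on the
   * connection.
   */
  <T> CompletableFuture<T> track(CompletableFuture<T> response, long timeoutMs) {
    CompletableFuture<T> result = new CompletableFuture<>();

    // Each call gets its own deadline task.
    ScheduledFuture<?> deadline = timer.schedule(() -> {
      result.completeExceptionally(
          new TimeoutException("call timed out after " + timeoutMs + " ms"));
    }, timeoutMs, TimeUnit.MILLISECONDS);

    // When the real response (or error) arrives, cancel the deadline and
    // propagate the outcome. If the deadline already fired, these
    // complete*() calls are no-ops.
    response.whenComplete((value, error) -> {
      deadline.cancel(false);
      if (error != null) {
        result.completeExceptionally(error);
      } else {
        result.complete(value);
      }
    });
    return result;
  }

  @Override
  public void close() {
    timer.shutdownNow();
  }
}
```

A caller would pass the future for the in-flight RPC and that call's own 
timeout to track(), then wait on the returned future. A multi-RPC operation 
would presumably need a single deadline shared by every RPC it issues, which 
this sketch does not attempt.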

> Java client mishandles socket read timeouts for scans
> -----------------------------------------------------
>
>                 Key: KUDU-1868
>                 URL: https://issues.apache.org/jira/browse/KUDU-1868
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.2.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Will Berkeley
>            Priority: Major
>
> Scan calls from the Java client that take more than the socket read timeout 
> get retried (unless the operation timeout has expired) instead of being 
> killed. Users will see this:
> {code}
> org.apache.kudu.client.NonRecoverableException: Invalid call sequence ID in 
> scan request
> {code}
> Note that the right behavior here would still end up killing the scanner, so 
> this is really a problem the user has to deal with! It's usually caused by 
> slow IO, combined with very selective scans.
> Workaround: set defaultSocketReadTimeoutMs higher, ideally equal to 
> defaultOperationTimeoutMs (the defaults are 10 and 30 seconds respectively). 
> But really the user should investigate why the single scans are so slow.
> One potentially easy fix to this is to handle retries differently for 
> scanners so that the user gets a nicer exception. A harder fix is to handle 
> socket read timeouts completely differently: basically, it should be per-RPC 
> and not per-TabletClient like it is right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
