[ https://issues.apache.org/jira/browse/KUDU-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752602#comment-16752602 ]
Will Berkeley commented on KUDU-1868:
-------------------------------------
Sort of. The socket read timeout is a property of the connection, so it applies
to every RPC sent over that connection. If a client uses different timeouts for
different operations, a single socket read timeout can't match all of them. But
if the same timeout is set on all RPCs, then setting the socket read timeout to
that value is reasonable.

The Java client depends on the socket read timeout to handle the case where the
server hangs and doesn't respond. Without it, if a server never responds to an
RPC, that RPC will hang forever. I have a test showing this. I'm not 100% sure
what happens when there is parallel traffic on one connection: if one RPC is
hanging but others are completing on the same connection and resetting the last
recv time, I assume the hanging RPC will not time out. I should enhance my test
to show this.
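
To make the suspected failure mode concrete, here is a toy model in plain Java.
It is not Kudu or netty code: IdleTimeoutConnection and IdleTimeoutDemo are
hypothetical names, and the model simply assumes that the read timer is reset
by any response on the connection, which is the assumption stated above and
not something confirmed by a test yet.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a connection whose only timeout is an idle-read timer (hypothetical, not Kudu code). */
final class IdleTimeoutConnection {
    private final long readTimeoutMs;
    private long lastRecvMs;                                  // reset whenever ANY response arrives
    private final List<Integer> pending = new ArrayList<>();  // call IDs awaiting a response

    IdleTimeoutConnection(long readTimeoutMs, long nowMs) {
        this.readTimeoutMs = readTimeoutMs;
        this.lastRecvMs = nowMs;
    }

    void send(int callId) {
        pending.add(callId);
    }

    /** A response for callId arrives at time nowMs. */
    void onResponse(int callId, long nowMs) {
        pending.remove(Integer.valueOf(callId));
        lastRecvMs = nowMs;
    }

    /** Runs the idle check at nowMs and returns the calls it failed, if any. */
    List<Integer> checkReadTimeout(long nowMs) {
        List<Integer> timedOut = new ArrayList<>();
        if (nowMs - lastRecvMs >= readTimeoutMs) {
            timedOut.addAll(pending);
            pending.clear();
        }
        return timedOut;
    }

    boolean isPending(int callId) {
        return pending.contains(callId);
    }
}

final class IdleTimeoutDemo {
    /**
     * Sends call 1, which is never answered, then sends and answers one other
     * call every chatterIntervalMs until untilMs. Returns true if call 1 is
     * still pending at the end, i.e. it was never timed out.
     */
    static boolean hangingCallSurvives(long readTimeoutMs, long chatterIntervalMs, long untilMs) {
        IdleTimeoutConnection conn = new IdleTimeoutConnection(readTimeoutMs, 0);
        conn.send(1);
        int nextId = 2;
        for (long now = chatterIntervalMs; now <= untilMs; now += chatterIntervalMs) {
            conn.checkReadTimeout(now);      // idle check runs before the next response lands
            conn.send(nextId);
            conn.onResponse(nextId, now);    // healthy traffic resets lastRecvMs
            nextId++;
        }
        conn.checkReadTimeout(untilMs);
        return conn.isPending(1);
    }
}
```

Under this model, with a 10 second read timeout and another call answered
every 5 seconds, call 1 is still pending after 60 seconds. If the other traffic
is slower than the read timeout, the idle check fires and call 1 is failed.
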
We should instead have a mechanism that tracks each call's timeout and actively
fails the call when its deadline passes, instead of relying on an event in the
netty Channel. This might be tricky in practice because an operation that looks
like one RPC with one timeout may actually be a series of RPCs of unknown
length going to different servers.
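
A minimal sketch of what per-call tracking could look like, using only the JDK.
This is an illustration under stated assumptions, not a proposed patch:
CallTimeoutTracker and its methods are hypothetical names, responses are
modeled as plain strings, and nothing here touches the real client or netty.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-call timeout tracker; not part of the Kudu client. */
final class CallTimeoutTracker implements AutoCloseable {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "call-timeout-timer");
            t.setDaemon(true);
            return t;
        });
    private final Map<Long, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();
    private final AtomicLong nextCallId = new AtomicLong();

    /** Registers a call and arms a timer that belongs to this call only. Returns the call ID. */
    long register(CompletableFuture<String> result, long timeoutMs) {
        long id = nextCallId.incrementAndGet();
        inFlight.put(id, result);
        ScheduledFuture<?> timeoutTask = timer.schedule(() -> {
            CompletableFuture<String> f = inFlight.remove(id);
            if (f != null) {
                f.completeExceptionally(
                    new TimeoutException("call " + id + " timed out after " + timeoutMs + " ms"));
            }
        }, timeoutMs, TimeUnit.MILLISECONDS);
        // Once the call finishes for any reason, drop it and cancel its timer.
        result.whenComplete((value, error) -> {
            inFlight.remove(id);
            timeoutTask.cancel(false);
        });
        return id;
    }

    /** Called when a response arrives; returns false if the call already finished or timed out. */
    boolean complete(long id, String response) {
        CompletableFuture<String> f = inFlight.remove(id);
        return f != null && f.complete(response);
    }

    int inFlightCount() {
        return inFlight.size();
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

Because each call carries its own timer, a response to one call has no effect
on another call's deadline. The multi-RPC case mentioned above is not handled
here; one option would be to register the whole operation once with its
remaining deadline rather than registering every underlying RPC.
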
> Java client mishandles socket read timeouts for scans
> -----------------------------------------------------
>
> Key: KUDU-1868
> URL: https://issues.apache.org/jira/browse/KUDU-1868
> Project: Kudu
> Issue Type: Bug
> Components: client
> Affects Versions: 1.2.0
> Reporter: Jean-Daniel Cryans
> Assignee: Will Berkeley
> Priority: Major
>
> Scan calls from the Java client that take more than the socket read timeout
> get retried (unless the operation timeout has expired) instead of being
> killed. Users will see this:
> {code}
> org.apache.kudu.client.NonRecoverableException: Invalid call sequence ID in
> scan request
> {code}
> Note that the right behavior here would still end up killing the scanner, so
> this is really a problem the user has to deal with! It's usually caused by
> slow IO, combined with very selective scans.
> Workaround: set defaultSocketReadTimeoutMs higher, ideally equal to
> defaultOperationTimeoutMs (the defaults are 10 and 30 seconds respectively).
> But really the user should investigate why individual scans are so slow.
> One potentially easy fix is to handle retries differently for scanners so
> that the user gets a nicer exception. A harder fix is to handle socket read
> timeouts completely differently: they should be per-RPC and not per
> TabletClient like they are right now.