Yuxuan Wang created THRIFT-5233:
-----------------------------------
Summary: I/O timeout handling in go library
Key: THRIFT-5233
URL: https://issues.apache.org/jira/browse/THRIFT-5233
Project: Thrift
Issue Type: Improvement
Components: Go - Compiler, Go - Library
Affects Versions: 0.13.0
Reporter: Yuxuan Wang
While debugging the bug in the first implementation of THRIFT-5214, I started
to look into the rabbit hole of I/O timeouts in the go library (mainly the
socket timeout on TSocket), and start to realize that the current way we handle
it is not great.
>From client's perspective, how a request goes is:
client prepare the request TStruct -> send the request over the wire -> start
reading the response over the wire -> handle the response or I/O error
The problem here, is that server will only send the first byte of response out
after it finished processing the request. So if the client incorrectly
configured a socket timeout on the TSocket used by this request to a value
that's too low, the reading will just report i/o timeout when the deadline
reaches, and the client will just fail the whole request.
This will cause problems:
# Client can't set socket timeout to something too low. In fact, if they use a
client pool (so for most requests there's no overhead on connecting), they need
the socket timeout to be at least the latency SLA of the server they talk to,
otherwise there's a serious risk that the client will fail a lot of requests
prematurely.
# On the other hand, setting the socket timeout to something too high is also a
bad practice. In case that server is overloaded and cannot handle requests in a
timely fashion, client should have a way to fail faster, instead of waiting for
a very long timeout to expire.
If the client set a deadline on the context object with the call, their
expectation usually would be that the request will fail after the deadline
passed (a small overhead is usually acceptable). But the socket timeout is
usually some fixed value configured to the program, while the actual deadline
on the context is more variable (e.g. this could be a server talking to
upstream server, it has a fixed totally deadline for the whole endpoint, but
the steps before that might take variable time so the deadline left here can
vary a lot).
This leads to the point that we would need a way to keep socket timeout and the
deadline set in the context object in-sync.
There are a few different options. The first obvious one, is to pass the
context object all the way to TSocket.Read, so that function can extract the
deadline out of the context object (if any), and set that as the socket timeout
instead of the pre-configured, fixed one.
But that is also easy to be ruled out, because that would change the function
signature of TSocket.Read, making TSocket no longer implement io.Reader
interface.
The next option, I think, is to handle it on TProtocol level. The idea is that
we pass the context object all the way into TProtocol.ReadMessageBegin (and
maybe other read functions of TProcotol?). This way, it can handle the read
errors, check if it's a timeout error, and if it is and the context deadline
haven't passed, it can just keep retrying again. This way, we can set the fixed
socket timeout to a low number when we know we will always have a real deadline
in the context attached, and just let TProtocol implementation to keep retrying
instead. If the user don't attach a deadline to the context object, then they
still need to set a large number on the socket timeout.
A slightly different option, is to add SetReadTimeout function to TTransport
interface. So all the TTransport implementations wrapping TSocket should just
pass that along. If the terminal one is actually not a TSocket (for example,
it's TMemoryBuffer), then this function just does nothing. This way, the
TProtocol.Read* functions can just extract deadline out of the context object,
and call their TTransport.SetReadTimeout with that deadline.
Either way, this is going to be a breaking change that we need to add context
object to TProtocol.Read* function signatures. As a result, I believe some
compiler change might also be required to pass it all the way to TProtocol
calls. But even if compiler changes are needed, that part should be minimal
(just make sure we are passing context object correctly), and most of the
changes would be on TProtocol implementations.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)