Yuxuan Wang created THRIFT-5233:
-----------------------------------

             Summary: I/O timeout handling in go library
                 Key: THRIFT-5233
                 URL: https://issues.apache.org/jira/browse/THRIFT-5233
             Project: Thrift
          Issue Type: Improvement
          Components: Go - Compiler, Go - Library
    Affects Versions: 0.13.0
            Reporter: Yuxuan Wang


While debugging the bug in the first implementation of THRIFT-5214, I started 
to look into the rabbit hole of I/O timeouts in the go library (mainly the 
socket timeout on TSocket), and start to realize that the current way we handle 
it is not great.

>From client's perspective, how a request goes is:

client prepare the request TStruct -> send the request over the wire -> start 
reading the response over the wire -> handle the response or I/O error

The problem here, is that server will only send the first byte of response out 
after it finished processing the request. So if the client incorrectly 
configured a socket timeout on the TSocket used by this request to a value 
that's too low, the reading will just report i/o timeout when the deadline 
reaches, and the client will just fail the whole request.

This will cause problems:

# Client can't set socket timeout to something too low. In fact, if they use a 
client pool (so for most requests there's no overhead on connecting), they need 
the socket timeout to be at least the latency SLA of the server they talk to, 
otherwise there's a serious risk that the client will fail a lot of requests 
prematurely.
# On the other hand, setting the socket timeout to something too high is also a 
bad practice. In case that server is overloaded and cannot handle requests in a 
timely fashion, client should have a way to fail faster, instead of waiting for 
a very long timeout to expire.

If the client set a deadline on the context object with the call, their 
expectation usually would be that the request will fail after the deadline 
passed (a small overhead is usually acceptable). But the socket timeout is 
usually some fixed value configured to the program, while the actual deadline 
on the context is more variable (e.g. this could be a server talking to 
upstream server, it has a fixed totally deadline for the whole endpoint, but 
the steps before that might take variable time so the deadline left here can 
vary a lot).

This leads to the point that we would need a way to keep socket timeout and the 
deadline set in the context object in-sync.

There are a few different options. The first obvious one, is to pass the 
context object all the way to TSocket.Read, so that function can extract the 
deadline out of the context object (if any), and set that as the socket timeout 
instead of the pre-configured, fixed one.

But that is also easy to be ruled out, because that would change the function 
signature of TSocket.Read, making TSocket no longer implement io.Reader 
interface.

The next option, I think, is to handle it on TProtocol level. The idea is that 
we pass the context object all the way into TProtocol.ReadMessageBegin (and 
maybe other read functions of TProcotol?). This way, it can handle the read 
errors, check if it's a timeout error, and if it is and the context deadline 
haven't passed, it can just keep retrying again. This way, we can set the fixed 
socket timeout to a low number when we know we will always have a real deadline 
in the context attached, and just let TProtocol implementation to keep retrying 
instead. If the user don't attach a deadline to the context object, then they 
still need to set a large number on the socket timeout.

A slightly different option, is to add SetReadTimeout function to TTransport 
interface. So all the TTransport implementations wrapping TSocket should just 
pass that along. If the terminal one is actually not a TSocket (for example, 
it's TMemoryBuffer), then this function just does nothing. This way, the 
TProtocol.Read* functions can just extract deadline out of the context object, 
and call their TTransport.SetReadTimeout with that deadline.

Either way, this is going to be a breaking change that we need to add context 
object to TProtocol.Read* function signatures. As a result, I believe some 
compiler change might also be required to pass it all the way to TProtocol 
calls. But even if compiler changes are needed, that part should be minimal 
(just make sure we are passing context object correctly), and most of the 
changes would be on TProtocol implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to