Todd Lipcon has posted comments on this change.

Change subject: KUDU-1409. Make krpc call timeouts more resistant to process or 
reactor pauses
......................................................................


Patch Set 3:

> Pretty clever solution, though I wonder how widespread this issue is outside 
> of Impala. Just to make sure I understand, this is only a mitigation 
> mechanism at best, because it can only reduce the likelihood of process 
> pauses triggering Kudu timeouts, right? That is, a pause during the second 
> stage will still result in a timeout even if the RPC result is sitting in the 
> socket.

Right, if you have a 5 second timeout, and the actual response arrived at 4.9 
seconds, and the client happened to pause between 4.8 seconds and 5.2 seconds, 
you'd still get a "false" timeout. But, this patch means that you won't ever 
get false timeouts due to pauses in the first 90% of the allotted time.

Given that people usually set timeouts much longer than the expected response 
time, I think this is likely to solve most of the problems.

Outside of Impala, I've also seen timeouts in various tests due to reactor 
threads blocked in glog. E.g., see https://issues.apache.org/jira/browse/KUDU-695 
where a log message from an RPC callback blocked the reactor and could easily 
cause client calls to time out.


> Did you measure the effect on Impala, in a set of queries that would 
> previously have triggered unnecessary timeouts?

Nope — these issues are hard enough to reproduce reliably that I haven't been 
able to show whether or not it fixes them. E.g., after running queries in a loop 
for a day on an earlier build, I only saw one timeout, which I think was due to 
this issue. But I've also seen these issues in Kudu itself on occasion. They're 
always rare enough that it's hard to repro outside of a manufactured scenario 
like the test case included here.


> Relatedly, are you aware of any other client connectors that similarly try to 
> work around process pauses? For example, I'd imagine this would be a bigger 
> deal in the JVM where GC-related pauses are a more common thing; does Netty 
> employ something like this to soften the blow for clients?

I googled around and couldn't find any existing examples of people doing this 
technique. I agree it should be quite applicable in Java. Maybe we should 
suggest it to Hadoop/HBase teams :)

-- 
To view, visit http://gerrit.cloudera.org:8080/2745
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7bff0bc1573a059f12be8bd3f46e301275e78392
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: No
