Todd Lipcon has posted comments on this change.

Change subject: KUDU-1409. Make krpc call timeouts more resistant to process or reactor pauses
......................................................................
Patch Set 3:

> Pretty clever solution, though I wonder how widespread this issue is outside
> of Impala. Just to make sure I understand, this is only a mitigation
> mechanism at best, because it can only reduce the likelihood of process
> pauses triggering Kudu timeouts, right? That is, a pause during the second
> stage will still result in a timeout even if the RPC result is sitting in the
> socket.

Right: if you have a 5-second timeout, the actual response arrived at 4.9 seconds, and the client happened to pause between 4.8 and 5.2 seconds, you'd still get a "false" timeout. But this patch means that you won't ever get false timeouts due to pauses in the first 90% of the allotted time. Given that people usually set timeouts much longer than the expected response time, I think this is likely to solve most of the problems.

Outside of Impala, I've also seen timeouts in various tests due to reactor threads blocked in glog. E.g. see https://issues.apache.org/jira/browse/KUDU-695, where a log message from an RPC callback blocked the reactor and could easily cause client calls to time out.

> Did you measure the effect on Impala, in a set of queries that would
> previously have triggered unnecessary timeouts?

Nope. It's hard enough to reliably reproduce these issues that I haven't been able to show whether it fixes them. E.g. after running queries in a loop for a day on an earlier build, I only saw one timeout which I think was due to this issue. I've also seen these issues in Kudu itself on occasion, but they're always rare enough that it's hard to repro them outside of a manufactured scenario like the test case included here.

> Relatedly, are you aware of any other client connectors that similarly try to
> work around process pauses? For example, I'd imagine this would be a bigger
> deal in the JVM where GC-related pauses are a more common thing; does Netty
> employ something like this to soften the blow for clients?
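(Aside: the pause-vs-timeout arithmetic described above can be sketched as follows. This is a hypothetical Python model, not the actual C++ patch; the 90% split, the `resolve_call`/`processed_at` names, and the single-interval pause model are all illustrative assumptions. The idea is that the call's timer is armed for only part of the budget; when it fires, the reactor first checks whether the response is already sitting in the socket, and only then arms a short final timer for the remainder.)

```python
FIRST_STAGE_FRACTION = 0.9  # assumed split of the timeout budget


def processed_at(event_time, pause):
    """Time at which the reactor actually handles an event: a process
    pause defers everything occurring inside it to the pause's end."""
    start, end = pause
    return end if start <= event_time < end else event_time


def resolve_call(timeout, response_arrival, pause=(0.0, 0.0)):
    """Simulate one RPC. response_arrival is when the response bytes hit
    the socket (None = never); pause is a (start, end) process pause.
    Returns "ok" or "timeout"."""
    first_deadline = timeout * FIRST_STAGE_FRACTION
    first_fire = processed_at(first_deadline, pause)
    if response_arrival is not None and response_arrival <= first_fire:
        # The response was already in the socket when the (possibly
        # long-delayed) first-stage timer ran: complete the call rather
        # than declaring a false timeout.
        return "ok"
    # No response yet: arm a final timer for the remaining budget. A
    # pause during this second stage can still produce a "false"
    # timeout, because the response may arrive but go unprocessed.
    second_deadline = first_fire + timeout * (1 - FIRST_STAGE_FRACTION)
    if response_arrival is not None and \
            processed_at(response_arrival, pause) <= second_deadline:
        return "ok"
    return "timeout"


# A long pause during the first 90% no longer causes a false timeout:
print(resolve_call(5.0, response_arrival=4.0, pause=(3.0, 30.0)))  # ok
# ...but the 4.9s-response / 4.8-5.2s-pause case above still times out:
print(resolve_call(5.0, response_arrival=4.9, pause=(4.8, 5.2)))   # timeout
```

Under this model, only a pause overlapping the final 10% of the budget can turn an already-arrived response into a timeout, which matches the "first 90%" claim above.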
I googled around and couldn't find any existing examples of people using this technique. I agree it should be quite applicable in Java. Maybe we should suggest it to the Hadoop/HBase teams :)

--
To view, visit http://gerrit.cloudera.org:8080/2745
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7bff0bc1573a059f12be8bd3f46e301275e78392
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: No
