Jean-Daniel Cryans has submitted this change and it was merged. Change subject: [java client] Disconnected TabletClient can double retry RPCs ......................................................................
[java client] Disconnected TabletClient can double retry RPCs Noticed on Jenkins: http://104.196.14.100/job/kudu-gerrit/1215/BUILD_TYPE=TSAN/testReport/junit/org.kududb.client/TestAsyncKuduClient/testDisconnect/ The relevant part of the log: 2016-05-04 21:46:49,082 (Thread-5) [DEBUG - org.kududb.client.TabletClient.encode(TabletClient.java:251)] [Peer 93c65b3e604741709596f19c67955dcb] null Sending RPC #3, payload=BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) "\x00\x00\x00W/\x18\x032(\n kudu.tserver.TabletServerService\x12\x04ScanP\x90N&\n 83efd2d326364b71b5d210c17f036201\x18\x01 \x01" 2016-05-04 21:46:49,646 (New I/O worker #69) [DEBUG - org.kududb.client.TabletClient.sendQueuedRpcs(TabletClient.java:757)] [Peer 93c65b3e604741709596f19c67955dcb] Executing RPC queued: ScanRequest(scannerId="83efd2d326364b71b5d210c17f036201", tabletSlice=80478b37bfe5468399fbcf90a11fa515, attempt=3) 2016-05-04 21:46:49,646 (New I/O worker #69) [DEBUG - org.kududb.client.AsyncKuduScanner$ScanRequest.serialize(AsyncKuduScanner.java:773)] Sending scan req: scanner_id: "83efd2d326364b71b5d210c17f036201" call_seq_id: 1 batch_size_bytes: 1 2016-05-04 21:46:49,648 (New I/O worker #69) [DEBUG - org.kududb.client.TabletClient.sendQueuedRpcs(TabletClient.java:757)] [Peer 93c65b3e604741709596f19c67955dcb] Executing RPC queued: ScanRequest(scannerId="83efd2d326364b71b5d210c17f036201", tabletSlice=80478b37bfe5468399fbcf90a11fa515, attempt=3) 2016-05-04 21:46:49,648 (New I/O worker #69) [DEBUG - org.kududb.client.AsyncKuduScanner$ScanRequest.serialize(AsyncKuduScanner.java:773)] Sending scan req: scanner_id: "83efd2d326364b71b5d210c17f036201" call_seq_id: 1 batch_size_bytes: 1 So what happens is that in the first log we can see "null Sending RPC #3", the null part is because the Channel for that TabletClient was already sent to null. We handle that later in sendRpc(), but then at the end of that method we directly call failOrRetryRpc(). Why is that a problem? Because, before printing that first log line, we added the RPC to the rpcs_inflight map. It means that on channelDisconnected we'll call cleanup() which will also find the same RPC in rpcs_inflight and also call failOrRetryRpc(). TestAsyncKuduClient#testDisconnect could find this bug because it forcefully disconnects a TabletServer and then tries to continue using pre-opened scanner right away. I haven't been able to trigger it myself, but the above explanation is simple enough for me to believe that this is the right fix. I also believe that asynchbase has the same problem, but TBH the only way to encounter it is by doing what we're doing in the test. Change-Id: Ice511265a3c0186fb7d50eaf1a16a4f952fa84de Reviewed-on: http://gerrit.cloudera.org:8080/2978 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo <[email protected]> --- M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java 1 file changed, 9 insertions(+), 1 deletion(-) Approvals: Adar Dembo: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/2978 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ice511265a3c0186fb7d50eaf1a16a4f952fa84de Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins
