Jean-Daniel Cryans has submitted this change and it was merged.

Change subject: [java client] Disconnected TabletClient can double retry RPCs
......................................................................


[java client] Disconnected TabletClient can double retry RPCs

Noticed on Jenkins:
http://104.196.14.100/job/kudu-gerrit/1215/BUILD_TYPE=TSAN/testReport/junit/org.kududb.client/TestAsyncKuduClient/testDisconnect/

The relevant part of the log:

2016-05-04 21:46:49,082 (Thread-5) [DEBUG - 
org.kududb.client.TabletClient.encode(TabletClient.java:251)] [Peer 
93c65b3e604741709596f19c67955dcb] null Sending RPC #3, 
payload=BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) 
"\x00\x00\x00W/\x18\x032(\n 
kudu.tserver.TabletServerService\x12\x04ScanP\x90N&\n 
83efd2d326364b71b5d210c17f036201\x18\x01 \x01"

2016-05-04 21:46:49,646 (New I/O worker #69) [DEBUG - 
org.kududb.client.TabletClient.sendQueuedRpcs(TabletClient.java:757)] [Peer 
93c65b3e604741709596f19c67955dcb] Executing RPC queued: 
ScanRequest(scannerId="83efd2d326364b71b5d210c17f036201", 
tabletSlice=80478b37bfe5468399fbcf90a11fa515, attempt=3)
2016-05-04 21:46:49,646 (New I/O worker #69) [DEBUG - 
org.kududb.client.AsyncKuduScanner$ScanRequest.serialize(AsyncKuduScanner.java:773)]
 Sending scan req: scanner_id: "83efd2d326364b71b5d210c17f036201"
call_seq_id: 1
batch_size_bytes: 1

2016-05-04 21:46:49,648 (New I/O worker #69) [DEBUG - 
org.kududb.client.TabletClient.sendQueuedRpcs(TabletClient.java:757)] [Peer 
93c65b3e604741709596f19c67955dcb] Executing RPC queued: 
ScanRequest(scannerId="83efd2d326364b71b5d210c17f036201", 
tabletSlice=80478b37bfe5468399fbcf90a11fa515, attempt=3)
2016-05-04 21:46:49,648 (New I/O worker #69) [DEBUG - 
org.kududb.client.AsyncKuduScanner$ScanRequest.serialize(AsyncKuduScanner.java:773)]
 Sending scan req: scanner_id: "83efd2d326364b71b5d210c17f036201"
call_seq_id: 1
batch_size_bytes: 1

So what happens is that in the first log we can see "null Sending RPC #3", the
null part is because the Channel for that TabletClient was already sent to null.
We handle that later in sendRpc(), but then at the end of that method we 
directly
call failOrRetryRpc(). Why is that a problem? Because, before printing that 
first
log line, we added the RPC to the rpcs_inflight map. It means that on
channelDisconnected we'll call cleanup() which will also find the same RPC in
rpcs_inflight and also call failOrRetryRpc().

TestAsyncKuduClient#testDisconnect could find this bug because it forcefully
disconnects a TabletServer and then tries to continue using pre-opened scanner
right away. I haven't been able to trigger it myself, but the above explanation
is simple enough for me to believe that this is the right fix.

I also believe that asynchbase has the same problem, but TBH the only way to
encounter it is by doing what we're doing in the test.

Change-Id: Ice511265a3c0186fb7d50eaf1a16a4f952fa84de
Reviewed-on: http://gerrit.cloudera.org:8080/2978
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <[email protected]>
---
M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
1 file changed, 9 insertions(+), 1 deletion(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/2978
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ice511265a3c0186fb7d50eaf1a16a4f952fa84de
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins

Reply via email to