Jean-Daniel Cryans has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/2978

Change subject: [java client] Disconnected TabletClient can double retry RPCs
......................................................................

[java client] Disconnected TabletClient can double retry RPCs

Noticed on Jenkins:
http://104.196.14.100/job/kudu-gerrit/1215/BUILD_TYPE=TSAN/testReport/junit/org.kududb.client/TestAsyncKuduClient/testDisconnect/

The relevant part of the log:

2016-05-04 21:46:49,082 (Thread-5) [DEBUG - 
org.kududb.client.TabletClient.encode(TabletClient.java:251)] [Peer 
93c65b3e604741709596f19c67955dcb] null Sending RPC #3, 
payload=BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) 
"\x00\x00\x00W/\x18\x032(\n 
kudu.tserver.TabletServerService\x12\x04ScanP\x90N&\n 
83efd2d326364b71b5d210c17f036201\x18\x01 \x01"

2016-05-04 21:46:49,646 (New I/O worker #69) [DEBUG - 
org.kududb.client.TabletClient.sendQueuedRpcs(TabletClient.java:757)] [Peer 
93c65b3e604741709596f19c67955dcb] Executing RPC queued: 
ScanRequest(scannerId="83efd2d326364b71b5d210c17f036201", 
tabletSlice=80478b37bfe5468399fbcf90a11fa515, attempt=3)
2016-05-04 21:46:49,646 (New I/O worker #69) [DEBUG - 
org.kududb.client.AsyncKuduScanner$ScanRequest.serialize(AsyncKuduScanner.java:773)]
 Sending scan req: scanner_id: "83efd2d326364b71b5d210c17f036201"
call_seq_id: 1
batch_size_bytes: 1

2016-05-04 21:46:49,648 (New I/O worker #69) [DEBUG - 
org.kududb.client.TabletClient.sendQueuedRpcs(TabletClient.java:757)] [Peer 
93c65b3e604741709596f19c67955dcb] Executing RPC queued: 
ScanRequest(scannerId="83efd2d326364b71b5d210c17f036201", 
tabletSlice=80478b37bfe5468399fbcf90a11fa515, attempt=3)
2016-05-04 21:46:49,648 (New I/O worker #69) [DEBUG - 
org.kududb.client.AsyncKuduScanner$ScanRequest.serialize(AsyncKuduScanner.java:773)]
 Sending scan req: scanner_id: "83efd2d326364b71b5d210c17f036201"
call_seq_id: 1
batch_size_bytes: 1

So what happens is that in the first log we can see "null Sending RPC #3", the
null part is because the Channel for that TabletClient was already sent to null.
We handle that later in sendRpc(), but then at the end of that method we 
directly
call failOrRetryRpc(). Why is that a problem? Because, before printing that 
first
log line, we added the RPC to the rpcs_inflight map. It means that on
channelDisconnected we'll call cleanup() which will also find the same RPC in
rpcs_inflight and also call failOrRetryRpc().

TestAsyncKuduClient#testDisconnect could find this bug because it forcefully
disconnects a TabletServer and then tries to continue using pre-opened scanner
right away. I haven't been able to trigger it myself, but the above explanation
is simple enough for me to believe that this is the right fix.

I also believe that asynchbase has the same problem, but TBH the only way to
encounter it is by doing what we're doing in the test.

Change-Id: Ice511265a3c0186fb7d50eaf1a16a4f952fa84de
---
M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
1 file changed, 9 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/78/2978/1
-- 
To view, visit http://gerrit.cloudera.org:8080/2978
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ice511265a3c0186fb7d50eaf1a16a4f952fa84de
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans

Reply via email to