Todd Lipcon has posted comments on this change.

Change subject: KUDU-2027 retry scan RPC if negotiation times out
......................................................................


Patch Set 14:

(3 comments)

didn't look in detail at the code yet, but got some higher level design 
questions here

http://gerrit.cloudera.org:8080/#/c/7037/14//COMMIT_MSG
Commit Message:

PS14, Line 15: handling. Now, regardless of whether it was the RPC or the 
overall
             : operation timeout, the corresponding tablet server is marked as 
failed
             : and blacklisted.
we don't currently invalidate this, right? Given that systems like Impala use a 
long-running client in the backend, it seems slightly dangerous that a single 
read timeout will "permanently" blacklist a server (at least until the other 
remaining replicas get blacklisted too).

With writes this isn't that impactful, since we could never get "stuck" on a 
non-leader, and thus we'll always end up going back to the leader. But with 
reads I think it's quite plausible, no?


http://gerrit.cloudera.org:8080/#/c/7037/14/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

PS14, Line 1991:   // The scanner should timeout on the next batch not fetched 
by the Open().
               :   // The client should not retry until the overall timeout 
expires, giving up
               :   // after the fist RPC timeout error: the only available 
replica is marked
               :   // as failed by the first timeout error.
I'm not really understanding this comment.


Line 2021:     ASSERT_LT(sw.elapsed().wall_millis(), kScanTimeoutMs);
in the case that we know we won't be able to retry a scan elsewhere, shouldn't 
we set the timeout to be the full remaining user-specified timeout?

i.e I think if the user specifies 60sec timeout, and we have a non-retriable 
operation like "nextbatch" on a scan, then we should use that entire 60sec 
timeout on the one server they gave. It's always possible that it will come 
back after 45sec or whatever, no?

I think this is what the 'allow_time_for_failover' thing was doing before?


-- 
To view, visit http://gerrit.cloudera.org:8080/7037
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I05a3c38cc07c9b6edd1ae773d9850e92645f3b14
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to