Alexey Serbin has posted comments on this change.

Change subject: KUDU-2027 retry scan RPC if negotiation times out
......................................................................


Patch Set 14:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7037/14//COMMIT_MSG
Commit Message:

PS14, Line 15: handling. Now, regardless of whether it was the RPC or the 
overall
             : operation timeout, the corresponding tablet server is marked as 
failed
             : and blacklisted.
> we don't currently invalidate this, right? Given that systems like Impala u
Right -- without this patch (i.e. currently) we don't invalidate corresponding 
entries in the client's meta-cache on read timeout.

As I understand, if the scan source is set to anything than LEADER_ONLY, the 
client should try other available replicas.  And if it runs out of replicas, 
marking them failed, then it will not be able to continue -- that's a good 
observation.

Should we re-fetch information on replicas in case of running out of available 
replicas in client meta-cache (stopping doing that after some tries)?

What else can we do to handle this situation if not marking failed ones?  Do 
some round-robin among available sources?


http://gerrit.cloudera.org:8080/#/c/7037/14/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

PS14, Line 1991:   // The scanner should timeout on the next batch not fetched 
by the Open().
               :   // The client should not retry until the overall timeout 
expires, giving up
               :   // after the fist RPC timeout error: the only available 
replica is marked
               :   // as failed by the first timeout error.
> I'm not really understanding this comment.
Agreed, it sounds like some mumbling.  Let me rephrase it:

In the scenario below, the scanner has some rows not fetched upon calling 
KuduScanner::Open() because scanner_max_batch_size_bytes is set to a very small 
value.  There will be timeout on first call of KuduScanner::NextBatch() after 
scanner is opened.  Since the only available entry in the client meta-cache is 
marked as failed after timing out on the call of KuduScanner::NextBatch(), the 
client has no other sources for the scanner to try and will fail right after 
the KuduScanner::NextBatch() call.

Is it better?


Line 2021:     ASSERT_LT(sw.elapsed().wall_millis(), kScanTimeoutMs);
> in the case that we know we won't be able to retry a scan elsewhere, should
It sounds like a good idea.

Now there is allow_time_for_failover provision as well, it's just not activated 
in this case because I simplified the logic at a few call sites.  I'll take a 
look.


-- 
To view, visit http://gerrit.cloudera.org:8080/7037
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I05a3c38cc07c9b6edd1ae773d9850e92645f3b14
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to