Alexey Serbin has posted comments on this change. Change subject: KUDU-2027 retry scan RPC if negotiation times out ......................................................................
Patch Set 14: (3 comments) http://gerrit.cloudera.org:8080/#/c/7037/14//COMMIT_MSG Commit Message: PS14, Line 15: handling. Now, regardless of whether it was the RPC or the overall : operation timeout, the corresponding tablet server is marked as failed : and blacklisted. > we don't currently invalidate this, right? Given that systems like Impala u Right -- without this patch (i.e. currently) we don't invalidate corresponding entries in the client's meta-cache on read timeout. As I understand, if the scan source is set to anything than LEADER_ONLY, the client should try other available replicas. And if it runs out of replicas, marking them failed, then it will not be able to continue -- that's a good observation. Should we re-fetch information on replicas in case of running out of available replicas in client meta-cache (stopping doing that after some tries)? What else can we do to handle this situation if not marking failed ones? Do some round-robin among available sources? http://gerrit.cloudera.org:8080/#/c/7037/14/src/kudu/client/client-test.cc File src/kudu/client/client-test.cc: PS14, Line 1991: // The scanner should timeout on the next batch not fetched by the Open(). : // The client should not retry until the overall timeout expires, giving up : // after the fist RPC timeout error: the only available replica is marked : // as failed by the first timeout error. > I'm not really understanding this comment. Agreed, it sounds like some mumbling. Let me rephrase it: In the scenario below, the scanner has some rows not fetched upon calling KuduScanner::Open() because scanner_max_batch_size_bytes is set to a very small value. There will be timeout on first call of KuduScanner::NextBatch() after scanner is opened. Since the only available entry in the client meta-cache is marked as failed after timing out on the call of KuduScanner::NextBatch(), the client has no other sources for the scanner to try and will fail right after the KuduScanner::NextBatch() call. Is it better? Line 2021: ASSERT_LT(sw.elapsed().wall_millis(), kScanTimeoutMs); > in the case that we know we won't be able to retry a scan elsewhere, should It sounds like a good idea. Now there is allow_time_for_failover provision as well, it's just not activated in this case because I simplified the logic at a few call sites. I'll take a look. -- To view, visit http://gerrit.cloudera.org:8080/7037 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I05a3c38cc07c9b6edd1ae773d9850e92645f3b14 Gerrit-PatchSet: 14 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
