[
https://issues.apache.org/jira/browse/KUDU-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078475#comment-16078475
]
Alexey Serbin commented on KUDU-694:
------------------------------------
As a side note, it would be nice to understand whether any of our tests cover
those issues at all.
Probably the best way to categorize and address the scan retry logic would be
to put together a set of use cases and create a set of tests asserting the
desired behavior. Most likely, more than 50% of that is already covered by
existing tests, but I'm not sure it's anywhere close to the 100% mark.
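To make the "set of use cases" idea a bit more concrete, here is a rough,
self-contained sketch of what such a table could look like. It is a toy model,
not Kudu client code: Policy, ClusterState, UseCase, ScanSucceeds and
CountUnmetUseCases are all hypothetical names made up for illustration, and
each "attempt" stands in for one retry within a single scan call. The three
scenarios are taken from the issue description (a failed node coming back, a
leader failover in the LEADER_ONLY case, and all replicas being briefly down).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model; none of these names exist in the Kudu client.
enum class Selection { kLeaderOnly, kAnyReplica };

struct Policy {
  bool forget_failed_between_attempts;   // give failed replicas another chance
  bool refresh_leader_between_attempts;  // re-read who the leader is
};

// Cluster state as seen during one attempt of the scan call.
struct ClusterState {
  std::vector<bool> up;  // up[i] is true if replica i answers RPCs
  std::size_t leader;    // index of the current leader replica
};

struct UseCase {
  const char* name;
  Selection selection;
  std::vector<ClusterState> timeline;  // one entry per attempt
};

// Returns true if some attempt in the timeline reaches a usable replica.
bool ScanSucceeds(const Policy& policy, const UseCase& uc) {
  assert(!uc.timeline.empty());
  const std::size_t n = uc.timeline[0].up.size();
  std::vector<bool> failed(n, false);
  std::size_t known_leader = uc.timeline[0].leader;
  for (const ClusterState& now : uc.timeline) {
    if (policy.refresh_leader_between_attempts) known_leader = now.leader;
    if (policy.forget_failed_between_attempts) failed.assign(n, false);
    for (std::size_t r = 0; r < n; ++r) {
      if (failed[r]) continue;
      const bool leader_only = uc.selection == Selection::kLeaderOnly;
      if (leader_only && r != known_leader) continue;
      // A stale leader that is up but no longer leader counts as a failure.
      if (now.up[r] && (!leader_only || r == now.leader)) return true;
      failed[r] = true;
    }
  }
  return false;
}

// Every use case below is expected to succeed; count the ones that do not.
int CountUnmetUseCases(const Policy& policy) {
  const std::vector<UseCase> cases = {
      {"leader comes back", Selection::kLeaderOnly,
       {ClusterState{{false, true, true}, 0},
        ClusterState{{true, true, true}, 0}}},
      {"leader fails over", Selection::kLeaderOnly,
       {ClusterState{{false, true, true}, 0},
        ClusterState{{false, true, true}, 1}}},
      {"all replicas briefly down", Selection::kAnyReplica,
       {ClusterState{{false, false, false}, 0},
        ClusterState{{false, true, false}, 0}}},
  };
  int unmet = 0;
  for (const UseCase& uc : cases) {
    if (!ScanSucceeds(policy, uc)) ++unmet;
  }
  return unmet;
}
```

In this model, a policy with both flags off (roughly the behavior described in
the issue) leaves all three use cases unmet, while turning both flags on
satisfies all of them. Real tests would of course have to drive an actual
cluster rather than a timeline of booleans.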
> Re-visit C++ client scan retry logic
> ------------------------------------
>
> Key: KUDU-694
> URL: https://issues.apache.org/jira/browse/KUDU-694
> Project: Kudu
> Issue Type: Bug
> Components: client
> Affects Versions: Private Beta
> Reporter: Andrew Wang
>
> There are a number of remaining issues with scanner robustness, even after
> KUDU-597:
> * Once a node is marked as failed, it will not be used again in the call.
> This is more of an issue with longer timeouts (since the node is more likely
> to come back), or if the scan is LEADER_ONLY (since only one node being down
> leads to unavailability).
> * In the LEADER_ONLY case, since we don't refresh quorum information within
> the call, we won't recover when a failover happens.
> * The scanner code calls a number of other RPCs that are not retried on
> error, e.g. LookupTabletByKey or RefreshProxy's DNS resolution in
> GetTabletServer.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)