[
https://issues.apache.org/jira/browse/KUDU-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078446#comment-16078446
]
Alexey Serbin commented on KUDU-694:
------------------------------------
An update to summarize the current state of affairs (as far as I can see):
* The first item still holds. Marking a server as failed is specific to a
tablet, so querying some other tablet on the same server is not affected by
the mark made for a prior one. However, it still affects scans with the
LEADER_ONLY selector.
* Not failing over to another leader during the call is addressed: if there was
an error from the server hosting the leader tablet (or any other tablet),
{{LookupRpc::SendRpc()}} will not use the 'fast path' and will do server
resolution again by calling {{MasterServerProxy::GetTableLocationsAsync()}}.
* The non-retried {{GetTabletServer()}} is now retried from the upper level
(i.e. in {{KuduScanner::Data::OpenTablet()}}), but a failure of DNS resolution
in the path of {{KuduClient::Data::GetTabletServer()}} will result in a
non-retriable error returned to the top level from
{{KuduScanner::Data::OpenTablet()}}.
Also, I suspect there are other places like that, so an additional pass over
the code is needed. Besides, we need to understand whether it makes sense to
retry in such cases.
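
To make the last point concrete, here is a minimal sketch of what retrying a
lookup-style call until the scan deadline could look like. It is not the Kudu
client code: the Status/Code types, IsRetriable() and RetryUntilDeadline() are
hypothetical stand-ins, and treating a network (e.g. DNS) error as retriable is
exactly the policy question raised above, assumed here only for illustration.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for the client's status type (not kudu::Status).
enum class Code { kOk, kNetworkError, kServiceUnavailable, kInvalidArgument };

struct Status {
  Code code;
  std::string message;
  bool ok() const { return code == Code::kOk; }
};

// Assumed policy: transient network-level failures (such as a failed DNS
// resolution) and "service unavailable" are worth retrying; anything else
// is returned to the caller right away.
inline bool IsRetriable(const Status& s) {
  return s.code == Code::kNetworkError ||
         s.code == Code::kServiceUnavailable;
}

// Runs 'op' at least once. While it fails with a retriable error and another
// attempt (after sleeping for 'backoff') would still start before 'deadline',
// sleeps and tries again. Returns the status of the last attempt.
inline Status RetryUntilDeadline(
    const std::function<Status()>& op,
    std::chrono::steady_clock::time_point deadline,
    std::chrono::milliseconds backoff) {
  Status s = op();
  while (!s.ok() && IsRetriable(s) &&
         std::chrono::steady_clock::now() + backoff < deadline) {
    std::this_thread::sleep_for(backoff);
    s = op();
  }
  return s;
}
```

With something of this shape, a transient DNS failure would be retried within
the scan's own deadline, while a non-retriable error would still surface after
the first attempt.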
> Re-visit C++ client scan retry logic
> ------------------------------------
>
> Key: KUDU-694
> URL: https://issues.apache.org/jira/browse/KUDU-694
> Project: Kudu
> Issue Type: Bug
> Components: client
> Affects Versions: Private Beta
> Reporter: Andrew Wang
>
> There are a number of remaining issues with scanner robustness, even after
> KUDU-597:
> * Once a node is marked as failed, it will not be used again in the call.
> This is more of an issue with longer timeouts (since the node is more likely
> to come back), or if the scan is LEADER_ONLY (since only one node being down
> leads to unavailability).
> * In the LEADER_ONLY case, since we don't refresh quorum information within
> the call, we won't recover when a failover happens.
> * The scanner code calls a number of other RPCs that are not retried on
> error, i.e. LookupTabletByKey or RefreshProxy's DNS resolution in
> GetTabletServer.