[ 
https://issues.apache.org/jira/browse/KUDU-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558150#comment-15558150
 ] 

David Alves commented on KUDU-1466:
-----------------------------------

We could collect all the errors seen in the lifetime of an operation, such as 
a session write, and add a way to retrieve them. This would likely help a lot 
with debugging, as we would be able to pinpoint repeated errors, etc., and we 
wouldn't be forced to choose the "best" error.

Space would be a consideration, but as long as we don't have tight retry loops 
maybe it wouldn't be so bad?
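
A rough sketch of what such a collector might look like. All names here 
(ErrorHistory, Record, dropped) are hypothetical and not part of the Kudu 
client. It assumes errors can be compared by message, caps the number of 
stored entries, and folds consecutive repeats into a counter so that a tight 
retry loop does not grow the history:

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical sketch only: none of these names exist in the Kudu client.
class ErrorHistory {
 public:
  struct Entry {
    std::string message;  // Text of the error as it was reported.
    int count;            // Number of consecutive times it was seen.
  };

  explicit ErrorHistory(std::size_t max_entries) : max_entries_(max_entries) {}

  // Records one error seen during the operation. A repeat of the most recent
  // error only bumps its counter, so repeated retries cost no extra space.
  void Record(const std::string& message) {
    if (!entries_.empty() && entries_.back().message == message) {
      ++entries_.back().count;
      return;
    }
    if (max_entries_ == 0) {
      ++dropped_;
      return;
    }
    if (entries_.size() == max_entries_) {
      // Bound the space used: evict the oldest entry and remember that we did.
      entries_.pop_front();
      ++dropped_;
    }
    entries_.push_back(Entry{message, 1});
  }

  const std::deque<Entry>& entries() const { return entries_; }

  // Number of distinct entries discarded because of the size cap.
  std::size_t dropped() const { return dropped_; }

  // Renders the history oldest-first, e.g. for inclusion in a final Status.
  std::string ToString() const {
    std::string out;
    for (const Entry& e : entries_) {
      if (!out.empty()) out += "; ";
      out += e.message + " (x" + std::to_string(e.count) + ")";
    }
    return out;
  }

 private:
  std::size_t max_entries_;
  std::size_t dropped_ = 0;
  std::deque<Entry> entries_;
};
```

With a cap of a few dozen entries per operation the overhead should stay 
small, and the caller sees the whole sequence of failures rather than only 
the last one.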

> C++ client errors misreported as GetTableLocations timeouts
> -----------------------------------------------------------
>
>                 Key: KUDU-1466
>                 URL: https://issues.apache.org/jira/browse/KUDU-1466
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> client-test is currently very flaky due to this issue:
> - we are injecting some kind of failure on the tablet server (eg DNS 
> resolution failure)
> - when we fail to connect to the TS, we correctly re-trigger a lookup against 
> the master
> - depending how the backoffs and retries line up, we sometimes end up 
> triggering the lookup retry when the remaining operation budget is very short 
> (eg <10ms)
> -- this GetTabletLocations RPC times out since the master is unable to 
> respond within the ridiculously short timeout
> During the course of retrying some operation, we should probably not replace 
> the 'last_error' with a master error, so long as we have had at least one 
> successful master lookup (thus indicating that the master is not the problem)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
