Todd Lipcon created KUDU-1466:
---------------------------------

             Summary: C++ client errors misreported as GetTableLocations timeouts
                 Key: KUDU-1466
                 URL: https://issues.apache.org/jira/browse/KUDU-1466
             Project: Kudu
          Issue Type: Bug
          Components: client
    Affects Versions: 0.8.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Critical


client-test is currently very flaky due to this issue:
- we inject some kind of failure on the tablet server (e.g. a DNS resolution 
failure)
- when we fail to connect to the TS, we correctly re-trigger a lookup against 
the master
- depending on how the backoffs and retries line up, we sometimes end up 
triggering the lookup retry when the remaining operation budget is very short 
(e.g. <10ms)
-- this GetTableLocations RPC times out, since the master is unable to respond 
within the ridiculously short timeout, and that timeout is what gets reported 
instead of the original TS failure
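
To make the timing problem concrete, here is a minimal sketch of how a lookup 
RPC timeout could shrink as the operation budget is used up. The helper name 
LookupRpcTimeout and the rule "timeout = min(remaining budget, default RPC 
timeout)" are assumptions for illustration, not the actual Kudu client code.

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical helper (not part of the Kudu client API): the timeout given to
// a master lookup issued as part of a larger, deadline-bound operation.
// Assumption: the lookup may never outlive the operation's remaining budget.
inline milliseconds LookupRpcTimeout(milliseconds operation_budget,
                                     milliseconds elapsed,
                                     milliseconds default_rpc_timeout) {
  // Clamp at zero in case the budget is already exhausted.
  const milliseconds remaining =
      std::max(operation_budget - elapsed, milliseconds(0));
  // The lookup gets whichever is smaller: what is left, or the default.
  return std::min(remaining, default_rpc_timeout);
}
```

With a 5000ms budget and 4993ms already spent on backoffs and retries, this 
helper hands the lookup only 7ms, which is the kind of timeout the master 
cannot realistically meet.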

While retrying an operation, we should probably not replace the 'last_error' 
with a master error, so long as we have had at least one successful master 
lookup (thus indicating that the master is not the problem).
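
The proposal above could be sketched roughly as follows. This is only an 
illustration under stated assumptions: Error, RetryErrorTracker and their 
methods are hypothetical names, not the types used in the Kudu C++ client, and 
the real fix may look quite different.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a status object; this is not kudu::Status.
struct Error {
  enum class Source { kNone, kTabletServer, kMaster };
  Source source;
  std::string message;
};

// Hypothetical tracker for the error reported when a retried operation
// finally gives up.
class RetryErrorTracker {
 public:
  // Call whenever a master lookup completes successfully.
  void RecordMasterSuccess() { master_lookup_succeeded_ = true; }

  // Call whenever an attempt fails. A master error is dropped if the master
  // has already answered once and an earlier error is stored; otherwise the
  // new error becomes the last error.
  void RecordFailure(Error err) {
    assert(err.source != Error::Source::kNone);
    const bool is_master = err.source == Error::Source::kMaster;
    const bool have_error = last_error_.source != Error::Source::kNone;
    if (is_master && master_lookup_succeeded_ && have_error) {
      return;  // Keep the earlier, more informative error.
    }
    last_error_ = std::move(err);
  }

  const Error& last_error() const { return last_error_; }

 private:
  bool master_lookup_succeeded_ = false;
  Error last_error_{Error::Source::kNone, ""};
};
```

In the client-test scenario, a tracker that has seen one successful lookup and 
then a TS connection failure would keep the TS failure as last_error even if a 
later lookup retry times out. A master error is still recorded when no lookup 
has ever succeeded, since in that case the master may really be the problem.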



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
