[ https://issues.apache.org/jira/browse/KUDU-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon reassigned KUDU-1387:
---------------------------------
Assignee: Todd Lipcon
> Scanner gets into tight loop followed by long sleep when leader TS is down
> --------------------------------------------------------------------------
>
> Key: KUDU-1387
> URL: https://issues.apache.org/jira/browse/KUDU-1387
> Project: Kudu
> Issue Type: Bug
> Components: client
> Affects Versions: 0.7.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> I spent some time looking into linked_list-test flakiness and found the
> following really bad behavior:
> - the leader TS of a tablet goes down
> - the other two replicas are up but in a state that prevents scanning (e.g.
> bootstrapping)
> - the scanner tries to read from the two "bootstrapping" replicas, and gets
> some error which ends up adding those replicas to the scanner's blacklist.
> - it tries to read from the down TS, and gets "connection refused". We handle
> this slightly differently since it's a TS-wide error, and we mark the TS as
> down in the meta cache. We currently do _not_ add it to the scanner's
> blacklist. We also do not do any backoff/sleep.
> - on the next LookupTablet call, we see that the last-known leader is marked
> down, and decide that we should refresh locations from the master.
> -- when we get the response back from the master, we clear the 'failed' flag
> on the replica. (I'm not sure this is justifiable, but it's the current
> policy)
> - we then try to select a TS for the scanner again, and since the down
> machine is no longer marked as "failed", and not in the blacklist, we select
> it again.
> - the loop continues from the top
> Because there is no sleeping involved, we end up doing hundreds or
> thousands of round trips to the master while we wait for the leader to
> change. This would be somewhat bad on its own, but in fact the problem is
> even worse: when the master finally learns of a new leader, the cycle breaks,
> but by then we have increased the 'attempts' count to a very high value, and
> our backoff code decides it's a good idea to sleep for 800 seconds
> (regardless of deadline, etc). This causes linked_list-test to time out and
> probably would cause lots of problems for real use cases as well.
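
As an illustration of the two gaps described above (no sleep between retries, then a sleep that ignores the deadline), here is a minimal Python sketch of a retry loop whose backoff is capped and clamped to the time remaining. It is not the Kudu client code: the names backoff_seconds, retry_until_deadline, try_once, and the base/cap values are all hypothetical, chosen only for the example.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the retry loop runs out of time."""


def backoff_seconds(attempt, remaining, base=0.01, cap=5.0):
    # Exponential growth by attempt count, capped at `cap`, and never
    # longer than the time left before the caller's deadline.
    # base and cap are illustrative values, not Kudu defaults.
    exponent = min(max(attempt - 1, 0), 30)
    return max(0.0, min(base * (2 ** exponent), cap, remaining))


def retry_until_deadline(try_once, timeout, clock=time.monotonic,
                         sleep=time.sleep):
    # try_once is a hypothetical callable: it returns a result, or raises
    # ConnectionError for a retriable failure such as "connection refused".
    deadline = clock() + timeout
    attempt = 0
    while True:
        attempt += 1
        try:
            return try_once()
        except ConnectionError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise DeadlineExceeded(f"gave up after {attempt} attempts")
            # Always sleep a little between retries (avoids the tight
            # loop), but never past the deadline (avoids the long sleep).
            sleep(backoff_seconds(attempt, remaining))
```

With this shape, a very high attempt count can no longer turn into a multi-minute sleep: the delay is bounded by both the cap and the remaining time, whichever is smaller.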