[ 
https://issues.apache.org/jira/browse/KUDU-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1387:
------------------------------
    Status: In Review  (was: Open)

> Scanner gets into tight loop followed by long sleep when leader TS is down
> --------------------------------------------------------------------------
>
>                 Key: KUDU-1387
>                 URL: https://issues.apache.org/jira/browse/KUDU-1387
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.7.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> I spent some time looking into linked_list-test flakiness and found the 
> following really bad behavior:
> - the leader TS of a tablet goes down
> - the other two replicas are up but in a state that prevents scanning (eg 
> bootstrapping)
> - the scanner tries to read the two "bootstrapping" tablets, and gets some 
> error which ends up adding the replicas to the scanner's blacklist.
> - it tries to read from the down TS, and gets "connection refused". We handle 
> this slightly differently since it's a TS-wide error, and we mark the TS as 
> down in the meta cache. We currently do _not_ add it to the scanner's 
> blacklist. We also do not do any backoff/sleep.
> - on the next LookupTablet call, we see that the last-known leader is marked 
> down, and decide that we should refresh locations from the master.
> -- when we get the response back from the master, we clear the 'failed' flag 
> on the replica. (I'm not sure this is justifiable, but it's the current 
> policy)
> - we then try to select a TS for the scanner again, and since the down 
> machine is no longer marked as "failed", and not in the blacklist, we select 
> it again.
> - the loop continues from the top
> Because there is no sleeping involved here, we end up doing hundreds or 
> thousands of round trips to the master here as we wait for the leader to 
> change. This would be somewhat bad on its own, but in fact the problem is 
> even worse: when the master finally learns of a new leader, the cycle breaks. 
> Meanwhile, we had increased the 'attempts' count to a very high value, and 
> then our backoff code decides it's a good idea to sleep for 800 seconds 
> (regardless of deadline, etc). This causes linked_list-test to time out and 
> probably would cause lots of problems for real use cases as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to