[ https://issues.apache.org/jira/browse/KUDU-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated KUDU-1387:
------------------------------
    Description: 
I spent some time looking into linked_list-test flakiness and found the 
following really bad behavior:
- the leader TS of a tablet goes down
- the other two replicas are up, but in a state that prevents scanning (e.g. 
bootstrapping)
- the scanner tries to read from the two "bootstrapping" replicas and gets an 
error, which ends up adding those replicas to the scanner's blacklist.
- it tries to read from the down TS, and gets "connection refused". We handle 
this slightly differently since it's a TS-wide error, and we mark the TS as down 
in the meta cache. We currently do _not_ add it to the scanner's blacklist, and 
we also do not do any backoff/sleep.
- on the next LookupTablet call, we see that the last-known leader is marked 
down, and decide that we should refresh locations from the master.
-- when we get the response back from the master, we clear the 'failed' flag on 
the replica. (I'm not sure this is justifiable, but it's the current policy)
- we then try to select a TS for the scanner again, and since the down machine 
is no longer marked as "failed" and is not in the blacklist, we select it again.
- the loop continues from the top (a rough sketch of this loop is below)
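
To make the tight loop concrete, here is a minimal, self-contained C++ sketch of 
the selection logic as described above. All names (Replica, TabletLocations, 
SelectTsForScan, RefreshFromMaster) are hypothetical stand-ins rather than the 
actual client code; the point is only that a master refresh that clears the 
'failed' flag, combined with never blacklisting the down TS and never sleeping, 
lets the same dead replica be selected on every pass.

{code}
// Hypothetical, simplified stand-ins for the client-side structures described
// above (not the real Kudu client classes); requires C++14 or later.
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Replica {
  std::string ts_uuid;
  bool is_leader = false;
  bool failed = false;  // set when the TS is marked down in the meta cache
};

struct TabletLocations {
  std::vector<Replica> replicas;
};

// Simulates refreshing locations from the master: per the current policy,
// every refresh clears the 'failed' flag on all replicas.
void RefreshFromMaster(TabletLocations* locs) {
  for (auto& r : locs->replicas) r.failed = false;
}

// Selects a TS for the scan, skipping blacklisted and failed replicas.
Replica* SelectTsForScan(TabletLocations* locs,
                         const std::set<std::string>& blacklist) {
  for (auto& r : locs->replicas) {
    if (blacklist.count(r.ts_uuid) || r.failed) continue;
    return &r;  // in the scenario above, this is always the down leader
  }
  return nullptr;
}

int main() {
  TabletLocations locs;
  locs.replicas = {{"ts-down", /*is_leader=*/true},
                   {"ts-bootstrapping-1"},
                   {"ts-bootstrapping-2"}};
  // The two bootstrapping replicas already returned scan errors and were
  // blacklisted; the down leader was only marked 'failed', never blacklisted.
  std::set<std::string> blacklist = {"ts-bootstrapping-1", "ts-bootstrapping-2"};

  for (int attempts = 1; attempts <= 6; ++attempts) {  // the real loop spins far longer
    Replica* r = SelectTsForScan(&locs, blacklist);
    if (r == nullptr) {
      // Everyone is blacklisted or failed: go back to the master. The reply
      // clears 'failed', so the down leader becomes selectable again.
      RefreshFromMaster(&locs);
      continue;
    }
    std::cout << "attempt " << attempts << ": scanning " << r->ts_uuid
              << " -> connection refused\n";
    // Connection refused: mark the TS down in the meta cache, but do NOT add
    // it to the scanner's blacklist and do NOT sleep before the next lookup.
    r->failed = true;
  }
  return 0;
}
{code}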

Because there is no sleeping involved, we end up doing hundreds or thousands of 
round trips to the master while we wait for the leader to change. This would be 
somewhat bad on its own, but the problem is actually even worse: when the master 
finally learns of a new leader, the cycle breaks, but by then we have 
incremented the 'attempts' count to a very high value, and our backoff code 
decides it's a good idea to sleep for 800 seconds (regardless of the deadline, 
etc.). This causes linked_list-test to time out and would probably cause lots of 
problems for real use cases as well.
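
As a rough illustration of the second half of the problem (again, not the 
actual client code, and with made-up constants): an attempts-driven exponential 
backoff that is capped only at a fixed ceiling, rather than at the operation's 
remaining deadline, will jump straight to that ceiling once the tight loop has 
inflated 'attempts' into the hundreds.

{code}
// Illustrative only: an attempts-based exponential backoff with a fixed cap
// that ignores the caller's deadline. The constants are invented for the
// example, not taken from the client.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>

std::chrono::milliseconds ComputeBackoff(int attempts) {
  const double base_ms = 10.0;            // hypothetical base delay
  const double cap_ms = 800.0 * 1000.0;   // hypothetical 800-second ceiling
  const double delay_ms = base_ms * std::pow(2.0, attempts);
  return std::chrono::milliseconds(
      static_cast<long long>(std::min(delay_ms, cap_ms)));
}

int main() {
  // After a handful of genuine retries the sleep is still modest...
  std::cout << "attempts=5:   " << ComputeBackoff(5).count() << " ms\n";
  // ...but once the tight loop has burned through hundreds of attempts, the
  // delay saturates at the cap: an 800-second sleep, with no regard for how
  // much of the scan's deadline actually remains.
  std::cout << "attempts=500: " << ComputeBackoff(500).count() << " ms\n";
  return 0;
}
{code}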


> Scanner gets into tight loop followed by long sleep when leader TS is down
> --------------------------------------------------------------------------
>
>                 Key: KUDU-1387
>                 URL: https://issues.apache.org/jira/browse/KUDU-1387
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.7.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>


