[
https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke updated KUDU-2354:
------------------------------
Target Version/s: (was: 1.8.0)
> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly
> retries operations to add a replacement replica even if replacement is no
> longer needed
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KUDU-2354
> URL: https://issues.apache.org/jira/browse/KUDU-2354
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.0
> Environment: 3 tservers in the cluster, single master (?)
> Reporter: Alexey Serbin
> Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command
> were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> That took about 10-15 minutes to complete, and for some reason ksck kept
> reporting UNAVAILABLE tablets for 5-10 minutes afterwards. Most likely, due
> to the spike of IO activity, tablet leaders didn't receive heartbeats from
> some replicas and tried to replace them. The cluster eventually stabilized
> (ksck reported no problems), but the following messages continued to appear
> in the master's log:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
> {noformat}
> Of course, with just 3 tservers in the cluster, no attempt to add a
> replacement non-voter replica can succeed, but it would make sense to stop
> retrying those operations once a tablet's OpId index is far ahead of the
> cas_config_opid_index of the operation being retried.
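
The stop condition suggested in the last paragraph could be sketched roughly as follows. This is only an illustration in Python, not Kudu's catalog manager code: the `PendingChangeConfig` record, the `should_stop_retrying` helper, and the `max_lag` threshold are all hypothetical names and values, and the report does not say how a `cas_config_opid_index` of -1 should be treated, so the sketch simply compares it like any other index.

```python
from dataclasses import dataclass


@dataclass
class PendingChangeConfig:
    """Hypothetical record of a ChangeConfig RPC that the master keeps retrying."""
    tablet_id: str
    cas_config_opid_index: int  # -1 in the log excerpt from the report
    attempt: int


def should_stop_retrying(op: PendingChangeConfig,
                         tablet_opid_index: int,
                         max_lag: int = 1000) -> bool:
    """Return True once the tablet's OpId index is far ahead of the op's CAS index.

    max_lag is an assumed threshold for "far ahead"; the report gives no value.
    """
    return tablet_opid_index - op.cas_config_opid_index > max_lag


# Values taken from the log excerpt; the tablet OpId indexes are made up.
op = PendingChangeConfig("2776eb10c241426e90ddf7354260ee04", -1, 22)
print(should_stop_retrying(op, tablet_opid_index=5000))  # far ahead: stop
print(should_stop_retrying(op, tablet_opid_index=10))    # still close: keep retrying
```

A fixed threshold is just the simplest choice for the sketch; an actual fix might instead re-check whether the tablet still needs a replacement replica before each retry.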