[jira] [Updated] (KUDU-2354) In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed

Grant Henke (Jira) Tue, 02 Jun 2020 19:30:48 -0700


     [ https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Grant Henke updated KUDU-2354:
------------------------------
    Target Version/s:   (was: 1.8.0)

> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly 
> retries operations to add a replacement replica even if replacement is no 
> longer needed
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2354
>                 URL: https://issues.apache.org/jira/browse/KUDU-2354
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>         Environment: 3 tservers in the cluster, single master (?)
>            Reporter: Alexey Serbin
>            Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command 
> were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> The run took about 10-15 minutes to complete, and for some reason ksck kept
> reporting UNAVAILABLE tablets for another 5-10 minutes afterwards.  Most
> likely, because of the spike in IO activity, tablet leaders didn't receive
> heartbeats from some replicas and tried to replace them.  After some time
> the cluster stabilized (ksck reported no problems), but the following
> messages continued to appear in the master's log:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
> {noformat}
> Of course, with just 3 tservers in the cluster, not a single attempt to add
> a replacement non-voter replica can succeed, since every tserver already
> hosts a replica of the tablet.  Still, it would make sense to stop retrying
> those operations once a tablet's OpId index is far ahead of the
> cas_config_opid_index of the operation being retried.
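
The stop condition proposed in the description could be sketched roughly as follows. This is only an illustration of the comparison, not code from catalog_manager.cc: the PendingChangeConfig struct, the ShouldStopRetrying function and the max_lag threshold are all hypothetical names. The sketch compares the raw values, so a cas_config_opid_index of -1 (as in the log above) simply counts as a very old index; whether that is the right treatment of -1 is an open question.

```cpp
#include <cstdint>

// Hypothetical description of a ChangeConfig operation that is being retried.
// None of these names come from the Kudu code base.
struct PendingChangeConfig {
  int64_t cas_config_opid_index;  // index the operation was issued against
  int attempt;                    // retry attempt number, as printed in the log
};

// Returns true when the tablet's current OpId index is more than `max_lag`
// entries ahead of the index the retried operation was issued against.
// `max_lag` is an assumed tuning knob for what counts as "far ahead".
bool ShouldStopRetrying(const PendingChangeConfig& op,
                        int64_t tablet_opid_index,
                        int64_t max_lag) {
  return tablet_opid_index - op.cas_config_opid_index > max_lag;
}
```

With an assumed max_lag of 100, an operation carrying cas_config_opid_index -1 would be dropped once the tablet's OpId index reaches 100, while an operation issued at index 10 would keep being retried until the tablet's index exceeds 110.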



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
