Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17674 )
Change subject: KUDU-3304: [Alter] Support to alter table's replication factor
......................................................................

Patch Set 12:

(1 comment)

> Patch Set 12:
>
> Thinking more on my previous comment: even if a majority of the replicas
> are not down, let's say one out of three replicas is down, if we reduce
> the replication factor to 1, based on the code below it appears it might
> choose the unhealthy replicas to evict. A test case wouldn't hurt, to be
> sure the healthy replicas are not evicted:
> https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L4873
>
> OR
>
> Do we want to not change the replication factor at all if even one of the
> replicas is not running, to be on the safe side? Probably
> Alexey/Andrew/Bankim can add their opinions.

Right, the master will prioritize bad replicas for eviction if it notices too many. One question to consider is: if I run this tool, under what circumstances can partitions of my table become permanently unavailable? Temporary blips while we are re-replicating upwards seem acceptable, but I'd be more hesitant about scenarios where running this tool would result in required manual intervention that wouldn't have been needed otherwise. Nothing really comes to mind here, but it's worth pondering a bit.

http://gerrit.cloudera.org:8080/#/c/17674/10/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

http://gerrit.cloudera.org:8080/#/c/17674/10/src/kudu/client/client-test.cc@8539
PS10, Line 8539: client_->data_->meta_cache_->ClearCache();
> Yes, I know the cache entries have TTL, and the metacache automatically upd
Just to clarify here, I think Alexey is suggesting that, rather than relying on internal calls like ClearCache(), our test should exercise the public API, making sure nothing fishy happens when running a scan workload in the background.
Perhaps consider updating this test to start a TestWorkload that has no write threads and some read threads, pointing at this table? You may need to set the TestWorkload's schema to match the one in this test.

Regarding the writes scenario, the suggestion here again seems to be to make sure that nothing fishy (no errors, etc.) happens when moving the replication factor up or down. The case Alexey pointed out seems possible: if we change from RF=1 to RF=3, the master would change the Raft config to have two replicas, but writes would then need to be replicated to a majority, which may not be available until a tablet copy has completed. For that, perhaps consider a test with a TestWorkload with write threads, making the cluster flush frequently via flags (--flush_threshold_secs, --flush_threshold_mb), and injecting latency into the copy (e.g. --tablet_copy_download_file_inject_latency_ms)? TestWorkload can be configured to allow errors (e.g. set_timeout_allowed()), and we could then check the number of errors, which we'd expect to be non-zero if the copy truly is blocking us from making progress on writes.

A brief period of write unavailability is probably fine; the suggestions here are just to make sure we understand the behavior when running this new tool in various scenarios.
--
To view, visit http://gerrit.cloudera.org:8080/17674
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3aa2d5b12c508ba761fa9410ad1a465cf31fb7d7
Gerrit-Change-Number: 17674
Gerrit-PatchSet: 12
Gerrit-Owner: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Comment-Date: Fri, 13 Aug 2021 02:05:53 +0000
Gerrit-HasComments: Yes
