[kudu-CR] KUDU-3571: fix flakiness in AutoIncrementingItest.BootstrapNoWalsNoData

Alexey Serbin (Code Review) Wed, 18 Dec 2024 19:15:18 -0800

Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22133 )


Change subject: KUDU-3571: fix flakiness in 
AutoIncrementingItest.BootstrapNoWalsNoData
......................................................................


Patch Set 7: Code-Review+2

(3 comments)

http://gerrit.cloudera.org:8080/#/c/22133/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22133/1//COMMIT_MSG@10
PS1, Line 10:  not been initialized or inab
> Thank you all for reviewing!
Alright, this looks like a reliable solution to me.

Thank you!


http://gerrit.cloudera.org:8080/#/c/22133/1//COMMIT_MSG@10
PS1, Line 10:  not been initialized or inab
> IIUC, the leader does send updates right away, but on peers the update may
> take up to FLAGS_raft_heartbeat_interval_ms to arrive.

An update is first stored into the follower replica's WAL, and only after that 
it's 'prepared' and 'applied'.  Only after completion of all these phases, the 
update is visible to a Kudu client that reads data from the follower replica.

Once the operation is stored in the WAL, the replica acks it to the leader
replica with the corresponding response to the original RPC, and that's how
the leader replica knows the follower has persisted the data.  It might take
much longer than the Raft heartbeat interval for the acknowledgment to arrive
at the leader replica, of course.  There isn't an upper limit there, except
for the overall timeout for the Raft consensus RPC.

As for the 'prepare' and 'apply' phases, those are separate from the WAL
write, and they might take a long time as well under certain conditions,
especially if the apply queue is very long (e.g., see KUDU-1587 for anecdotal
evidence of apply queue wait times).
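
To make the ordering above concrete, here is a minimal toy sketch of it.  It
is only an illustration under my own assumptions: ToyFollower and its methods
are hypothetical names, not Kudu classes or functions, and the model collapses
'prepare' and 'apply' into a single step.  It shows that the ack to the leader
happens once the operation is in the WAL, while a reader of the follower sees
the operation only after it has been applied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model only: none of these names exist in the Kudu code base.
struct ToyFollower {
  std::vector<int64_t> wal;         // operations persisted in the WAL
  std::deque<int64_t> apply_queue;  // persisted, but not yet applied
  std::vector<int64_t> applied;     // operations visible to readers

  // Persist the operation in the WAL and ack it to the leader.
  // The returned 'true' models the ack; the operation is queued for
  // 'prepare'/'apply' and is NOT yet visible to readers.
  bool ReplicateAndAck(int64_t op) {
    wal.push_back(op);
    apply_queue.push_back(op);
    return true;
  }

  // 'Prepare' and 'apply' the oldest queued operation, in WAL order.
  // Returns false if there was nothing to apply.
  bool PrepareAndApplyOne() {
    if (apply_queue.empty()) return false;
    applied.push_back(apply_queue.front());
    apply_queue.pop_front();
    return true;
  }

  // What a client reading from this follower replica would observe.
  bool IsVisible(int64_t op) const {
    return std::find(applied.begin(), applied.end(), op) != applied.end();
  }
};
```

In this model, a test that reads from a follower right after the leader got
the ack can still miss the row, so a check has to wait until the operation is
applied rather than assume a fixed delay such as the heartbeat interval.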


http://gerrit.cloudera.org:8080/#/c/22133/1//COMMIT_MSG@10
PS1, Line 10:  not been initialized or inab
> It helps to catch any regressions. For example, at any point the leader is
> more likely than the followers to have flushed data to disk. In cases where
> we have to populate the counter while a few replicas have flushed the data
> and a few have not, we should not expect different values for the counter.

Alright, but isn't it covered by many existing tests for tablet Raft consensus 
already?

OK, it seems CheckCluster() for ClusterVerifier should take care of this anyway 
as of PS7.



--
To view, visit http://gerrit.cloudera.org:8080/22133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5bd387c82b632dbb77aa5a45f831273392ae05b4
Gerrit-Change-Number: 22133
Gerrit-PatchSet: 7
Gerrit-Owner: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Comment-Date: Thu, 19 Dec 2024 03:10:50 +0000
Gerrit-HasComments: Yes
