Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10
PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running 
kudu-tserver
> That is a valuable information, Yuqi.
+1

Yuqi, it would be great if could you share more information on what was the 
environment when this issue happened in your cluster.  Was there anything 
specific that you think might be the culprit?

As for Azure VMs, I'm suspecting (but it's not yet confirmed) that it somehow 
related to the activity of so-called VMICTimeSync service which updates VM's 
clock after no-downtime maintenance (a.k.a. memory preserving maintenance) [1] 
https://learn.microsoft.com/en-us/azure/virtual-machines/linux/time-sync

--- snip ---
Virtual machine interactions with the host can also affect the clock. During 
memory preserving maintenance, VMs are paused for up to 30 seconds. For 
example, before maintenance begins the VM clock shows 10:00:00 AM and lasts 28 
seconds. After the VM resumes, the clock on the VM would still show 10:00:00 
AM, which would be 28 seconds off. To correct for this, the VMICTimeSync 
service monitors what is happening on the host and updates the time-of-day 
clock in Linux VMs to compensate.
--- snip ---

[1] https://learn.microsoft.com/en-us/azure/virtual-machines/linux/time-sync


http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@439
PS2, Line 439:     if (is_wall_clock_jump_check_enabled_) {
> I am not sure If it is better that use 'FLAGS_enable_wall_clock_jump_check'
The idea in using constant member fields of the HybridClock class is to allow 
the compiler to optimize this code as much as possible since it is in the hot 
path of processing WriteRequestPB, etc.

Supporting toggling this flag in run time doesn't bring a lot of benefits IMO.  
If the system clock jumped ahead when this flag was set 'true', the 
kudu-master/kudu-tserver would crash anyways.  Vice versa, if this flag wasn't 
set 'true' when the clock jumped, the aftermath would be having 
far-in-the-future timestamps in the tablets' WALs of this server and other 
tablet servers where the write operation was replicated, so they would crash 
while replaying WAL during tablet bootstrap.  So, there is 'forced' restart in 
both cases :)


http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@448
PS2, Line 448: if (abs(wall_delta_usec - mono_delta_usec) > 
wall_clock_jump_threshold_usec_)
> what about PREDICT_FALSE
I think is a good idea!  Done.



--
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Yuqi Du <[email protected]>
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Comment-Date: Wed, 08 Feb 2023 19:27:41 +0000
Gerrit-HasComments: Yes

Reply via email to