Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )
Change subject: [clock] add sanity check to detect wall clock jumps ...................................................................... Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10 PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver > That is a valuable information, Yuqi. +1 Yuqi, it would be great if could you share more information on what was the environment when this issue happened in your cluster. Was there anything specific that you think might be the culprit? As for Azure VMs, I'm suspecting (but it's not yet confirmed) that it somehow related to the activity of so-called VMICTimeSync service which updates VM's clock after no-downtime maintenance (a.k.a. memory preserving maintenance) [1] https://learn.microsoft.com/en-us/azure/virtual-machines/linux/time-sync --- snip --- Virtual machine interactions with the host can also affect the clock. During memory preserving maintenance, VMs are paused for up to 30 seconds. For example, before maintenance begins the VM clock shows 10:00:00 AM and lasts 28 seconds. After the VM resumes, the clock on the VM would still show 10:00:00 AM, which would be 28 seconds off. To correct for this, the VMICTimeSync service monitors what is happening on the host and updates the time-of-day clock in Linux VMs to compensate. --- snip --- [1] https://learn.microsoft.com/en-us/azure/virtual-machines/linux/time-sync http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc File src/kudu/clock/hybrid_clock.cc: http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@439 PS2, Line 439: if (is_wall_clock_jump_check_enabled_) { > I am not sure If it is better that use 'FLAGS_enable_wall_clock_jump_check' The idea in using constant member fields of the HybridClock class is to allow the compiler to optimize this code as much as possible since it is in the hot path of processing WriteRequestPB, etc. Supporting toggling this flag in run time doesn't bring a lot of benefits IMO. If the system clock jumped ahead when this flag was set 'true', the kudu-master/kudu-tserver would crash anyways. Vice versa, if this flag wasn't set 'true' when the clock jumped, the aftermath would be having far-in-the-future timestamps in the tablets' WALs of this server and other tablet servers where the write operation was replicated, so they would crash while replaying WAL during tablet bootstrap. So, there is 'forced' restart in both cases :) http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@448 PS2, Line 448: if (abs(wall_delta_usec - mono_delta_usec) > wall_clock_jump_threshold_usec_) > what about PREDICT_FALSE I think is a good idea! Done. -- To view, visit http://gerrit.cloudera.org:8080/19473 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275 Gerrit-Change-Number: 19473 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Reviewer: Yuqi Du <[email protected]> Gerrit-Reviewer: Zoltan Chovan <[email protected]> Gerrit-Comment-Date: Wed, 08 Feb 2023 19:27:41 +0000 Gerrit-HasComments: Yes
