Alexey Serbin created KUDU-3521:
-----------------------------------

             Summary: Kudu servers sometimes crash when host clock is 
synchronized with PTP
                 Key: KUDU-3521
                 URL: https://issues.apache.org/jira/browse/KUDU-3521
             Project: Kudu
          Issue Type: Bug
            Reporter: Alexey Serbin


This issue has been reported on the [\#kudu-general Slack 
channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269].  A 
Kudu server of 1.16.0 version (not sure whether it was {{kudu-master}} or 
{{kudu-tserver}}, but it doesn't matter) crashed with the following error:

{noformat}
F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() unable 
to get current timestamp with error bound: Service unavailable: clock error 
estimate (18446744073709551615us) too high (clock considered synchronized by 
the kernel)
{noformat}

>From the analysis of the [code in 
>hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705],
> the only case it could happen is when {{t.maxerror}} turned to be a negative 
>number (e.g., -1) in [this 
>code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].

Negative values of the {{timex::maxerror}} field have never been seen when 
using ntpd or chronyd for clock synchronization, but it's necessary to update 
the code to adapt for such situations: apparently, PTP might set the 
{{maxerror}} field of the {{timex}} structure to a negative value and then call 
{{adjtimex()}}.  That's obvious from [the PTPd's 
code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
  The essence of the issue is using unsigned integers for clock error in the 
Kudu code, but {{timex.maxerror}} is a signed number, and at least PTPd sets it 
to a negative number when calling {{adjtimex()}}.  Also, nowhere in [the 
documentation for 
adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] it's 
stated that the {{maxerror}} field's value should be a non-negative number.

As a side note, there was [a prior attempt to address this 
issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was 
presented for the RCA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to