The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and
15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),".

On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <[email protected]> wrote:
> On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <[email protected]> wrote:
>
>> I now have a Kudu master failing with "F1117 16:36:26.940562 113049
>> hybrid_clock.cc:227] Couldn't get the current time: Clock
>> unsynchronized. Status: Service unavailable: Error reading clock.
>> Clock considered unsynchronized"
>>
>> Todd, when you say "we need to wait for sync before running tests or
>> allow an unsafe flag to disable the check," do you mean "we" Impala,
>> "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should
>> follow? I don't see a likely candidate at
>>
>> https://issues.apache.org/jira/browse/KUDU-1202?jql=
>> project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock%
>> 20considered%20unsynchronized%5C%22%22
>
>
> Sorry, I was on my phone so wasn't as precise as I should have been. I
> think that either:
>
> (a) Impala could add a script which runs prior to tests which waits for
> synchronization. Apparently there is a builtin "ntp-wait" command in some
> distros. We use the following which also includes restarting NTP a few
> times which seems to speed things up:
> https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b97dfe48
>
> (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which you could
> use in minicluster tests. In a minicluster where all the daemons run on one
> node, it's sort of guaranteed that the clocks are in sync, so maybe it's
> not too bad of an idea. The only risk I can see is that the clock might be
> adjusted backwards if NTP is not in sync status yet, and we may have some
> assertions which would be unhappy if that happened. We could give it a try,
> though, if the "wait-for-ntp" approach doesn't help.
>
> -Todd
>
>
>
>
>
>
>>
>>
>> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <[email protected]> wrote:
>> > This particular error message isn't related to the clock error itself.
>> It's
>> > the sync status flag which indicates whether NTP thinks its phase lock
>> loop
>> > is in sync or not. So, changing the allowed error shouldn't have any
>> > effect. Instead we need to wait for sync before running tests or allow an
>> > unsafe flag to disable the check.
>> >
>> > On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> wrote:
>> >
>> >> Yeah, I don't doubt that the ntp output when you checked it was wrong,
>> >> but it may be different from the time when Kudu failed to start, since
>> >> that was probably at least several minutes later (assuming you had to
>> >> find out about the failed job, ssh into the machine, discover the ntp
>> >> warning and then check it at that time).
>> >>
>> >> I agree adding some logging at the beginning of jobs is the best thing
>> >> to do now.
>> >>
>> >> Thanks!
>> >>
>> >> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <[email protected]>
>> wrote:
>> >> > I found a place where I suppose that could go
>> >> > (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not sure
>> >> > what I want to set it at. If "maximum error 197431 us" is to be
>> >> > believed, it needs to be at least 0.2 seconds, and if
>> >> > https://kudu.apache.org/docs/troubleshooting.html is to be believed,
>> >> > it is already 10 seconds.
>> >> >
>> >> > I added some more logging to try and see what is going on, and if I
>> >> > hit this again I'll re-open the thread. For now, I am not planning to
>> >> > submit a patch to "fix" it because I'm not sure increasing the number
>> >> > is the real solution.
>> >> >
>> >> > On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <[email protected]>
>> >> wrote:
>> >> >> According to the error message, it looks like we can specify the
>> >> >> '--max_clock_sync_error_usec' flag when starting the Kudu processes.
>> >> >> We may want to start by printing ntptime output at the beginning of
>> >> >> jobs so we can see how far off it is. If it's off by days then maybe
>> >> >> changing the error isn't a good idea, and we'll need to figure out
>> >> >> something else.
>> >> >>
>> >> >> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <[email protected]>
>> >> wrote:
>> >> >>> How do we bump up the allowable error?
>> >> >>>
>> >> >>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs <[email protected]>
>> >> wrote:
>> >> >>>> I asked on the Kudu slack channel, they have seen issues where
>> freshly
>> >> >>>> provisioned ec2 nodes take some time for ntp to quiesce, but they
>> >> >>>> didn't have a sense of how long that might take. If you checked
>> >> >>>> ntptime after the job failed, it may be that ntp had enough time.
>> We
>> >> >>>> can probably consider bumping up the allowable error.
>> >> >>>>
>> >> >>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple <[email protected]>
>> >> wrote:
>> >> >>>>> This is the second time I have seen it, but it doesn't happen
>> every
>> >> >>>>> time. It could very well be a difference on ec2; already I've seen
>> >> >>>>> some bugs due to my ec2 instances being Etc/UTC timezone while
>> most
>> >> >>>>> Impala developers work in America/Los_Angeles.
>> >> >>>>>
>> >> >>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs <[email protected]>
>> >> wrote:
>> >> >>>>>> No problem. If this happens again we should ask the Kudu
>> >> developers. I
>> >> >>>>>> haven't seen this before - I wonder if it could be some
>> weirdness on
>> >> >>>>>> ec2...
>> >> >>>>>>
>> >> >>>>>> Thanks
>> >> >>>>>>
>> >> >>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple <[email protected]
>> >
>> >> wrote:
>> >> >>>>>>> Thank you for your help!
>> >> >>>>>>>
>> >> >>>>>>> This was on an AWS machine that has expired, but I can see from
>> the
>> >> >>>>>>> logs that "IMPALA_KUDU_VERSION=88b023" and
>> >> >>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading
>> >> >>>>>>> kudu-python-0.3.0.tar.gz" and "URL
>> >> >>>>>>> https://native-toolchain.s3.amazonaws.com/build/264-
>> >> e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-
>> >> package-ubuntu-14-04.tar.gz".
>> >> >>>>>>> I'll add "ps aux | grep kudu" to the logging this machine does
>> on
>> >> >>>>>>> error, so we'll have it next time, but I did "ps -Afly" on exit
>> and
>> >> >>>>>>> there were no kudu processes running, it looks like.
>> >> >>>>>>>
>> >> >>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs <
>> [email protected]>
>> >> wrote:
>> >> >>>>>>>> Can you check which version of the client you're building
>> against
>> >> >>>>>>>> (KUDU_VERSION env var) vs what Kudu version is running (ps aux
>> |
>> >> grep
>> >> >>>>>>>> kudu
>> >> >>>>>>>>
>> >> >>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple <
>> [email protected]>
>> >> wrote:
>> >> >>>>>>>>> Yes.
>> >> >>>>>>>>>
>> >> >>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs <
>> [email protected]>
>> >> wrote:
>> >> >>>>>>>>>> Do you have NTP installed?
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple <
>> >> [email protected]> wrote:
>> >> >>>>>>>>>>> I have a machine where Kudu failed to start:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> F1116 05:02:00.173629 71098 tablet_server_main.cc:64] Check
>> >> failed:
>> >> >>>>>>>>>>> _s.ok() Bad status: Service unavailable: Cannot initialize
>> >> clock:
>> >> >>>>>>>>>>> Error reading clock. Clock considered unsynchronized
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html says:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> "For the master and tablet server daemons, the server’s
>> clock
>> >> must be
>> >> >>>>>>>>>>> synchronized using NTP. In addition, the maximum clock error
>> >> (not to
>> >> >>>>>>>>>>> be mistaken with the estimated error) be below a
>> configurable
>> >> >>>>>>>>>>> threshold. The default value is 10 seconds, but it can be
>> set
>> >> with the
>> >> >>>>>>>>>>> flag --max_clock_sync_error_usec."
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> and
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> "If NTP is installed the user can monitor the
>> synchronization
>> >> status
>> >> >>>>>>>>>>> by running ntptime. The relevant value is what is reported
>> for
>> >> maximum
>> >> >>>>>>>>>>> error."
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> ntptime reports:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> ntp_gettime() returns code 0 (OK)
>> >> >>>>>>>>>>>   time dbd66a6a.59bca948  Wed, Nov 16 2016  5:17:30.350,
>> >> (.350535824),
>> >> >>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us, TAI
>> >> offset 0
>> >> >>>>>>>>>>> ntp_adjtime() returns code 0 (OK)
>> >> >>>>>>>>>>>   modes 0x0 (),
>> >> >>>>>>>>>>>   offset 74989.459 us, frequency 19.950 ppm, interval 1 s,
>> >> >>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us,
>> >> >>>>>>>>>>>   status 0x2001 (PLL,NANO),
>> >> >>>>>>>>>>>   time constant 6, precision 0.001 us, tolerance 500 ppm,
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> So it looks like this error is anticipated, but the expected
>> >> >>>>>>>>>>> conditions for it to occur are absent. Any ideas what could
>> be
>> >> going
>> >> >>>>>>>>>>> on here? This is with a recent checkout of Impala master.
>> >>
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

Reply via email to