This particular error message isn't related to the clock error itself. It's
the sync status flag which indicates whether NTP thinks its phase lock loop
is in sync or not. So, changing the allowed error shouldn't have any
effect. Instead we need to wait for sync before running tests or allow an
unsafe flag to disable the check.

On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> wrote:

> Yeah, I don't doubt that the ntp output when you checked it was wrong,
> but it may be different from the time when Kudu failed to start, since
> that was probably at least several minutes later (assuming you had to
> find out about the failed job, ssh into the machine, discover the ntp
> warning and then check it at that time).
>
> I agree adding some logging at the beginning of jobs is the best thing
> to do now.
>
> Thanks!
>
> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <[email protected]> wrote:
> > I found a place where I suppose that could go
> > (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not sure
> > what I want to set it at. If "maximum error 197431 us" is to be
> > believed, it needs to be at least 0.2 seconds, and if
> > https://kudu.apache.org/docs/troubleshooting.html is to be believed,
> > it is already 10 seconds.
> >
> > I added some more logging to try and see what is going on, and if I
> > hit this again I'll re-open the thread. For now, I am not planning to
> > submit a patch to "fix" it because I'm not sure increasing the number
> > is the real solution.
> >
> > On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <[email protected]>
> wrote:
> >> According to the error message, it looks like we can specify the
> >> '--max_clock_sync_error_usec' flag when starting the Kudu processes.
> >> We may want to start by printing ntptime output at the beginning of
> >> jobs so we can see how far off it is. If it's off by days then maybe
> >> changing the error isn't a good idea, and we'll need to figure out
> >> something else.
> >>
> >> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <[email protected]>
> wrote:
> >>> How do we bump up the allowable error?
> >>>
> >>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs <[email protected]>
> wrote:
> >>>> I asked on the Kudu slack channel, they have seen issues where freshly
> >>>> provisioned ec2 nodes take some time for ntp to quiesce, but they
> >>>> didn't have a sense of how long that might take. If you checked
> >>>> ntptime after the job failed, it may be that ntp had enough time. We
> >>>> can probably consider bumping up the allowable error.
> >>>>
> >>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple <[email protected]>
> wrote:
> >>>>> This is the second time I have seen it, but it doesn't happen every
> >>>>> time. It could very well be a difference on ec2; already I've seen
> >>>>> some bugs due to my ec2 instances being Etc/UTC timezone while most
> >>>>> Impala developers work in America/Los_Angeles.
> >>>>>
> >>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs <[email protected]>
> wrote:
> >>>>>> No problem. If this happens again we should ask the Kudu
> developers. I
> >>>>>> haven't seen this before - I wonder if it could be some weirdness on
> >>>>>> ec2...
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple <[email protected]>
> wrote:
> >>>>>>> Thank you for your help!
> >>>>>>>
> >>>>>>> This was on an AWS machine that has expired, but I can see from the
> >>>>>>> logs that "IMPALA_KUDU_VERSION=88b023" and
> >>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading
> >>>>>>> kudu-python-0.3.0.tar.gz" and "URL
> >>>>>>> https://native-toolchain.s3.amazonaws.com/build/264-
> e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-
> package-ubuntu-14-04.tar.gz".
> >>>>>>> I'll add "ps aux | grep kudu" to the logging this machine does on
> >>>>>>> error, so we'll have it next time, but I did "ps -Afly" on exit and
> >>>>>>> there were no kudu processes running, it looks like.
> >>>>>>>
> >>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs <[email protected]>
> wrote:
> >>>>>>>> Can you check which version of the client you're building against
> >>>>>>>> (KUDU_VERSION env var) vs what Kudu version is running (ps aux |
> grep
> >>>>>>>> kudu
> >>>>>>>>
> >>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple <[email protected]>
> wrote:
> >>>>>>>>> Yes.
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs <[email protected]>
> wrote:
> >>>>>>>>>> Do you have NTP installed?
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple <
> [email protected]> wrote:
> >>>>>>>>>>> I have a machine where Kudu failed to start:
> >>>>>>>>>>>
> >>>>>>>>>>> F1116 05:02:00.173629 71098 tablet_server_main.cc:64] Check
> failed:
> >>>>>>>>>>> _s.ok() Bad status: Service unavailable: Cannot initialize
> clock:
> >>>>>>>>>>> Error reading clock. Clock considered unsynchronized
> >>>>>>>>>>>
> >>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html says:
> >>>>>>>>>>>
> >>>>>>>>>>> "For the master and tablet server daemons, the server’s clock
> must be
> >>>>>>>>>>> synchronized using NTP. In addition, the maximum clock error
> (not to
> >>>>>>>>>>> be mistaken with the estimated error) be below a configurable
> >>>>>>>>>>> threshold. The default value is 10 seconds, but it can be set
> with the
> >>>>>>>>>>> flag --max_clock_sync_error_usec."
> >>>>>>>>>>>
> >>>>>>>>>>> and
> >>>>>>>>>>>
> >>>>>>>>>>> "If NTP is installed the user can monitor the synchronization
> status
> >>>>>>>>>>> by running ntptime. The relevant value is what is reported for
> maximum
> >>>>>>>>>>> error."
> >>>>>>>>>>>
> >>>>>>>>>>> ntptime reports:
> >>>>>>>>>>>
> >>>>>>>>>>> ntp_gettime() returns code 0 (OK)
> >>>>>>>>>>>   time dbd66a6a.59bca948  Wed, Nov 16 2016  5:17:30.350,
> (.350535824),
> >>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us, TAI
> offset 0
> >>>>>>>>>>> ntp_adjtime() returns code 0 (OK)
> >>>>>>>>>>>   modes 0x0 (),
> >>>>>>>>>>>   offset 74989.459 us, frequency 19.950 ppm, interval 1 s,
> >>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us,
> >>>>>>>>>>>   status 0x2001 (PLL,NANO),
> >>>>>>>>>>>   time constant 6, precision 0.001 us, tolerance 500 ppm,
> >>>>>>>>>>>
> >>>>>>>>>>> So it looks like this error is anticipated, but the expected
> >>>>>>>>>>> conditions for it to occur are absent. Any ideas what could be
> going
> >>>>>>>>>>> on here? This is with a recent checkout of Impala master.
>

Reply via email to