Nothing else comes to mind. Let me know if you think the solution of
allowing Kudu to ignore NTP status is preferable and I'll prioritize a
patch to unblock you guys.

-Todd

On Thu, Nov 17, 2016 at 11:43 AM, Jim Apple <[email protected]> wrote:

> This is a normal functional test. The node is gone now, but I will add
> "cat /var/log/syslog" to the logging. Anything else I should add?
>
> On Thu, Nov 17, 2016 at 10:55 AM, Todd Lipcon <[email protected]> wrote:
> > On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <[email protected]>
> wrote:
> >
> >> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and
> >> 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),".
> >>
> >
> > Do you see anything from ntp in /var/log/messages or syslog, etc? Usually
> > once NTP is synchronized it's fairly hard to knock it out of sync unless
> it
> > loses internet access. Is this a stress test or normal functional tests?
> >
> >
> >>
> >> On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <[email protected]>
> wrote:
> >> > On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <[email protected]>
> >> wrote:
> >> >
> >> >> I now have a Kudu master failing with "F1117 16:36:26.940562 113049
> >> >> hybrid_clock.cc:227] Couldn't get the current time: Clock
> >> >> unsynchronized. Status: Service unavailable: Error reading clock.
> >> >> Clock considered unsynchronized"
> >> >>
> >> >> Todd, when you say "we need to wait for sync before running tests or
> >> >> allow an unsafe flag to disable the check," do you mean "we" Impala,
> >> >> "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should
> >> >> follow? I don't see a likely candidate at
> >> >>
> >> >> https://issues.apache.org/jira/browse/KUDU-1202?jql=
> >> >> project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock%
> >> >> 20considered%20unsynchronized%5C%22%22
> >> >
> >> >
> >> > Sorry, I was on my phone so wasn't as precise as I should have been. I
> >> > think that either:
> >> >
> >> > (a) Impala could add a script which runs prior to tests which waits
> for
> >> > synchronization. Apparently there is a builtin "ntp-wait" command in
> some
> >> > distros. We use the following which also includes restarting NTP a few
> >> > times which seems to speed things up:
> >> > https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b97dfe48
> >> >
> >> > (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which you
> >> could
> >> > use in minicluster tests. In a minicluster where all the daemons run
> on
> >> one
> >> > node, it's sort of guaranteed that the clocks are in sync, so maybe
> it's
> >> > not too bad of an idea. The only risk I can see is that the clock
> might
> >> be
> >> > adjusted backwards if NTP is not in sync status yet, and we may have
> some
> >> > assertions which would be unhappy if that happened. We could give it a
> >> try,
> >> > though, if the "wait-for-ntp" approach doesn't help.
> >> >
> >> > -Todd
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >>
> >> >>
> >> >> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <[email protected]>
> wrote:
> >> >> > This particular error message isn't related to the clock error
> itself.
> >> >> It's
> >> >> > the sync status flag which indicates whether NTP thinks its phase
> lock
> >> >> loop
> >> >> > is in sync or not. So, changing the allowed error shouldn't have
> any
> >> >> > effect. Instead we need to wait for sync before running tests or
> >> allow an
> >> >> > unsafe flag to disable the check.
> >> >> >
> >> >> > On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> wrote:
> >> >> >
> >> >> >> Yeah, I don't doubt that the ntp output when you checked it was
> >> wrong,
> >> >> >> but it may be different from the time when Kudu failed to start,
> >> since
> >> >> >> that was probably at least several minutes later (assuming you
> had to
> >> >> >> find out about the failed job, ssh into the machine, discover the
> ntp
> >> >> >> warning and then check it at that time).
> >> >> >>
> >> >> >> I agree adding some logging at the beginning of jobs is the best
> >> thing
> >> >> >> to do now.
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <[email protected]
> >
> >> >> wrote:
> >> >> >> > I found a place where I suppose that could go
> >> >> >> > (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not
> sure
> >> >> >> > what I want to set it at. If "maximum error 197431 us" is to be
> >> >> >> > believed, it needs to be at least 0.2 seconds, and if
> >> >> >> > https://kudu.apache.org/docs/troubleshooting.html is to be
> >> believed,
> >> >> >> > it is already 10 seconds.
> >> >> >> >
> >> >> >> > I added some more logging to try and see what is going on, and
> if I
> >> >> >> > hit this again I'll re-open the thread. For now, I am not
> planning
> >> to
> >> >> >> > submit a patch to "fix" it because I'm not sure increasing the
> >> number
> >> >> >> > is the real solution.
> >> >> >> >
> >> >> >> > On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <
> [email protected]>
> >> >> >> wrote:
> >> >> >> >> According to the error message, it looks like we can specify
> the
> >> >> >> >> '--max_clock_sync_error_usec' flag when starting the Kudu
> >> processes.
> >> >> >> >> We may want to start by printing ntptime output at the
> beginning
> >> of
> >> >> >> >> jobs so we can see how far off it is. If it's off by days then
> >> maybe
> >> >> >> >> changing the error isn't a good idea, and we'll need to figure
> out
> >> >> >> >> something else.
> >> >> >> >>
> >> >> >> >> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <
> [email protected]
> >> >
> >> >> >> wrote:
> >> >> >> >>> How do we bump up the allowable error?
> >> >> >> >>>
> >> >> >> >>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs <
> >> [email protected]>
> >> >> >> wrote:
> >> >> >> >>>> I asked on the Kudu slack channel, they have seen issues
> where
> >> >> freshly
> >> >> >> >>>> provisioned ec2 nodes take some time for ntp to quiesce, but
> >> they
> >> >> >> >>>> didn't have a sense of how long that might take. If you
> checked
> >> >> >> >>>> ntptime after the job failed, it may be that ntp had enough
> >> time.
> >> >> We
> >> >> >> >>>> can probably consider bumping up the allowable error.
> >> >> >> >>>>
> >> >> >> >>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple <
> >> [email protected]>
> >> >> >> wrote:
> >> >> >> >>>>> This is the second time I have seen it, but it doesn't
> happen
> >> >> every
> >> >> >> >>>>> time. It could very well be a difference on ec2; already
> I've
> >> seen
> >> >> >> >>>>> some bugs due to my ec2 instances being Etc/UTC timezone
> while
> >> >> most
> >> >> >> >>>>> Impala developers work in America/Los_Angeles.
> >> >> >> >>>>>
> >> >> >> >>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs <
> >> [email protected]>
> >> >> >> wrote:
> >> >> >> >>>>>> No problem. If this happens again we should ask the Kudu
> >> >> >> developers. I
> >> >> >> >>>>>> haven't seen this before - I wonder if it could be some
> >> >> weirdness on
> >> >> >> >>>>>> ec2...
> >> >> >> >>>>>>
> >> >> >> >>>>>> Thanks
> >> >> >> >>>>>>
> >> >> >> >>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple <
> >> [email protected]
> >> >> >
> >> >> >> wrote:
> >> >> >> >>>>>>> Thank you for your help!
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> This was on an AWS machine that has expired, but I can see
> >> from
> >> >> the
> >> >> >> >>>>>>> logs that "IMPALA_KUDU_VERSION=88b023" and
> >> >> >> >>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading
> >> >> >> >>>>>>> kudu-python-0.3.0.tar.gz" and "URL
> >> >> >> >>>>>>> https://native-toolchain.s3.amazonaws.com/build/264-
> >> >> >> e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-
> >> >> >> package-ubuntu-14-04.tar.gz".
> >> >> >> >>>>>>> I'll add "ps aux | grep kudu" to the logging this machine
> >> does
> >> >> on
> >> >> >> >>>>>>> error, so we'll have it next time, but I did "ps -Afly" on
> >> exit
> >> >> and
> >> >> >> >>>>>>> there were no kudu processes running, it looks like.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs <
> >> >> [email protected]>
> >> >> >> wrote:
> >> >> >> >>>>>>>> Can you check which version of the client you're building
> >> >> against
> >> >> >> >>>>>>>> (KUDU_VERSION env var) vs what Kudu version is running
> (ps
> >> aux
> >> >> |
> >> >> >> grep
> >> >> >> >>>>>>>> kudu
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple <
> >> >> [email protected]>
> >> >> >> wrote:
> >> >> >> >>>>>>>>> Yes.
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs <
> >> >> [email protected]>
> >> >> >> wrote:
> >> >> >> >>>>>>>>>> Do you have NTP installed?
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple <
> >> >> >> [email protected]> wrote:
> >> >> >> >>>>>>>>>>> I have a machine where Kudu failed to start:
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> F1116 05:02:00.173629 71098 tablet_server_main.cc:64]
> >> Check
> >> >> >> failed:
> >> >> >> >>>>>>>>>>> _s.ok() Bad status: Service unavailable: Cannot
> >> initialize
> >> >> >> clock:
> >> >> >> >>>>>>>>>>> Error reading clock. Clock considered unsynchronized
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html
> says:
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> "For the master and tablet server daemons, the
> server’s
> >> >> clock
> >> >> >> must be
> >> >> >> >>>>>>>>>>> synchronized using NTP. In addition, the maximum clock
> >> error
> >> >> >> (not to
> >> >> >> >>>>>>>>>>> be mistaken with the estimated error) be below a
> >> >> configurable
> >> >> >> >>>>>>>>>>> threshold. The default value is 10 seconds, but it
> can be
> >> >> set
> >> >> >> with the
> >> >> >> >>>>>>>>>>> flag --max_clock_sync_error_usec."
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> and
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> "If NTP is installed the user can monitor the
> >> >> synchronization
> >> >> >> status
> >> >> >> >>>>>>>>>>> by running ntptime. The relevant value is what is
> >> reported
> >> >> for
> >> >> >> maximum
> >> >> >> >>>>>>>>>>> error."
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> ntptime reports:
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> ntp_gettime() returns code 0 (OK)
> >> >> >> >>>>>>>>>>>   time dbd66a6a.59bca948  Wed, Nov 16 2016
> 5:17:30.350,
> >> >> >> (.350535824),
> >> >> >> >>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us,
> TAI
> >> >> >> offset 0
> >> >> >> >>>>>>>>>>> ntp_adjtime() returns code 0 (OK)
> >> >> >> >>>>>>>>>>>   modes 0x0 (),
> >> >> >> >>>>>>>>>>>   offset 74989.459 us, frequency 19.950 ppm, interval
> 1
> >> s,
> >> >> >> >>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us,
> >> >> >> >>>>>>>>>>>   status 0x2001 (PLL,NANO),
> >> >> >> >>>>>>>>>>>   time constant 6, precision 0.001 us, tolerance 500
> ppm,
> >> >> >> >>>>>>>>>>>
> >> >> >> >>>>>>>>>>> So it looks like this error is anticipated, but the
> >> expected
> >> >> >> >>>>>>>>>>> conditions for it to occur are absent. Any ideas what
> >> could
> >> >> be
> >> >> >> going
> >> >> >> >>>>>>>>>>> on here? This is with a recent checkout of Impala
> master.
> >> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Todd Lipcon
> >> > Software Engineer, Cloudera
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Reply via email to