That didn't do it: http://35.164.73.121:8080/job/ubuntu-14.04-from-scratch/184/consoleFull
The setup of the machine (which runs before the job starts and is not visible at that page) includes:

+ sudo sed -i s/ubuntu\.pool/amazon\.pool/ /etc/ntp.conf
+ grep amazon /etc/ntp.conf
server 0.amazon.pool.ntp.org
server 1.amazon.pool.ntp.org
server 2.amazon.pool.ntp.org
server 3.amazon.pool.ntp.org
+ grep ubuntu /etc/ntp.conf
server ntp.ubuntu.com

As before, it looks like ntptime is reporting a max error of 16 seconds. I'm going to try to add ntp-wait to our cluster start scripts to make sure that ntp is in a good state the moment before we try to start Kudu.
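The shape I have in mind is roughly the following, at the top of the start script. This assumes the ntp-wait that ships with the ntp package (flag spellings vary a bit across versions), and the last line is a placeholder, not our real start command:

  # Log the NTP state so post-mortems don't depend on ssh-ing in later.
  ntptime || true
  # Block until ntpd reports a synchronized state: check every 10 seconds,
  # giving up after 60 tries (10 minutes).
  if ! ntp-wait -v -s 10 -n 60; then
    echo "ntpd never reached a synchronized state; aborting" >&2
    exit 1
  fi
  ./start-cluster.sh  # placeholder for whatever actually launches the minicluster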
On Sat, Nov 26, 2016 at 1:51 PM, Jim Apple <[email protected]> wrote:
> OK, I changed those lines in /etc/ntp.conf; let's see how that goes.
>
> On Mon, Nov 21, 2016 at 7:19 AM, Todd Lipcon <[email protected]> wrote:
>> On Sat, Nov 19, 2016 at 10:50 AM, Jim Apple <[email protected]> wrote:
>>> Can you do so?
>>>
>>> This job used ntp-wait before starting and has syslog sent to the output; perhaps that will help debug:
>>>
>>> http://ec2-35-161-220-160.us-west-2.compute.amazonaws.com:8080/job/ubuntu-14.04-from-scratch/147/
>>
>> Yea, it looks from the last ntptime output like it lost its NTP synchronization. Looking at the startup log, it appears that it's just configured with the default Ubuntu NTP servers rather than using multiple hosts inside EC2. Is it possible for you to adjust your AMI or change the startup script to use the following lines in /etc/ntp.conf?
>>
>> server 0.amazon.pool.ntp.org
>> server 1.amazon.pool.ntp.org
>> server 2.amazon.pool.ntp.org
>> server 3.amazon.pool.ntp.org
>>
>> We could add a flag to ignore NTP sync, but it looks like once NTP goes out of sync it's also reporting a maxerror of 16 seconds, which would also cause an issue.
>>
>> -Todd
>>
>>> In the meantime, it seems we could use a workaround in order to get this working in ec2 more consistently.
>>>
>>> On Thu, Nov 17, 2016 at 12:21 PM, Todd Lipcon <[email protected]> wrote:
>>>> Nothing else comes to mind. Let me know if you think the solution of allowing Kudu to ignore NTP status is preferable and I'll prioritize a patch to unblock you guys.
>>>>
>>>> -Todd
>>>>
>>>> On Thu, Nov 17, 2016 at 11:43 AM, Jim Apple <[email protected]> wrote:
>>>>> This is a normal functional test. The node is gone now, but I will add "cat /var/log/syslog" to the logging. Anything else I should add?
>>>>>
>>>>> On Thu, Nov 17, 2016 at 10:55 AM, Todd Lipcon <[email protected]> wrote:
>>>>>> On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <[email protected]> wrote:
>>>>>>> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),".
>>>>>>
>>>>>> Do you see anything from ntp in /var/log/messages or syslog, etc? Usually once NTP is synchronized it's fairly hard to knock it out of sync unless it loses internet access. Is this a stress test or normal functional tests?
>>>>>>
>>>>>>> On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <[email protected]> wrote:
>>>>>>>> On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>> I now have a Kudu master failing with "F1117 16:36:26.940562 113049 hybrid_clock.cc:227] Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error reading clock. Clock considered unsynchronized"
>>>>>>>>>
>>>>>>>>> Todd, when you say "we need to wait for sync before running tests or allow an unsafe flag to disable the check," do you mean "we" Impala, "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should follow? I don't see a likely candidate at
>>>>>>>>> https://issues.apache.org/jira/browse/KUDU-1202?jql=project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock%20considered%20unsynchronized%5C%22%22
>>>>>>>>
>>>>>>>> Sorry, I was on my phone so wasn't as precise as I should have been. I think that either:
>>>>>>>>
>>>>>>>> (a) Impala could add a script which runs prior to tests and waits for synchronization. Apparently there is a builtin "ntp-wait" command in some distros. We use the following, which also restarts NTP a few times; that seems to speed things up:
>>>>>>>> https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b97dfe48
>>>>>>>>
>>>>>>>> (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which you could use in minicluster tests. In a minicluster where all the daemons run on one node, it's sort of guaranteed that the clocks are in sync, so maybe it's not too bad of an idea. The only risk I can see is that the clock might be adjusted backwards if NTP is not in sync status yet, and we may have some assertions which would be unhappy if that happened. We could give it a try, though, if the "wait-for-ntp" approach doesn't help.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>>> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <[email protected]> wrote:
>>>>>>>>>> This particular error message isn't related to the clock error itself. It's the sync status flag, which indicates whether NTP thinks its phase lock loop is in sync or not. So, changing the allowed error shouldn't have any effect. Instead we need to wait for sync before running tests or allow an unsafe flag to disable the check.
>>>>>>>>>>
>>>>>>>>>> On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> wrote:
>>>>>>>>>>> Yeah, I don't doubt that the ntp output when you checked it was wrong, but it may be different from the time when Kudu failed to start, since that was probably at least several minutes later (assuming you had to find out about the failed job, ssh into the machine, discover the ntp warning, and then check it at that time).
>>>>>>>>>>>
>>>>>>>>>>> I agree adding some logging at the beginning of jobs is the best thing to do now.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
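As an aside on option (a) above: Todd's gist boils down to roughly the loop below. This is my paraphrase, not the gist verbatim, and it assumes Ubuntu-style service management plus the amazon pool servers discussed at the top of this thread:

  # Restarting ntpd and stepping the clock with ntpdate tends to converge
  # much faster than letting a freshly started daemon slew gradually.
  for attempt in 1 2 3; do
    sudo service ntp stop
    sudo ntpdate 0.amazon.pool.ntp.org 1.amazon.pool.ntp.org || true
    sudo service ntp start
    # Give ntpd up to two minutes to report sync before retrying.
    if ntp-wait -s 5 -n 24; then
      break
    fi
  done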
>>>>>>>>>>> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>> I found a place where I suppose that could go (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not sure what I want to set it at. If "maximum error 197431 us" is to be believed, it needs to be at least 0.2 seconds, and if https://kudu.apache.org/docs/troubleshooting.html is to be believed, it is already 10 seconds.
>>>>>>>>>>>>
>>>>>>>>>>>> I added some more logging to try to see what is going on, and if I hit this again I'll re-open the thread. For now, I am not planning to submit a patch to "fix" it, because I'm not sure increasing the number is the real solution.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>> According to the error message, it looks like we can specify the '--max_clock_sync_error_usec' flag when starting the Kudu processes. We may want to start by printing ntptime output at the beginning of jobs so we can see how far off it is. If it's off by days then maybe changing the error isn't a good idea, and we'll need to figure out something else.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>> How do we bump up the allowable error?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>> I asked on the Kudu slack channel; they have seen issues where freshly provisioned ec2 nodes take some time for ntp to quiesce, but they didn't have a sense of how long that might take. If you checked ntptime after the job failed, it may be that ntp had enough time. We can probably consider bumping up the allowable error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>> This is the second time I have seen it, but it doesn't happen every time. It could very well be a difference on ec2; already I've seen some bugs due to my ec2 instances being in the Etc/UTC timezone while most Impala developers work in America/Los_Angeles.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>> No problem. If this happens again we should ask the Kudu developers. I haven't seen this before - I wonder if it could be some weirdness on ec2...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
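For the record, if we ever do raise the threshold, it should just be one gflags line in the node template directory mentioned above; something like the following, where the 20-second value and the file name are my guesses rather than settled choices:

  # testdata/cluster/node_templates/cdh5/etc/kudu/master.conf (file name is a guess)
  --max_clock_sync_error_usec=20000000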
>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> Thank you for your help!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This was on an AWS machine that has expired, but I can see from the logs that "IMPALA_KUDU_VERSION=88b023" and "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading kudu-python-0.3.0.tar.gz" and "URL https://native-toolchain.s3.amazonaws.com/build/264-e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-package-ubuntu-14-04.tar.gz". I'll add "ps aux | grep kudu" to the logging this machine does on error, so we'll have it next time, but I did "ps -Afly" on exit and there were no kudu processes running, it looks like.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> Can you check which version of the client you're building against (KUDU_VERSION env var) vs what Kudu version is running (ps aux | grep kudu)?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> Do you have NTP installed?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> I have a machine where Kudu failed to start:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> F1116 05:02:00.173629 71098 tablet_server_main.cc:64] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html says:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "For the master and tablet server daemons, the server’s clock must be synchronized using NTP. In addition, the maximum clock error (not to be mistaken with the estimated error) must be below a configurable threshold. The default value is 10 seconds, but it can be set with the flag --max_clock_sync_error_usec."
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "If NTP is installed, the user can monitor the synchronization status by running ntptime. The relevant value is what is reported for maximum error."
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ntptime reports:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ntp_gettime() returns code 0 (OK)
>>>>>>>>>>>>>>>>>>>>>>   time dbd66a6a.59bca948  Wed, Nov 16 2016  5:17:30.350, (.350535824),
>>>>>>>>>>>>>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us, TAI offset 0
>>>>>>>>>>>>>>>>>>>>>> ntp_adjtime() returns code 0 (OK)
>>>>>>>>>>>>>>>>>>>>>>   modes 0x0 (),
>>>>>>>>>>>>>>>>>>>>>>   offset 74989.459 us, frequency 19.950 ppm, interval 1 s,
>>>>>>>>>>>>>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us,
>>>>>>>>>>>>>>>>>>>>>>   status 0x2001 (PLL,NANO),
>>>>>>>>>>>>>>>>>>>>>>   time constant 6, precision 0.001 us, tolerance 500 ppm,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So it looks like this error is anticipated, but the expected conditions for it to occur are absent. Any ideas what could be going on here? This is with a recent checkout of Impala master.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Todd Lipcon
>>>>>>>> Software Engineer, Cloudera
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
