On Sat, Nov 19, 2016 at 10:50 AM, Jim Apple <[email protected]> wrote:
> Can you do so?
>
> This job used ntpwait before starting and has syslog sent to the
> output; perhaps that will help debug:
>
> http://ec2-35-161-220-160.us-west-2.compute.amazonaws.com:8080/job/ubuntu-14.04-from-scratch/147/

Yea, looks from the last ntptime output that it lost its NTP
synchronization. Looking at the startup log it appears that it's just
configured with the default Ubuntu NTP servers rather than using multiple
hosts inside EC2. Is it possible for you to adjust your AMI or change the
startup script to use the following lines in /etc/ntp.conf?

server 0.amazon.pool.ntp.org
server 1.amazon.pool.ntp.org
server 2.amazon.pool.ntp.org
server 3.amazon.pool.ntp.org

We could add a flag to ignore NTP sync, but it looks like once NTP goes
out of sync it's also reporting a maxerror of 16 seconds, which would
also cause an issue.

-Todd
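
(For context, a minimal sketch of what that startup-script change could look like on an Ubuntu 14.04 AMI. Only the four "server" lines come from the message above; the sed edit, tee append, and service restart are assumptions about how the image is set up.)

    #!/bin/bash
    # Point ntpd at the EC2-local Amazon pool servers instead of the Ubuntu defaults.
    set -euo pipefail
    # Comment out the stock "server" lines (assumes the default Ubuntu ntp.conf layout).
    sudo sed -i 's/^server /#server /' /etc/ntp.conf
    # Append the servers suggested above.
    for i in 0 1 2 3; do
      echo "server ${i}.amazon.pool.ntp.org" | sudo tee -a /etc/ntp.conf > /dev/null
    done
    # Restart ntpd so the new servers take effect (service name assumed for Ubuntu 14.04).
    sudo service ntp restart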
> In the meantime, it seems we could use a workaround in order to get this working in ec2 more consistently.
>
> On Thu, Nov 17, 2016 at 12:21 PM, Todd Lipcon <[email protected]> wrote:
>> Nothing else comes to mind. Let me know if you think the solution of allowing Kudu to ignore NTP status is preferable and I'll prioritize a patch to unblock you guys.
>>
>> -Todd
>>
>> On Thu, Nov 17, 2016 at 11:43 AM, Jim Apple <[email protected]> wrote:
>>> This is a normal functional test. The node is gone now, but I will add "cat /var/log/syslog" to the logging. Anything else I should add?
>>>
>>> On Thu, Nov 17, 2016 at 10:55 AM, Todd Lipcon <[email protected]> wrote:
>>>> On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <[email protected]> wrote:
>>>>> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),".
>>>>
>>>> Do you see anything from ntp in /var/log/messages or syslog, etc? Usually once NTP is synchronized it's fairly hard to knock it out of sync unless it loses internet access. Is this a stress test or normal functional tests?
>>>>
>>>>> On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <[email protected]> wrote:
>>>>>> On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <[email protected]> wrote:
>>>>>>> I now have a Kudu master failing with "F1117 16:36:26.940562 113049 hybrid_clock.cc:227] Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error reading clock. Clock considered unsynchronized"
>>>>>>>
>>>>>>> Todd, when you say "we need to wait for sync before running tests or allow an unsafe flag to disable the check," do you mean "we" Impala, "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should follow? I don't see a likely candidate at
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/KUDU-1202?jql=project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock%20considered%20unsynchronized%5C%22%22
>>>>>>
>>>>>> Sorry, I was on my phone so wasn't as precise as I should have been. I think that either:
>>>>>>
>>>>>> (a) Impala could add a script which runs prior to tests which waits for synchronization. Apparently there is a builtin "ntp-wait" command in some distros. We use the following which also includes restarting NTP a few times which seems to speed things up: https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b97dfe48
>>>>>>
>>>>>> (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which you could use in minicluster tests. In a minicluster where all the daemons run on one node, it's sort of guaranteed that the clocks are in sync, so maybe it's not too bad of an idea. The only risk I can see is that the clock might be adjusted backwards if NTP is not in sync status yet, and we may have some assertions which would be unhappy if that happened. We could give it a try, though, if the "wait-for-ntp" approach doesn't help.
>>>>>>
>>>>>> -Todd
>>>>>>>
>>>>>>> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <[email protected]> wrote:
>>>>>>>> This particular error message isn't related to the clock error itself. It's the sync status flag which indicates whether NTP thinks its phase lock loop is in sync or not. So, changing the allowed error shouldn't have any effect. Instead we need to wait for sync before running tests or allow an unsafe flag to disable the check.
>>>>>>>>
>>>>>>>> On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> wrote:
>>>>>>>>> Yeah, I don't doubt that the ntp output when you checked it was wrong, but it may be different from the time when Kudu failed to start, since that was probably at least several minutes later (assuming you had to find out about the failed job, ssh into the machine, discover the ntp warning and then check it at that time).
>>>>>>>>>
>>>>>>>>> I agree adding some logging at the beginning of jobs is the best thing to do now.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>> I found a place where I suppose that could go (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not sure what I want to set it at. If "maximum error 197431 us" is to be believed, it needs to be at least 0.2 seconds, and if https://kudu.apache.org/docs/troubleshooting.html is to be believed, it is already 10 seconds.
>>>>>>>>>>
>>>>>>>>>> I added some more logging to try and see what is going on, and if I hit this again I'll re-open the thread. For now, I am not planning to submit a patch to "fix" it because I'm not sure increasing the number is the real solution.
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>> According to the error message, it looks like we can specify the '--max_clock_sync_error_usec' flag when starting the Kudu processes. We may want to start by printing ntptime output at the beginning of jobs so we can see how far off it is. If it's off by days then maybe changing the error isn't a good idea, and we'll need to figure out something else.
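
(For context, the job-start logging suggested above could look roughly like the following. ntptime and its "maximum error" field appear elsewhere in this thread; the grep/awk extraction is an assumption.)

    #!/bin/bash
    # Record NTP state at the start of the job so a later Kudu clock failure
    # can be correlated with what NTP reported at the time.
    date -u
    ntptime || true
    # Pull the reported "maximum error" (in microseconds) into a one-line summary.
    max_err_us=$(ntptime | grep -o 'maximum error [0-9]* us' | head -n1 | awk '{print $3}')
    echo "NTP maximum error at job start: ${max_err_us:-unknown} us"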
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>> How do we bump up the allowable error?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>> I asked on the Kudu slack channel, they have seen issues where freshly provisioned ec2 nodes take some time for ntp to quiesce, but they didn't have a sense of how long that might take. If you checked ntptime after the job failed, it may be that ntp had enough time. We can probably consider bumping up the allowable error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>> This is the second time I have seen it, but it doesn't happen every time. It could very well be a difference on ec2; already I've seen some bugs due to my ec2 instances being Etc/UTC timezone while most Impala developers work in America/Los_Angeles.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>> No problem. If this happens again we should ask the Kudu developers. I haven't seen this before - I wonder if it could be some weirdness on ec2...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>> Thank you for your help!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This was on an AWS machine that has expired, but I can see from the logs that "IMPALA_KUDU_VERSION=88b023" and "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading kudu-python-0.3.0.tar.gz" and "URL https://native-toolchain.s3.amazonaws.com/build/264-e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-package-ubuntu-14-04.tar.gz". I'll add "ps aux | grep kudu" to the logging this machine does on error, so we'll have it next time, but I did "ps -Afly" on exit and there were no kudu processes running, it looks like.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>> Can you check which version of the client you're building against (KUDU_VERSION env var) vs what Kudu version is running (ps aux | grep kudu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> Yes.
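
(For context, the on-error logging and version checks mentioned above, "ps aux | grep kudu", the KUDU_VERSION-style environment variables, and ntptime, could be bundled into one helper roughly like this. The grouping and any commands beyond those named in the thread are assumptions.)

    #!/bin/bash
    # Dump the diagnostics discussed above when a job fails.
    echo "=== Kudu-related environment variables ==="
    env | grep -i kudu || true
    echo "=== Running Kudu processes ==="
    ps aux | grep '[k]udu' || true   # the [k] keeps grep from matching itself
    echo "=== NTP status ==="
    ntptime || true
    echo "=== Recent syslog (including ntpd messages) ==="
    tail -n 100 /var/log/syslog || true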
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> Do you have NTP installed?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> I have a machine where Kudu failed to start:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> F1116 05:02:00.173629 71098 tablet_server_main.cc:64] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html says:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> "For the master and tablet server daemons, the server’s clock must be synchronized using NTP. In addition, the maximum clock error (not to be mistaken with the estimated error) be below a configurable threshold. The default value is 10 seconds, but it can be set with the flag --max_clock_sync_error_usec."
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> "If NTP is installed the user can monitor the synchronization status by running ntptime. The relevant value is what is reported for maximum error."
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ntptime reports:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ntp_gettime() returns code 0 (OK)
>>>>>>>>>>>>>>>>>>>>   time dbd66a6a.59bca948  Wed, Nov 16 2016  5:17:30.350, (.350535824),
>>>>>>>>>>>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us, TAI offset 0
>>>>>>>>>>>>>>>>>>>> ntp_adjtime() returns code 0 (OK)
>>>>>>>>>>>>>>>>>>>>   modes 0x0 (),
>>>>>>>>>>>>>>>>>>>>   offset 74989.459 us, frequency 19.950 ppm, interval 1 s,
>>>>>>>>>>>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us,
>>>>>>>>>>>>>>>>>>>>   status 0x2001 (PLL,NANO),
>>>>>>>>>>>>>>>>>>>>   time constant 6, precision 0.001 us, tolerance 500 ppm,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So it looks like this error is anticipated, but the expected conditions for it to occur are absent. Any ideas what could be going on here? This is with a recent checkout of Impala master.
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
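
(For context, a rough sketch of the wait-for-sync approach discussed in this thread: restart NTP a few times and wait on ntp-wait before starting Kudu. The retry counts and ntp-wait options here are assumptions, not the contents of the gist Todd linked.)

    #!/bin/bash
    # Wait for ntpd to report synchronization before starting Kudu daemons,
    # restarting the service a few times if sync is slow to arrive.
    for attempt in 1 2 3; do
      sudo service ntp restart
      # ntp-wait polls ntpd; here: up to 60 tries, 5 seconds apart, verbose.
      if ntp-wait -n 60 -s 5 -v; then
        echo "NTP is synchronized:"
        ntptime
        exit 0
      fi
      echo "NTP still unsynchronized after attempt ${attempt}, retrying..."
    done
    echo "Giving up waiting for NTP synchronization" >&2
    exit 1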
