OK, I changed those lines in /etc/ntp.conf; let's see how that goes.
On Mon, Nov 21, 2016 at 7:19 AM, Todd Lipcon <[email protected]> wrote: > On Sat, Nov 19, 2016 at 10:50 AM, Jim Apple <[email protected]> wrote: > >> Can you do so? >> >> This job used ntpwait before starting and has syslog sent to the >> output; perhaps that will help debug: >> >> http://ec2-35-161-220-160.us-west-2.compute.amazonaws.com:80 >> 80/job/ubuntu-14.04-from-scratch/147/ >> >> > Yea, looks from the last ntptime output that it lost its NTP > synchronization. Looking at the startup log it appears that it's just > configured with the default Ubuntu NTP servers rather than using multiple > hosts inside EC2. Is it possible for you to adjust your AMI or change the > startup script to use the following lines in /etc/ntp.conf? > > server 0.amazon.pool.ntp.org > server 1.amazon.pool.ntp.org > server 2.amazon.pool.ntp.org > server 3.amazon.pool.ntp.org > > We could add a flag to ignore NTP sync, but it looks like once NTP goes out > of sync it's also reporting a maxerror of 16 seconds, which would also > cause an issue. > > -Todd > > >> In the meantime, it seems we could use a workaround in order to get >> this working in ec2 more consistently. >> >> On Thu, Nov 17, 2016 at 12:21 PM, Todd Lipcon <[email protected]> wrote: >> > Nothing else comes to mind. Let me know if you think the solution of >> > allowing Kudu to ignore NTP status is preferable and I'll prioritize a >> > patch to unblock you guys. >> > >> > -Todd >> > >> > On Thu, Nov 17, 2016 at 11:43 AM, Jim Apple <[email protected]> >> wrote: >> > >> >> This is a normal functional test. The node is gone now, but I will add >> >> "cat /var/log/syslog" to the logging. Anything else I should add? >> >> >> >> On Thu, Nov 17, 2016 at 10:55 AM, Todd Lipcon <[email protected]> >> wrote: >> >> > On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <[email protected]> >> >> wrote: >> >> > >> >> >> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and >> >> >> 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),". >> >> >> >> >> > >> >> > Do you see anything from ntp in /var/log/messages or syslog, etc? >> Usually >> >> > once NTP is synchronized it's fairly hard to knock it out of sync >> unless >> >> it >> >> > loses internet access. Is this a stress test or normal functional >> tests? >> >> > >> >> > >> >> >> >> >> >> On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <[email protected]> >> >> wrote: >> >> >> > On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <[email protected]> >> >> >> wrote: >> >> >> > >> >> >> >> I now have a Kudu master failing with "F1117 16:36:26.940562 >> 113049 >> >> >> >> hybrid_clock.cc:227] Couldn't get the current time: Clock >> >> >> >> unsynchronized. Status: Service unavailable: Error reading clock. >> >> >> >> Clock considered unsynchronized" >> >> >> >> >> >> >> >> Todd, when you say "we need to wait for sync before running tests >> or >> >> >> >> allow an unsafe flag to disable the check," do you mean "we" >> Impala, >> >> >> >> "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should >> >> >> >> follow? I don't see a likely candidate at >> >> >> >> >> >> >> >> https://issues.apache.org/jira/browse/KUDU-1202?jql= >> >> >> >> project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock% >> >> >> >> 20considered%20unsynchronized%5C%22%22 >> >> >> > >> >> >> > >> >> >> > Sorry, I was on my phone so wasn't as precise as I should have >> been. I >> >> >> > think that either: >> >> >> > >> >> >> > (a) Impala could add a script which runs prior to tests which waits >> >> for >> >> >> > synchronization. Apparently there is a builtin "ntp-wait" command >> in >> >> some >> >> >> > distros. We use the following which also includes restarting NTP a >> few >> >> >> > times which seems to speed things up: >> >> >> > https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b >> 97dfe48 >> >> >> > >> >> >> > (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which >> you >> >> >> could >> >> >> > use in minicluster tests. In a minicluster where all the daemons >> run >> >> on >> >> >> one >> >> >> > node, it's sort of guaranteed that the clocks are in sync, so maybe >> >> it's >> >> >> > not too bad of an idea. The only risk I can see is that the clock >> >> might >> >> >> be >> >> >> > adjusted backwards if NTP is not in sync status yet, and we may >> have >> >> some >> >> >> > assertions which would be unhappy if that happened. We could give >> it a >> >> >> try, >> >> >> > though, if the "wait-for-ntp" approach doesn't help. >> >> >> > >> >> >> > -Todd >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <[email protected]> >> >> wrote: >> >> >> >> > This particular error message isn't related to the clock error >> >> itself. >> >> >> >> It's >> >> >> >> > the sync status flag which indicates whether NTP thinks its >> phase >> >> lock >> >> >> >> loop >> >> >> >> > is in sync or not. So, changing the allowed error shouldn't have >> >> any >> >> >> >> > effect. Instead we need to wait for sync before running tests or >> >> >> allow an >> >> >> >> > unsafe flag to disable the check. >> >> >> >> > >> >> >> >> > On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> >> wrote: >> >> >> >> > >> >> >> >> >> Yeah, I don't doubt that the ntp output when you checked it was >> >> >> wrong, >> >> >> >> >> but it may be different from the time when Kudu failed to >> start, >> >> >> since >> >> >> >> >> that was probably at least several minutes later (assuming you >> >> had to >> >> >> >> >> find out about the failed job, ssh into the machine, discover >> the >> >> ntp >> >> >> >> >> warning and then check it at that time). >> >> >> >> >> >> >> >> >> >> I agree adding some logging at the beginning of jobs is the >> best >> >> >> thing >> >> >> >> >> to do now. >> >> >> >> >> >> >> >> >> >> Thanks! >> >> >> >> >> >> >> >> >> >> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple < >> [email protected] >> >> > >> >> >> >> wrote: >> >> >> >> >> > I found a place where I suppose that could go >> >> >> >> >> > (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm >> not >> >> sure >> >> >> >> >> > what I want to set it at. If "maximum error 197431 us" is to >> be >> >> >> >> >> > believed, it needs to be at least 0.2 seconds, and if >> >> >> >> >> > https://kudu.apache.org/docs/troubleshooting.html is to be >> >> >> believed, >> >> >> >> >> > it is already 10 seconds. >> >> >> >> >> > >> >> >> >> >> > I added some more logging to try and see what is going on, >> and >> >> if I >> >> >> >> >> > hit this again I'll re-open the thread. For now, I am not >> >> planning >> >> >> to >> >> >> >> >> > submit a patch to "fix" it because I'm not sure increasing >> the >> >> >> number >> >> >> >> >> > is the real solution. >> >> >> >> >> > >> >> >> >> >> > On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs < >> >> [email protected]> >> >> >> >> >> wrote: >> >> >> >> >> >> According to the error message, it looks like we can specify >> >> the >> >> >> >> >> >> '--max_clock_sync_error_usec' flag when starting the Kudu >> >> >> processes. >> >> >> >> >> >> We may want to start by printing ntptime output at the >> >> beginning >> >> >> of >> >> >> >> >> >> jobs so we can see how far off it is. If it's off by days >> then >> >> >> maybe >> >> >> >> >> >> changing the error isn't a good idea, and we'll need to >> figure >> >> out >> >> >> >> >> >> something else. >> >> >> >> >> >> >> >> >> >> >> >> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple < >> >> [email protected] >> >> >> > >> >> >> >> >> wrote: >> >> >> >> >> >>> How do we bump up the allowable error? >> >> >> >> >> >>> >> >> >> >> >> >>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs < >> >> >> [email protected]> >> >> >> >> >> wrote: >> >> >> >> >> >>>> I asked on the Kudu slack channel, they have seen issues >> >> where >> >> >> >> freshly >> >> >> >> >> >>>> provisioned ec2 nodes take some time for ntp to quiesce, >> but >> >> >> they >> >> >> >> >> >>>> didn't have a sense of how long that might take. If you >> >> checked >> >> >> >> >> >>>> ntptime after the job failed, it may be that ntp had >> enough >> >> >> time. >> >> >> >> We >> >> >> >> >> >>>> can probably consider bumping up the allowable error. >> >> >> >> >> >>>> >> >> >> >> >> >>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple < >> >> >> [email protected]> >> >> >> >> >> wrote: >> >> >> >> >> >>>>> This is the second time I have seen it, but it doesn't >> >> happen >> >> >> >> every >> >> >> >> >> >>>>> time. It could very well be a difference on ec2; already >> >> I've >> >> >> seen >> >> >> >> >> >>>>> some bugs due to my ec2 instances being Etc/UTC timezone >> >> while >> >> >> >> most >> >> >> >> >> >>>>> Impala developers work in America/Los_Angeles. >> >> >> >> >> >>>>> >> >> >> >> >> >>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs < >> >> >> [email protected]> >> >> >> >> >> wrote: >> >> >> >> >> >>>>>> No problem. If this happens again we should ask the Kudu >> >> >> >> >> developers. I >> >> >> >> >> >>>>>> haven't seen this before - I wonder if it could be some >> >> >> >> weirdness on >> >> >> >> >> >>>>>> ec2... >> >> >> >> >> >>>>>> >> >> >> >> >> >>>>>> Thanks >> >> >> >> >> >>>>>> >> >> >> >> >> >>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple < >> >> >> [email protected] >> >> >> >> > >> >> >> >> >> wrote: >> >> >> >> >> >>>>>>> Thank you for your help! >> >> >> >> >> >>>>>>> >> >> >> >> >> >>>>>>> This was on an AWS machine that has expired, but I can >> see >> >> >> from >> >> >> >> the >> >> >> >> >> >>>>>>> logs that "IMPALA_KUDU_VERSION=88b023" and >> >> >> >> >> >>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading >> >> >> >> >> >>>>>>> kudu-python-0.3.0.tar.gz" and "URL >> >> >> >> >> >>>>>>> https://native-toolchain.s3.amazonaws.com/build/264- >> >> >> >> >> e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2- >> >> >> >> >> package-ubuntu-14-04.tar.gz". >> >> >> >> >> >>>>>>> I'll add "ps aux | grep kudu" to the logging this >> machine >> >> >> does >> >> >> >> on >> >> >> >> >> >>>>>>> error, so we'll have it next time, but I did "ps >> -Afly" on >> >> >> exit >> >> >> >> and >> >> >> >> >> >>>>>>> there were no kudu processes running, it looks like. >> >> >> >> >> >>>>>>> >> >> >> >> >> >>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs < >> >> >> >> [email protected]> >> >> >> >> >> wrote: >> >> >> >> >> >>>>>>>> Can you check which version of the client you're >> building >> >> >> >> against >> >> >> >> >> >>>>>>>> (KUDU_VERSION env var) vs what Kudu version is running >> >> (ps >> >> >> aux >> >> >> >> | >> >> >> >> >> grep >> >> >> >> >> >>>>>>>> kudu >> >> >> >> >> >>>>>>>> >> >> >> >> >> >>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple < >> >> >> >> [email protected]> >> >> >> >> >> wrote: >> >> >> >> >> >>>>>>>>> Yes. >> >> >> >> >> >>>>>>>>> >> >> >> >> >> >>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs < >> >> >> >> [email protected]> >> >> >> >> >> wrote: >> >> >> >> >> >>>>>>>>>> Do you have NTP installed? >> >> >> >> >> >>>>>>>>>> >> >> >> >> >> >>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple < >> >> >> >> >> [email protected]> wrote: >> >> >> >> >> >>>>>>>>>>> I have a machine where Kudu failed to start: >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> F1116 05:02:00.173629 71098 >> tablet_server_main.cc:64] >> >> >> Check >> >> >> >> >> failed: >> >> >> >> >> >>>>>>>>>>> _s.ok() Bad status: Service unavailable: Cannot >> >> >> initialize >> >> >> >> >> clock: >> >> >> >> >> >>>>>>>>>>> Error reading clock. Clock considered >> unsynchronized >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html >> >> says: >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> "For the master and tablet server daemons, the >> >> server’s >> >> >> >> clock >> >> >> >> >> must be >> >> >> >> >> >>>>>>>>>>> synchronized using NTP. In addition, the maximum >> clock >> >> >> error >> >> >> >> >> (not to >> >> >> >> >> >>>>>>>>>>> be mistaken with the estimated error) be below a >> >> >> >> configurable >> >> >> >> >> >>>>>>>>>>> threshold. The default value is 10 seconds, but it >> >> can be >> >> >> >> set >> >> >> >> >> with the >> >> >> >> >> >>>>>>>>>>> flag --max_clock_sync_error_usec." >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> and >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> "If NTP is installed the user can monitor the >> >> >> >> synchronization >> >> >> >> >> status >> >> >> >> >> >>>>>>>>>>> by running ntptime. The relevant value is what is >> >> >> reported >> >> >> >> for >> >> >> >> >> maximum >> >> >> >> >> >>>>>>>>>>> error." >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> ntptime reports: >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> ntp_gettime() returns code 0 (OK) >> >> >> >> >> >>>>>>>>>>> time dbd66a6a.59bca948 Wed, Nov 16 2016 >> >> 5:17:30.350, >> >> >> >> >> (.350535824), >> >> >> >> >> >>>>>>>>>>> maximum error 197431 us, estimated error 71015 >> us, >> >> TAI >> >> >> >> >> offset 0 >> >> >> >> >> >>>>>>>>>>> ntp_adjtime() returns code 0 (OK) >> >> >> >> >> >>>>>>>>>>> modes 0x0 (), >> >> >> >> >> >>>>>>>>>>> offset 74989.459 us, frequency 19.950 ppm, >> interval >> >> 1 >> >> >> s, >> >> >> >> >> >>>>>>>>>>> maximum error 197431 us, estimated error 71015 >> us, >> >> >> >> >> >>>>>>>>>>> status 0x2001 (PLL,NANO), >> >> >> >> >> >>>>>>>>>>> time constant 6, precision 0.001 us, tolerance >> 500 >> >> ppm, >> >> >> >> >> >>>>>>>>>>> >> >> >> >> >> >>>>>>>>>>> So it looks like this error is anticipated, but the >> >> >> expected >> >> >> >> >> >>>>>>>>>>> conditions for it to occur are absent. Any ideas >> what >> >> >> could >> >> >> >> be >> >> >> >> >> going >> >> >> >> >> >>>>>>>>>>> on here? This is with a recent checkout of Impala >> >> master. >> >> >> >> >> >> >> >> >> >> >> >> > >> >> >> > >> >> >> > >> >> >> > -- >> >> >> > Todd Lipcon >> >> >> > Software Engineer, Cloudera >> >> >> >> >> > >> >> > >> >> > >> >> > -- >> >> > Todd Lipcon >> >> > Software Engineer, Cloudera >> >> >> > >> > >> > >> > -- >> > Todd Lipcon >> > Software Engineer, Cloudera >> > > > > -- > Todd Lipcon > Software Engineer, Cloudera
