That didn't do it: http://35.164.73.121:8080/job/ubuntu-14.04-from-scratch/184/consoleFull
The setup of the machine (which runs before the job starts and is not visible at that page) includes:

+ sudo sed -i s/ubuntu\.pool/amazon\.pool/ /etc/ntp.conf
+ grep amazon /etc/ntp.conf
server 0.amazon.pool.ntp.org
server 1.amazon.pool.ntp.org
server 2.amazon.pool.ntp.org
server 3.amazon.pool.ntp.org
+ grep ubuntu /etc/ntp.conf
server ntp.ubuntu.com

As before, it looks like ntptime is reporting a max error of 16 seconds. I'm going to try to add ntp-wait to our cluster start scripts to make sure that ntp is in a good state the moment before we try to start Kudu.
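The shape I have in mind is roughly the following, at the top of the start script. This assumes the ntp-wait that ships with the ntp package (flag spellings vary a bit across versions), and the last line is a placeholder, not our real start command:

  # Log the NTP state so post-mortems don't depend on ssh-ing in later.
  ntptime || true
  # Block until ntpd reports a synchronized state: check every 10 seconds,
  # giving up after 60 tries (10 minutes).
  if ! ntp-wait -v -s 10 -n 60; then
    echo "ntpd never reached a synchronized state; aborting" >&2
    exit 1
  fi
  ./start-cluster.sh  # placeholder for whatever actually launches the minicluster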
On Sat, Nov 26, 2016 at 1:51 PM, Jim Apple <[email protected]> wrote:
> OK, I changed those lines in /etc/ntp.conf; let's see how that goes.
>
> On Mon, Nov 21, 2016 at 7:19 AM, Todd Lipcon <[email protected]> wrote:
>> On Sat, Nov 19, 2016 at 10:50 AM, Jim Apple <[email protected]> wrote:
>>> Can you do so?
>>>
>>> This job used ntp-wait before starting and has syslog sent to the output; perhaps that will help debug:
>>>
>>> http://ec2-35-161-220-160.us-west-2.compute.amazonaws.com:8080/job/ubuntu-14.04-from-scratch/147/
>>
>> Yea, it looks from the last ntptime output like it lost its NTP synchronization. Looking at the startup log, it appears that it's just configured with the default Ubuntu NTP servers rather than using multiple hosts inside EC2. Is it possible for you to adjust your AMI or change the startup script to use the following lines in /etc/ntp.conf?
>>
>> server 0.amazon.pool.ntp.org
>> server 1.amazon.pool.ntp.org
>> server 2.amazon.pool.ntp.org
>> server 3.amazon.pool.ntp.org
>>
>> We could add a flag to ignore NTP sync, but it looks like once NTP goes out of sync it's also reporting a maxerror of 16 seconds, which would also cause an issue.
>>
>> -Todd
>>
>>> In the meantime, it seems we could use a workaround in order to get this working in ec2 more consistently.
>>>
>>> On Thu, Nov 17, 2016 at 12:21 PM, Todd Lipcon <[email protected]> wrote:
>>>> Nothing else comes to mind. Let me know if you think the solution of allowing Kudu to ignore NTP status is preferable and I'll prioritize a patch to unblock you guys.
>>>>
>>>> -Todd
>>>>
>>>> On Thu, Nov 17, 2016 at 11:43 AM, Jim Apple <[email protected]> wrote:
>>>>> This is a normal functional test. The node is gone now, but I will add "cat /var/log/syslog" to the logging. Anything else I should add?
>>>>>
>>>>> On Thu, Nov 17, 2016 at 10:55 AM, Todd Lipcon <[email protected]> wrote:
>>>>>> On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <[email protected]> wrote:
>>>>>>> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),".
>>>>>>
>>>>>> Do you see anything from ntp in /var/log/messages or syslog, etc? Usually once NTP is synchronized it's fairly hard to knock it out of sync unless it loses internet access. Is this a stress test or normal functional tests?
>>>>>>
>>>>>>> On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <[email protected]> wrote:
>>>>>>>> On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>> I now have a Kudu master failing with "F1117 16:36:26.940562 113049 hybrid_clock.cc:227] Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error reading clock. Clock considered unsynchronized"
>>>>>>>>>
>>>>>>>>> Todd, when you say "we need to wait for sync before running tests or allow an unsafe flag to disable the check," do you mean "we" Impala, "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should follow? I don't see a likely candidate at
>>>>>>>>> https://issues.apache.org/jira/browse/KUDU-1202?jql=project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock%20considered%20unsynchronized%5C%22%22
>>>>>>>>
>>>>>>>> Sorry, I was on my phone so wasn't as precise as I should have been. I think that either:
>>>>>>>>
>>>>>>>> (a) Impala could add a script which runs prior to tests and waits for synchronization. Apparently there is a builtin "ntp-wait" command in some distros. We use the following, which also restarts NTP a few times; that seems to speed things up:
>>>>>>>> https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b97dfe48
>>>>>>>>
>>>>>>>> (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which you could use in minicluster tests. In a minicluster where all the daemons run on one node, it's sort of guaranteed that the clocks are in sync, so maybe it's not too bad of an idea. The only risk I can see is that the clock might be adjusted backwards if NTP is not in sync status yet, and we may have some assertions which would be unhappy if that happened. We could give it a try, though, if the "wait-for-ntp" approach doesn't help.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>>> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <[email protected]> wrote:
>>>>>>>>>> This particular error message isn't related to the clock error itself. It's the sync status flag, which indicates whether NTP thinks its phase lock loop is in sync or not. So, changing the allowed error shouldn't have any effect. Instead we need to wait for sync before running tests or allow an unsafe flag to disable the check.
>>>>>>>>>>
>>>>>>>>>> On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> wrote:
>>>>>>>>>>> Yeah, I don't doubt that the ntp output when you checked it was wrong, but it may be different from the time when Kudu failed to start, since that was probably at least several minutes later (assuming you had to find out about the failed job, ssh into the machine, discover the ntp warning, and then check it at that time).
>>>>>>>>>>>
>>>>>>>>>>> I agree adding some logging at the beginning of jobs is the best thing to do now.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
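As an aside on option (a) above: Todd's gist boils down to roughly the loop below. This is my paraphrase, not the gist verbatim, and it assumes Ubuntu-style service management plus the amazon pool servers discussed at the top of this thread:

  # Restarting ntpd and stepping the clock with ntpdate tends to converge
  # much faster than letting a freshly started daemon slew gradually.
  for attempt in 1 2 3; do
    sudo service ntp stop
    sudo ntpdate 0.amazon.pool.ntp.org 1.amazon.pool.ntp.org || true
    sudo service ntp start
    # Give ntpd up to two minutes to report sync before retrying.
    if ntp-wait -s 5 -n 24; then
      break
    fi
  done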
>>>>>>>>>>> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>> I found a place where I suppose that could go (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not sure what I want to set it at. If "maximum error 197431 us" is to be believed, it needs to be at least 0.2 seconds, and if https://kudu.apache.org/docs/troubleshooting.html is to be believed, it is already 10 seconds.
>>>>>>>>>>>>
>>>>>>>>>>>> I added some more logging to try to see what is going on, and if I hit this again I'll re-open the thread. For now, I am not planning to submit a patch to "fix" it, because I'm not sure increasing the number is the real solution.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>> According to the error message, it looks like we can specify the '--max_clock_sync_error_usec' flag when starting the Kudu processes. We may want to start by printing ntptime output at the beginning of jobs so we can see how far off it is. If it's off by days then maybe changing the error isn't a good idea, and we'll need to figure out something else.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>> How do we bump up the allowable error?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>> I asked on the Kudu slack channel; they have seen issues where freshly provisioned ec2 nodes take some time for ntp to quiesce, but they didn't have a sense of how long that might take. If you checked ntptime after the job failed, it may be that ntp had enough time. We can probably consider bumping up the allowable error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>> This is the second time I have seen it, but it doesn't happen every time. It could very well be a difference on ec2; already I've seen some bugs due to my ec2 instances being in the Etc/UTC timezone while most Impala developers work in America/Los_Angeles.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>> No problem. If this happens again we should ask the Kudu developers. I haven't seen this before - I wonder if it could be some weirdness on ec2...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
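For the record, if we ever do raise the threshold, it should just be one gflags line in the node template directory mentioned above; something like the following, where the 20-second value and the file name are my guesses rather than settled choices:

  # testdata/cluster/node_templates/cdh5/etc/kudu/master.conf (file name is a guess)
  --max_clock_sync_error_usec=20000000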
>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> Thank you for your help!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This was on an AWS machine that has expired, but I can see from the logs that "IMPALA_KUDU_VERSION=88b023" and "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading kudu-python-0.3.0.tar.gz" and "URL https://native-toolchain.s3.amazonaws.com/build/264-e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-package-ubuntu-14-04.tar.gz". I'll add "ps aux | grep kudu" to the logging this machine does on error, so we'll have it next time, but I did "ps -Afly" on exit and there were no kudu processes running, it looks like.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> Can you check which version of the client you're building against (KUDU_VERSION env var) vs what Kudu version is running (ps aux | grep kudu)?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> Do you have NTP installed?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> I have a machine where Kudu failed to start:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> F1116 05:02:00.173629 71098 tablet_server_main.cc:64] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html says:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "For the master and tablet server daemons, the server’s clock must be synchronized using NTP. In addition, the maximum clock error (not to be mistaken with the estimated error) must be below a configurable threshold. The default value is 10 seconds, but it can be set with the flag --max_clock_sync_error_usec."
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "If NTP is installed, the user can monitor the synchronization status by running ntptime. The relevant value is what is reported for maximum error."
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ntptime reports:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ntp_gettime() returns code 0 (OK)
>>>>>>>>>>>>>>>>>>>>>>   time dbd66a6a.59bca948  Wed, Nov 16 2016  5:17:30.350, (.350535824),
>>>>>>>>>>>>>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us, TAI offset 0
>>>>>>>>>>>>>>>>>>>>>> ntp_adjtime() returns code 0 (OK)
>>>>>>>>>>>>>>>>>>>>>>   modes 0x0 (),
>>>>>>>>>>>>>>>>>>>>>>   offset 74989.459 us, frequency 19.950 ppm, interval 1 s,
>>>>>>>>>>>>>>>>>>>>>>   maximum error 197431 us, estimated error 71015 us,
>>>>>>>>>>>>>>>>>>>>>>   status 0x2001 (PLL,NANO),
>>>>>>>>>>>>>>>>>>>>>>   time constant 6, precision 0.001 us, tolerance 500 ppm,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So it looks like this error is anticipated, but the expected conditions for it to occur are absent. Any ideas what could be going on here? This is with a recent checkout of Impala master.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Todd Lipcon
>>>>>>>> Software Engineer, Cloudera
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
