On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <[email protected]> wrote:
> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and > 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),". > Do you see anything from ntp in /var/log/messages or syslog, etc? Usually once NTP is synchronized it's fairly hard to knock it out of sync unless it loses internet access. Is this a stress test or normal functional tests? > > On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <[email protected]> wrote: > > On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <[email protected]> > wrote: > > > >> I now have a Kudu master failing with "F1117 16:36:26.940562 113049 > >> hybrid_clock.cc:227] Couldn't get the current time: Clock > >> unsynchronized. Status: Service unavailable: Error reading clock. > >> Clock considered unsynchronized" > >> > >> Todd, when you say "we need to wait for sync before running tests or > >> allow an unsafe flag to disable the check," do you mean "we" Impala, > >> "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should > >> follow? I don't see a likely candidate at > >> > >> https://issues.apache.org/jira/browse/KUDU-1202?jql= > >> project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock% > >> 20considered%20unsynchronized%5C%22%22 > > > > > > Sorry, I was on my phone so wasn't as precise as I should have been. I > > think that either: > > > > (a) Impala could add a script which runs prior to tests which waits for > > synchronization. Apparently there is a builtin "ntp-wait" command in some > > distros. We use the following which also includes restarting NTP a few > > times which seems to speed things up: > > https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b97dfe48 > > > > (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which you > could > > use in minicluster tests. In a minicluster where all the daemons run on > one > > node, it's sort of guaranteed that the clocks are in sync, so maybe it's > > not too bad of an idea. The only risk I can see is that the clock might > be > > adjusted backwards if NTP is not in sync status yet, and we may have some > > assertions which would be unhappy if that happened. We could give it a > try, > > though, if the "wait-for-ntp" approach doesn't help. > > > > -Todd > > > > > > > > > > > > > >> > >> > >> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <[email protected]> wrote: > >> > This particular error message isn't related to the clock error itself. > >> It's > >> > the sync status flag which indicates whether NTP thinks its phase lock > >> loop > >> > is in sync or not. So, changing the allowed error shouldn't have any > >> > effect. Instead we need to wait for sync before running tests or > allow an > >> > unsafe flag to disable the check. > >> > > >> > On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <[email protected]> wrote: > >> > > >> >> Yeah, I don't doubt that the ntp output when you checked it was > wrong, > >> >> but it may be different from the time when Kudu failed to start, > since > >> >> that was probably at least several minutes later (assuming you had to > >> >> find out about the failed job, ssh into the machine, discover the ntp > >> >> warning and then check it at that time). > >> >> > >> >> I agree adding some logging at the beginning of jobs is the best > thing > >> >> to do now. > >> >> > >> >> Thanks! > >> >> > >> >> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <[email protected]> > >> wrote: > >> >> > I found a place where I suppose that could go > >> >> > (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not sure > >> >> > what I want to set it at. If "maximum error 197431 us" is to be > >> >> > believed, it needs to be at least 0.2 seconds, and if > >> >> > https://kudu.apache.org/docs/troubleshooting.html is to be > believed, > >> >> > it is already 10 seconds. > >> >> > > >> >> > I added some more logging to try and see what is going on, and if I > >> >> > hit this again I'll re-open the thread. For now, I am not planning > to > >> >> > submit a patch to "fix" it because I'm not sure increasing the > number > >> >> > is the real solution. > >> >> > > >> >> > On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <[email protected]> > >> >> wrote: > >> >> >> According to the error message, it looks like we can specify the > >> >> >> '--max_clock_sync_error_usec' flag when starting the Kudu > processes. > >> >> >> We may want to start by printing ntptime output at the beginning > of > >> >> >> jobs so we can see how far off it is. If it's off by days then > maybe > >> >> >> changing the error isn't a good idea, and we'll need to figure out > >> >> >> something else. > >> >> >> > >> >> >> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <[email protected] > > > >> >> wrote: > >> >> >>> How do we bump up the allowable error? > >> >> >>> > >> >> >>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs < > [email protected]> > >> >> wrote: > >> >> >>>> I asked on the Kudu slack channel, they have seen issues where > >> freshly > >> >> >>>> provisioned ec2 nodes take some time for ntp to quiesce, but > they > >> >> >>>> didn't have a sense of how long that might take. If you checked > >> >> >>>> ntptime after the job failed, it may be that ntp had enough > time. > >> We > >> >> >>>> can probably consider bumping up the allowable error. > >> >> >>>> > >> >> >>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple < > [email protected]> > >> >> wrote: > >> >> >>>>> This is the second time I have seen it, but it doesn't happen > >> every > >> >> >>>>> time. It could very well be a difference on ec2; already I've > seen > >> >> >>>>> some bugs due to my ec2 instances being Etc/UTC timezone while > >> most > >> >> >>>>> Impala developers work in America/Los_Angeles. > >> >> >>>>> > >> >> >>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs < > [email protected]> > >> >> wrote: > >> >> >>>>>> No problem. If this happens again we should ask the Kudu > >> >> developers. I > >> >> >>>>>> haven't seen this before - I wonder if it could be some > >> weirdness on > >> >> >>>>>> ec2... > >> >> >>>>>> > >> >> >>>>>> Thanks > >> >> >>>>>> > >> >> >>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple < > [email protected] > >> > > >> >> wrote: > >> >> >>>>>>> Thank you for your help! > >> >> >>>>>>> > >> >> >>>>>>> This was on an AWS machine that has expired, but I can see > from > >> the > >> >> >>>>>>> logs that "IMPALA_KUDU_VERSION=88b023" and > >> >> >>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading > >> >> >>>>>>> kudu-python-0.3.0.tar.gz" and "URL > >> >> >>>>>>> https://native-toolchain.s3.amazonaws.com/build/264- > >> >> e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2- > >> >> package-ubuntu-14-04.tar.gz". > >> >> >>>>>>> I'll add "ps aux | grep kudu" to the logging this machine > does > >> on > >> >> >>>>>>> error, so we'll have it next time, but I did "ps -Afly" on > exit > >> and > >> >> >>>>>>> there were no kudu processes running, it looks like. > >> >> >>>>>>> > >> >> >>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs < > >> [email protected]> > >> >> wrote: > >> >> >>>>>>>> Can you check which version of the client you're building > >> against > >> >> >>>>>>>> (KUDU_VERSION env var) vs what Kudu version is running (ps > aux > >> | > >> >> grep > >> >> >>>>>>>> kudu > >> >> >>>>>>>> > >> >> >>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple < > >> [email protected]> > >> >> wrote: > >> >> >>>>>>>>> Yes. > >> >> >>>>>>>>> > >> >> >>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs < > >> [email protected]> > >> >> wrote: > >> >> >>>>>>>>>> Do you have NTP installed? > >> >> >>>>>>>>>> > >> >> >>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple < > >> >> [email protected]> wrote: > >> >> >>>>>>>>>>> I have a machine where Kudu failed to start: > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> F1116 05:02:00.173629 71098 tablet_server_main.cc:64] > Check > >> >> failed: > >> >> >>>>>>>>>>> _s.ok() Bad status: Service unavailable: Cannot > initialize > >> >> clock: > >> >> >>>>>>>>>>> Error reading clock. Clock considered unsynchronized > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html says: > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> "For the master and tablet server daemons, the server’s > >> clock > >> >> must be > >> >> >>>>>>>>>>> synchronized using NTP. In addition, the maximum clock > error > >> >> (not to > >> >> >>>>>>>>>>> be mistaken with the estimated error) be below a > >> configurable > >> >> >>>>>>>>>>> threshold. The default value is 10 seconds, but it can be > >> set > >> >> with the > >> >> >>>>>>>>>>> flag --max_clock_sync_error_usec." > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> and > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> "If NTP is installed the user can monitor the > >> synchronization > >> >> status > >> >> >>>>>>>>>>> by running ntptime. The relevant value is what is > reported > >> for > >> >> maximum > >> >> >>>>>>>>>>> error." > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> ntptime reports: > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> ntp_gettime() returns code 0 (OK) > >> >> >>>>>>>>>>> time dbd66a6a.59bca948 Wed, Nov 16 2016 5:17:30.350, > >> >> (.350535824), > >> >> >>>>>>>>>>> maximum error 197431 us, estimated error 71015 us, TAI > >> >> offset 0 > >> >> >>>>>>>>>>> ntp_adjtime() returns code 0 (OK) > >> >> >>>>>>>>>>> modes 0x0 (), > >> >> >>>>>>>>>>> offset 74989.459 us, frequency 19.950 ppm, interval 1 > s, > >> >> >>>>>>>>>>> maximum error 197431 us, estimated error 71015 us, > >> >> >>>>>>>>>>> status 0x2001 (PLL,NANO), > >> >> >>>>>>>>>>> time constant 6, precision 0.001 us, tolerance 500 ppm, > >> >> >>>>>>>>>>> > >> >> >>>>>>>>>>> So it looks like this error is anticipated, but the > expected > >> >> >>>>>>>>>>> conditions for it to occur are absent. Any ideas what > could > >> be > >> >> going > >> >> >>>>>>>>>>> on here? This is with a recent checkout of Impala master. > >> >> > >> > > > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
