[
https://issues.apache.org/jira/browse/KUDU-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke updated KUDU-2988:
------------------------------
Component/s: test
> built-in NTP client: sometimes minichronyd fails to start with address
> already in use
> -------------------------------------------------------------------------------------
>
> Key: KUDU-2988
> URL: https://issues.apache.org/jira/browse/KUDU-2988
> Project: Kudu
> Issue Type: Sub-task
> Components: clock, ntp-client, test
> Affects Versions: 1.11.0
> Reporter: Adar Dembo
> Priority: Major
> Labels: clock
>
> From time to time some tests that use the built-in NTP client and MiniChronyd
> fail. The failure usually looks like this:
> {noformat}
> I1013 22:02:28.429188 23364 hybrid_clock.cc:162] waiting up to
> --ntp_initial_sync_wait_secs=10 seconds for the clock to synchronize
> I1013 22:02:38.430480 23364 builtin_ntp.cc:552] server 127.16.18.212:42817:
> addresses=127.16.18.212:42817 current_address=127.16.18.212:42817
> i_pkt_total_num=0 i_pkt_valid_num=0 o_pkt_total_num=20 o_pkt_timedout_num=14
> is_synchronized=false
> last_mono=0
> last_wall=0
> last_error=0
> now_mono=264459136130
> F1013 22:02:38.430524 23364 master_main.cc:105] Check failed: _s.ok() Bad
> status: Service unavailable: Cannot initialize clock: timed out waiting for
> clock synchronisation: wallclock is not synchronized: no valid NTP responses
> yet
> *** Check failure stack trace: ***
> {noformat}
> I've also seen one failure like this:
> {noformat}
> [ RUN ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0
> 2019-11-01T00:37:31Z chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK
> +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS -SECHASH -IPV6 +DEBUG)
> 2019-11-01T00:37:32Z Disabled control of system clock
> Could not open connection to daemon
> W1101 00:37:32.553658 235 mini_chronyd.cc:189] Time spent starting chronyd:
> real 1.110s user 0.000s sys 0.011s
> 2019-11-01T00:37:32Z chronyd exiting
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/external_mini_cluster-itest-base.cc:66:
> Failure
> Failed
> Bad status: Timed out: failed to start NTP server 0: failed to contact
> chronyd in 1.000s
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/ts_recovery-itest.cc:428:
> Failure
> Expected: StartClusterOneTs({ "--fault_crash_during_log_replay=0.05" })
> doesn't generate new fatal failures in the current thread.
> Actual: it does.
> I1101 00:37:32.564697 235 external_mini_cluster-itest-base.cc:80] Found
> fatal failure
> I1101 00:37:32.566167 235 test_util.cc:136]
> -----------------------------------------------
> I1101 00:37:32.566222 235 test_util.cc:137] Had fatal failures, leaving
> test files at
> /tmp/dist-test-taskBEeRDj/test-tmp/ts_recovery-itest.0.BlockManagerType_TsRecoveryITest.TestCrashDuringLogReplay_0.1572568627695135-235
> [ FAILED ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0,
> where GetParam() = "file" (1125 ms)
> {noformat}
> In the first case, it's pretty odd that even though MiniChronyd failed to bind
> to its socket, we managed to convince ourselves that it was up and running. My
> running theory is that chronyd doesn't exit when this happens and keeps
> listening on the UNIX domain socket for our "server stats" query, which then
> convinces us that chronyd is alive and well.
> Anyway, this causes some test flakiness, so we should look into it.
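> If that theory holds, one way to harden MiniChronyd's startup check would be to
> probe the UDP port chronyd was asked to serve NTP on, instead of (or in
> addition to) querying its UNIX domain command socket. The sketch below is
> illustrative only, assuming a POSIX environment; NtpPortResponds and the
> hard-coded loopback address are hypothetical, not Kudu's actual MiniChronyd
> API.
> {noformat}
> // Hypothetical readiness probe (not Kudu's actual MiniChronyd code): send a
> // minimal client-mode NTP request to the port chronyd should be serving and
> // wait for any reply. If chronyd failed to bind that port ("address already
> // in use") but kept running, this probe times out and the check fails,
> // instead of passing because the command socket still answers.
> #include <arpa/inet.h>
> #include <netinet/in.h>
> #include <poll.h>
> #include <sys/socket.h>
> #include <unistd.h>
>
> #include <cstdint>
> #include <cstring>
>
> // Returns true if an NTP server answered a client-mode request on
> // 127.0.0.1:port within timeout_ms.
> bool NtpPortResponds(uint16_t port, int timeout_ms) {
>   int fd = socket(AF_INET, SOCK_DGRAM, 0);
>   if (fd < 0) return false;
>
>   sockaddr_in addr;
>   memset(&addr, 0, sizeof(addr));
>   addr.sin_family = AF_INET;
>   addr.sin_port = htons(port);
>   inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
>
>   // Minimal 48-byte SNTP request: LI=0, VN=4, Mode=3 (client).
>   uint8_t pkt[48] = {};
>   pkt[0] = (4 << 3) | 3;
>
>   bool ok = false;
>   if (sendto(fd, pkt, sizeof(pkt), 0,
>              reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) ==
>       static_cast<ssize_t>(sizeof(pkt))) {
>     pollfd pfd = { fd, POLLIN, 0 };
>     if (poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN)) {
>       uint8_t resp[48];
>       ok = recvfrom(fd, resp, sizeof(resp), 0, nullptr, nullptr) >= 48;
>     }
>   }
>   close(fd);
>   return ok;
> }
> {noformat}
> With a probe along these lines, the "address already in use" case would fail
> fast at server startup rather than surfacing later as a clock-synchronization
> timeout in the test.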