Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/7477 )
Change subject: [clock] add a built-in NTP client implementation ...................................................................... Patch Set 16: (50 comments) I didn't read any RFCs so there's a fair amount of missing coverage there (would be great if Todd could take a look since he originally implemented this). http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h File src/kudu/clock/builtin_ntp.h: http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h@45 PS13, Line 45: class BuiltInNtp : public TimeService { Could use some more documentation, at least for the class (i.e. what are we trying to achieve with this?) and the private members. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h@73 PS13, Line 73: bool TryReceivePacket(); Should doc (amongst other things) what the return value means. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h@92 PS13, Line 92: int64_t last_compute_mono_ = 0; : int64_t last_compute_wall_ = 0; : int64_t last_compute_error_ = 0; : bool is_synchronized_ = false; Might want to encapsulate these in a struct so it's easy to make a local copy with the lock held. http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc File src/kudu/clock/builtin_ntp.cc: http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@78 PS9, Line 78: TAG_FLAG(builtin_ntp_request_timeout_ms, experimental); > maybe this should be a flag? Done http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@286 PS9, Line 286: : // The monotonic timestamp wh > I seem to recall I looked into how to find this info and determined that it Do we need to convert this into a TODO, or a note? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@385 PS9, Line 385: } > i seem to recall we now have some more general "wait for NTP at startup' th Yeah, it looks like Todd's reimplementation of this logic remained and is now duplicated here and in system_ntp.cc. Could we consolidate them, perhaps in HybridClock::Init? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@474 PS9, Line 474: > maybe we should include the size of the packet, or the packet itself? Alexey, what do you think? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@509 PS9, Line 509: { : MutexLock l(state_lock_); > I guess we should read the RFC carefully and figure this out Alexey, did you explore this comment? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@523 PS9, Line 523: // will get EPIPE. : socket_.Shutdown(true, true); > improve this error Done http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@536 PS9, Line 536: // unclear what clock those come from > probably worth doing this Alexey, what do you think of this? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@583 PS9, Line 583: // 3. The Originate Timestamp in the server reply should match the > should resolve this too Alexey, did you work this out? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@666 PS9, Line 666: // Originate Timestamp T1 time request sent by client > seems like we could probably factor this out and make it testable +1, Alexey could you do this? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@754 PS9, Line 754: > yea todo Done http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc File src/kudu/clock/builtin_ntp.cc: http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@58 PS13, Line 58: "0.ubuntu.pool.ntp.org," : "1.ubuntu.pool.ntp.org," : "2.ubuntu.pool.ntp.org," : "3.ubuntu.pool.ntp.org", Shouldn't we default to [0-3].pool.ntp.org? As per https://www.ntppool.org/en/use.html. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@66 PS13, Line 66: : // As of 2019, recent version of ntpd allows setting minpoll as low as 3 : // (2^3 = 8 seconds), and chronyd supports minpoll as low as to -6 : // (2^-6 = 1/64 second), so 16 seconds looks like a reasonble default. Do we also want to express this in terms of powers of 2? Is that an interface users are more comfortable with? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@87 PS13, Line 87: DEFINE_int32(ntp_initial_sync_wait_secs, 60, Why did this move from one time service impl to another? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@167 PS13, Line 167: // NOTE: this is used as a nonce, not the actual time. Does this comment belong near the transmit times? I'm thinking of CreateClientPacket's behavior. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@351 PS13, Line 351: currently Nit: already said current before, remove one of these. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@352 PS13, Line 352: If we : // run out of addresses, we will re-resolve. This doesn't appear to be implemented yet; move to a TODO? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@370 PS13, Line 370: Status BuiltInNtp::CreateFromFlags(unique_ptr<TimeService>* ntp) { Since Init() is required and already returns a Status, you could defer the ParseStrings call to Init(), to enable the use of a simpler constructor as in the other TimeService implementations. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@374 PS13, Line 374: --builtin-ntp-servers Nit: --builtin_ntp_servers http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@382 PS13, Line 382: Nit: remove empty line http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@407 PS13, Line 407: CHECK_OK(to_bind.ParseString("0.0.0.0", 0)); Do we need to make this configurable? Do we always want to bind to every interface? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@411 PS13, Line 411: RETURN_NOT_OK_PREPEND(socket_.SetRecvTimeout(MonoDelta::FromSeconds(0.5)), : "couldn't set socket recv timeout"); Why is this important? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@433 PS13, Line 433: return Status::Incomplete(Substitute( Why not TimedOut? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@440 PS13, Line 440: return Status::OK(); Do you think we should return non-OK if state_ transitioned to kShutdown (i.e. concurrent thread called Shutdown)? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@458 PS13, Line 458: return Status::IllegalState("wallclock is not synchronized", I think the system NTP implementation returns ServiceUnavailable when not synchronized. Maybe we should use that here and on L470? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@489 PS13, Line 489: for (const auto& s : servers_) { Do we need to protect any server state with a lock? What if the poll thread simultaneously writes to a member while another thread is in DumpDiagnostics? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@515 PS13, Line 515: state_ = kShutdown; Maybe we should release state_lock_ after this? No need to hold it during the remaining syscalls, right? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@518 PS13, Line 518: state_cond_.Broadcast(); Signal()? We're just waking up one thread in Init. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@528 PS13, Line 528: } Should we close socket_ here, after the poll thread has exited? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@545 PS13, Line 545: PLOG(WARNING) << "NTP recv error"; Do we want to throttle this? Or is it rare? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@572 PS13, Line 572: memcpy(&resp, buf, sizeof(NtpPacket)); Could you recvfrom() directly into the NtpPacket and avoid the copy? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@609 PS13, Line 609: LI Where is this enforced? On L624 we reject servers whose LI is 3, not 0. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@625 PS13, Line 625: VLOG(1) << Substitute("NTP server $0 is unsynchronized", Curious why this and L634 are VLOG and not KLOG_EVERY_N_SECS? Are they common? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@717 PS13, Line 717: RecordResponse(p->server, rr); No std::move here? Trivially copyable? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@733 PS13, Line 733: // Mark the server as offline. When is this ever undone? What happens when we've marked all of our servers as offline? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@746 PS13, Line 746: if (now > s->next_poll_) { Maybe >= so that, if all of this code gets run quickly at Init() time (and both 'now' and 's->next_poll_' wind up with the same value), we won't wait for the next poll period? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@748 PS13, Line 748: int delay = poll_interval + rng_.Uniform(poll_interval); Is the jitter mandated by a spec? If not, what's the rationale? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@893 PS13, Line 893: // This lambda builds an intersection interval given set of correctness : // intervals and reference local time. Why define this as a lambda given it's only called once? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@996 PS13, Line 996: std::lock_guard<RWMutex> l(last_compute_lock_); This could be deferred to L1000. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@1024 PS13, Line 1024: state_cond_.Broadcast(); Should be Signal, right? http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/ntp-test.cc File src/kudu/clock/ntp-test.cc: http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/ntp-test.cc@48 PS9, Line 48: > this isn't really a test :) Yeah, now that we have more legit tests, perhaps we should remove this? Moreover, won't the WalltimeWithError() calls fail if the local machine has no Internet connectivity? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc File src/kudu/clock/ntp-test.cc: http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@89 PS13, Line 89: scenariso scenarios http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@101 PS13, Line 101: // TODO(aserbin): fix the described chrony's bug, at least in thirdparty Yeah do you have a sense for how much work that would be? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@130 PS13, Line 130: TEST_F(BuiltinNtpWithMiniChronydTest, Basic) { This test looks quite similar to TestNtp_SystemVsBuiltinNtp. http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@131 PS13, Line 131: const uint16_t base_port = 10123 + getpid() % 1000; How much work would it be to patch chrony to allow binding to an ephemeral port? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@134 PS13, Line 134: for (auto idx = 0; idx < 3; ++idx) { : MiniChronydOptions options; : options.index = idx; : options.bindaddress = GetBindIpForDaemon(options.index + 1, kDefaultBindMode); : options.port = base_port + idx * 100; : unique_ptr<MiniChronyd> chrony(new MiniChronyd(options)); : ASSERT_OK(chrony->Start()); : : servers_endpoints.emplace_back(options.bindaddress, options.port); : servers.emplace_back(std::move(chrony)); : } All of the tests repeat this sort of block, could you encapsulate it in a function? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@305 PS13, Line 305: User Do you mean "Use" here (and below on L331)? http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@320 PS13, Line 320: { : const vector<HostPort> refs = { unsync_servers_refs[1] }; : : // Verify chronyd's client itself does not accept the set of of NTP sources. : NO_FATALS(CheckNoNtpSource(refs)); : : BuiltInNtp c({ unsync_servers_refs[1] }); : NO_FATALS(CheckInitUnsync(&c)); : NO_FATALS(CheckWallTimeUnavailable(&c)); : } Some of this repetition could be elided via lambda. -- To view, visit http://gerrit.cloudera.org:8080/7477 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ieb9eee9f0334236f39617492bd6f01304d1a0255 Gerrit-Change-Number: 7477 Gerrit-PatchSet: 16 Gerrit-Owner: Todd Lipcon <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Brock Noland <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Sat, 14 Sep 2019 00:19:33 +0000 Gerrit-HasComments: Yes
