Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7477 )

Change subject: [clock] add a built-in NTP client implementation
......................................................................


Patch Set 16:

(50 comments)

I didn't read any RFCs so there's a fair amount of missing coverage there 
(would be great if Todd could take a look since he originally implemented this).

http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h
File src/kudu/clock/builtin_ntp.h:

http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h@45
PS13, Line 45: class BuiltInNtp : public TimeService {
Could use some more documentation, at least for the class (i.e. what are we 
trying to achieve with this?) and the private members.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h@73
PS13, Line 73:   bool TryReceivePacket();
Should doc (amongst other things) what the return value means.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.h@92
PS13, Line 92:   int64_t last_compute_mono_ = 0;
             :   int64_t last_compute_wall_ = 0;
             :   int64_t last_compute_error_ = 0;
             :   bool is_synchronized_ = false;
Might want to encapsulate these in a struct so it's easy to make a local copy 
with the lock held.


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc
File src/kudu/clock/builtin_ntp.cc:

http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@78
PS9, Line 78: TAG_FLAG(builtin_ntp_request_timeout_ms, experimental);
> maybe this should be a flag?
Done


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@286
PS9, Line 286:
             :   // The monotonic timestamp wh
> I seem to recall I looked into how to find this info and determined that it
Do we need to convert this into a TODO, or a note?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@385
PS9, Line 385:   }
> i seem to recall we now have some more general "wait for NTP at startup' th
Yeah, it looks like Todd's reimplementation of this logic remained and is now 
duplicated here and in system_ntp.cc. Could we consolidate them, perhaps in 
HybridClock::Init?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@474
PS9, Line 474:
> maybe we should include the size of the packet, or the packet itself?
Alexey, what do you think?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@509
PS9, Line 509:   {
             :     MutexLock l(state_lock_);
> I guess we should read the RFC carefully and figure this out
Alexey, did you explore this comment?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@523
PS9, Line 523:     // will get EPIPE.
             :     socket_.Shutdown(true, true);
> improve this error
Done


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@536
PS9, Line 536:   // unclear what clock those come from
> probably worth doing this
Alexey, what do you think of this?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@583
PS9, Line 583:   // 3.  The Originate Timestamp in the server reply should 
match the
> should resolve this too
Alexey, did you work this out?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@666
PS9, Line 666:   //    Originate Timestamp     T1   time request sent by client
> seems like we could probably factor this out and make it testable
+1, Alexey could you do this?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/builtin_ntp.cc@754
PS9, Line 754:
> yea todo
Done


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc
File src/kudu/clock/builtin_ntp.cc:

http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@58
PS13, Line 58:               "0.ubuntu.pool.ntp.org,"
             :               "1.ubuntu.pool.ntp.org,"
             :               "2.ubuntu.pool.ntp.org,"
             :               "3.ubuntu.pool.ntp.org",
Shouldn't we default to [0-3].pool.ntp.org? As per 
https://www.ntppool.org/en/use.html.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@66
PS13, Line 66:
             : // As of 2019, recent version of ntpd allows setting minpoll as 
low as 3
             : // (2^3 = 8 seconds), and chronyd supports minpoll as low as to 
-6
             : // (2^-6 = 1/64 second), so 16 seconds looks like a reasonble 
default.
Do we also want to express this in terms of powers of 2? Is that an interface 
users are more comfortable with?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@87
PS13, Line 87: DEFINE_int32(ntp_initial_sync_wait_secs, 60,
Why did this move from one time service impl to another?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@167
PS13, Line 167:   // NOTE: this is used as a nonce, not the actual time.
Does this comment belong near the transmit times? I'm thinking of 
CreateClientPacket's behavior.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@351
PS13, Line 351: currently
Nit: already said current before, remove one of these.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@352
PS13, Line 352:  If we
              :   // run out of addresses, we will re-resolve.
This doesn't appear to be implemented yet; move to a TODO?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@370
PS13, Line 370: Status BuiltInNtp::CreateFromFlags(unique_ptr<TimeService>* 
ntp) {
Since Init() is required and already returns a Status, you could defer the 
ParseStrings call to Init(), to enable the use of a simpler constructor as in 
the other TimeService implementations.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@374
PS13, Line 374: --builtin-ntp-servers
Nit: --builtin_ntp_servers


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@382
PS13, Line 382:
Nit: remove empty line


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@407
PS13, Line 407:   CHECK_OK(to_bind.ParseString("0.0.0.0", 0));
Do we need to make this configurable? Do we always want to bind to every 
interface?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@411
PS13, Line 411:   
RETURN_NOT_OK_PREPEND(socket_.SetRecvTimeout(MonoDelta::FromSeconds(0.5)),
              :                         "couldn't set socket recv timeout");
Why is this important?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@433
PS13, Line 433:         return Status::Incomplete(Substitute(
Why not TimedOut?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@440
PS13, Line 440:   return Status::OK();
Do you think we should return non-OK if state_ transitioned to kShutdown (i.e. 
concurrent thread called Shutdown)?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@458
PS13, Line 458:     return Status::IllegalState("wallclock is not synchronized",
I think the system NTP implementation returns ServiceUnavailable when not 
synchronized. Maybe we should use that here and on L470?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@489
PS13, Line 489:   for (const auto& s : servers_) {
Do we need to protect any server state with a lock? What if the poll thread 
simultaneously writes to a member while another thread is in DumpDiagnostics?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@515
PS13, Line 515:     state_ = kShutdown;
Maybe we should release state_lock_ after this? No need to hold it during the 
remaining syscalls, right?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@518
PS13, Line 518:     state_cond_.Broadcast();
Signal()? We're just waking up one thread in Init.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@528
PS13, Line 528:   }
Should we close socket_ here, after the poll thread has exited?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@545
PS13, Line 545:     PLOG(WARNING) << "NTP recv error";
Do we want to throttle this? Or is it rare?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@572
PS13, Line 572:   memcpy(&resp, buf, sizeof(NtpPacket));
Could you recvfrom() directly into the NtpPacket and avoid the copy?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@609
PS13, Line 609: LI
Where is this enforced? On L624 we reject servers whose LI is 3, not 0.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@625
PS13, Line 625:     VLOG(1) << Substitute("NTP server $0 is unsynchronized",
Curious why this and L634 are VLOG and not KLOG_EVERY_N_SECS? Are they common?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@717
PS13, Line 717:   RecordResponse(p->server, rr);
No std::move here? Trivially copyable?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@733
PS13, Line 733:       // Mark the server as offline.
When is this ever undone? What happens when we've marked all of our servers as 
offline?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@746
PS13, Line 746:     if (now > s->next_poll_) {
Maybe >= so that, if all of this code gets run quickly at Init() time (and both 
'now' and 's->next_poll_' wind up with the same value), we won't wait for the 
next poll period?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@748
PS13, Line 748:       int delay = poll_interval + rng_.Uniform(poll_interval);
Is the jitter mandated by a spec? If not, what's the rationale?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@893
PS13, Line 893:   // This lambda builds an intersection interval given set of 
correctness
              :   // intervals and reference local time.
Why define this as a lambda given it's only called once?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@996
PS13, Line 996:     std::lock_guard<RWMutex> l(last_compute_lock_);
This could be deferred to L1000.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/builtin_ntp.cc@1024
PS13, Line 1024:       state_cond_.Broadcast();
Should be Signal, right?


http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/ntp-test.cc
File src/kudu/clock/ntp-test.cc:

http://gerrit.cloudera.org:8080/#/c/7477/9/src/kudu/clock/ntp-test.cc@48
PS9, Line 48:
> this isn't really a test :)
Yeah, now that we have more legit tests, perhaps we should remove this? 
Moreover, won't the WalltimeWithError() calls fail if the local machine has no 
Internet connectivity?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc
File src/kudu/clock/ntp-test.cc:

http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@89
PS13, Line 89: scenariso
scenarios


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@101
PS13, Line 101: // TODO(aserbin): fix the described chrony's bug, at least in 
thirdparty
Yeah do you have a sense for how much work that would be?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@130
PS13, Line 130: TEST_F(BuiltinNtpWithMiniChronydTest, Basic) {
This test looks quite similar to TestNtp_SystemVsBuiltinNtp.


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@131
PS13, Line 131:   const uint16_t base_port = 10123 + getpid() % 1000;
How much work would it be to patch chrony to allow binding to an ephemeral port?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@134
PS13, Line 134:   for (auto idx = 0; idx < 3; ++idx) {
              :     MiniChronydOptions options;
              :     options.index = idx;
              :     options.bindaddress = GetBindIpForDaemon(options.index + 1, 
kDefaultBindMode);
              :     options.port = base_port + idx * 100;
              :     unique_ptr<MiniChronyd> chrony(new MiniChronyd(options));
              :     ASSERT_OK(chrony->Start());
              :
              :     servers_endpoints.emplace_back(options.bindaddress, 
options.port);
              :     servers.emplace_back(std::move(chrony));
              :   }
All of the tests repeat this sort of block, could you encapsulate it in a 
function?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@305
PS13, Line 305: User
Do you mean "Use" here (and below on L331)?


http://gerrit.cloudera.org:8080/#/c/7477/13/src/kudu/clock/ntp-test.cc@320
PS13, Line 320:   {
              :     const vector<HostPort> refs = { unsync_servers_refs[1] };
              :
              :     // Verify chronyd's client itself does not accept the set 
of of NTP sources.
              :     NO_FATALS(CheckNoNtpSource(refs));
              : 
              :     BuiltInNtp c({ unsync_servers_refs[1] });
              :     NO_FATALS(CheckInitUnsync(&c));
              :     NO_FATALS(CheckWallTimeUnavailable(&c));
              :   }
Some of this repetition could be elided via lambda.



--
To view, visit http://gerrit.cloudera.org:8080/7477
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ieb9eee9f0334236f39617492bd6f01304d1a0255
Gerrit-Change-Number: 7477
Gerrit-PatchSet: 16
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Brock Noland <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Comment-Date: Sat, 14 Sep 2019 00:19:33 +0000
Gerrit-HasComments: Yes

Reply via email to