Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11052 )
Change subject: IMPALA-6644: Add last heartbeat timestamp into Statestore metric ...................................................................... Patch Set 5: (10 comments) http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.h File be/src/statestore/statestore.h: http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.h@444 PS5, Line 444: int64_t last_heartbeat_ts_; nit: technically this should be atomic, right? The "monitor" thread is accessing this variable only under 'subscribers_lock_', but this variable gets *set* from the heartbeat-sending thread which doesn't hold that lock, if I read the code correctly. In practice it probably doesn't matter since it's just an int and x86 has a strong enough memory model, but I think it would be better to use a std::atomic<int64_t> here so that tools like TSAN don't complain. http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.h@687 PS5, Line 687: [[noreturn]] void MonitorSubscriberHeartbeat(); the 'MainLoop' call seems to indicate that it would run until some exit flag is set, but I don't see the exit flag. So, i'm not 100% sure: is it possible that a StateStore object ever destructs? If it did, I think this thread would keep on running and then crash as it accessed the freed memory of the StateStore object. If we believe that StateStore can never destruct, can we add a destructor which does something like a LOG(FATAL) << "Cannot shut down StateStore object once started" in the case that it sees that Init() was called? That way if someone tries to destruct a StateStore we'll get an obvious crash instead of a tricky-to-debug one http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc File be/src/statestore/statestore.cc: http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@78 PS5, Line 78: hearbeats nit: typo http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@315 PS5, Line 315: last_heartbeat_ts_(0) { is there any race here when a subscriber first registers before it sends a heartbeat where it will report as something like 48 years since heartbeat? Perhaps this should be set to the current time instead? http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@389 PS5, Line 389: DCHECK_GT(timestamp_ms, last_heartbeat_ts_); if these are wall times, it's possible for them to go backwards if the host clock resets. Perhaps we should be using monotonic times here to avoid this issue? http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@542 PS5, Line 542: Value last_heartbeat_timestamp( for the purposes of the user understanding this data, I'd think it would be more usable to output the number of seconds since the heartbeat, instead of the last heartbeat time. It's much easier to scan down a list of 100 subscribers and find the one with a big value, instead of trying to scan down a list of timestamps to find the one that is farthest in the past. http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@984 PS5, Line 984: int num_subscribers, num_active_subscribers; You don't gain any real performance by declaring these outside of the loop, and I think you lose some clarity. For the ints, I don't think it makes a performance difference at all, since the compiler will just allocate stack slots for them either way, and there is no constructor/destructor. For the vector, in theory there might be some gain, but given this loop only runs once a minute I don't think saving potentially 1 tiny allocation matters. Instead, could you move these down to inside the for loop so it's clear they don't cross from one loop iteration to the next? I think doing so will remove some lines of code and make it more obvious that only 'previous_timestamp' is carried from one iteration to the next. http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@990 PS5, Line 990: num_subscribers = subscribers_.size(); maybe it would be clearer to not track this incrementally, but instead, just before the log statement, do: int num_active_subscrbers = num_subscribers - inactive_subscribers.size(); since it would reduce some extra branching and mental overhead http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@992 PS5, Line 992: SubscriberMap::value_type nit: i think using 'const auto&' is just as good here http://gerrit.cloudera.org:8080/#/c/11052/5/be/src/statestore/statestore.cc@998 PS5, Line 998: LOG(INFO) << num_active_subscribers << "/" << num_subscribers is it worth escalating this to WARN level if there are any inactive? -- To view, visit http://gerrit.cloudera.org:8080/11052 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I754adccc4569e8219d5d01500cccdfc8782953f7 Gerrit-Change-Number: 11052 Gerrit-PatchSet: 5 Gerrit-Owner: Pooja Nilangekar <[email protected]> Gerrit-Reviewer: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Pooja Nilangekar <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Wed, 08 Aug 2018 00:48:47 +0000 Gerrit-HasComments: Yes
