kaisun2000 commented on a change in pull request #365: Fix RoutingTableProvider statePropagationLatency metric reporting bug
URL: https://github.com/apache/helix/pull/365#discussion_r308896237
##########
File path: helix-core/src/main/java/org/apache/helix/common/caches/CurrentStateSnapshot.java
##########

```diff
@@ -32,18 +37,32 @@ public CurrentStateSnapshot(final Map<PropertyKey, CurrentState> currentStateMap
     if (_updatedStateKeys != null && _prevStateMap != null) {
       // Note if the prev state map is empty, this is the first time refresh.
       // So the update is not considered as a "recent" change.
+      int driftCnt = 0; // clock drift count for comparing timestamps
       for (PropertyKey propertyKey : _updatedStateKeys) {
         CurrentState prevState = _prevStateMap.get(propertyKey);
         CurrentState curState = _properties.get(propertyKey);
         Map<String, Long> partitionUpdateEndTimes = null;
         for (String partition : curState.getPartitionStateMap().keySet()) {
           long newEndTime = curState.getEndTime(partition);
-          if (prevState == null || prevState.getEndTime(partition) < newEndTime) {
+          if (prevState == null
+              || prevState.getEndTime(partition) < newEndTime && prevState.getEndTime(partition) != -1) {
             if (partitionUpdateEndTimes == null) {
               partitionUpdateEndTimes = new HashMap<>();
             }
             partitionUpdateEndTimes.put(partition, newEndTime);
+          } else if (prevState != null && prevState.getEndTime(partition) > newEndTime) {
+            // This can happen due to clock drift.
+            // updatedStateKeys is the path to a resource in an instance config.
+            // Thus, the size of the inner loop is Sigma{replica(i) * partition(i)}, with i over all resources in the cluster.
+            // This space can be large. In order not to print too many lines, we print a warning only for the first case.
+            // If clock drift turns out to be common, we can consider printing more logs, or exposing a metric.
+            if (driftCnt < 1) {
```

Review comment:

1/ A boolean is enough, but what if you later decide to log once for every 100 such events? In that case we can easily change the check to `if (driftCnt % 100 == 0)`.

2/ 3/ Currently, production has an NTP (Network Time Protocol) daemon running, which syncs clocks to within roughly 10 ms inside the same data center.
That is fine for us. I think your concern about the log being "scary" to customers is valid. On the other hand, if we don't log this warning and there are "huge spikes" in the statePropagationDelay metric due to clock drift (say, an NTP config that got messed up), we would still need to troubleshoot, but this time without any clue. In fact, the database folks actually asked about the "huge spikes" in statePropagationDelay that turned out to be caused by the "-1" issue fixed here. Weighing the pros and cons on both sides, it seems that adding this log is still the lesser "evil"? Let me know what your take is here.
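As a minimal sketch of the throttled-logging idea from point 1/ (the class name, method, and the interval of 100 are illustrative assumptions, not the actual Helix code), the counter-based check logs the first drift event and then every 100th one, rather than only the first:

```java
// Hypothetical sketch of rate-limited clock-drift logging.
// Not the Helix implementation; names and the interval are assumed for illustration.
public class DriftLogThrottle {
  private static final int LOG_INTERVAL = 100; // assumed: log every 100th drift event
  private int driftCnt = 0;

  /**
   * Records one observed clock-drift event.
   * Returns true when this event should be logged (the 1st, 101st, 201st, ...).
   */
  public boolean recordDrift() {
    boolean shouldLog = (driftCnt % LOG_INTERVAL == 0);
    driftCnt++;
    return shouldLog;
  }

  /** Total drift events seen so far; could back a metric if drift turns out to be common. */
  public int driftCount() {
    return driftCnt;
  }
}
```

The advantage over a plain boolean flag is exactly what the reviewer notes: switching from "log once" to "log every Nth" is a one-line change to the modulus check, and the counter itself is already a candidate value to expose as a metric.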