jiajunwang commented on a change in pull request #365: Fix RoutingTableProvider statePropagationLatency metric reporting bug
URL: https://github.com/apache/helix/pull/365#discussion_r309558098
 
 

 ##########
 File path: helix-core/src/main/java/org/apache/helix/common/caches/CurrentStateSnapshot.java
 ##########
 @@ -32,18 +37,32 @@ public CurrentStateSnapshot(final Map<PropertyKey, CurrentState> currentStateMap
     if (_updatedStateKeys != null && _prevStateMap != null) {
       // Note if the prev state map is empty, this is the first time refresh.
       // So the update is not considered as "recent" change.
+      int driftCnt = 0; // clock drift count for comparing timestamp
       for (PropertyKey propertyKey : _updatedStateKeys) {
         CurrentState prevState = _prevStateMap.get(propertyKey);
         CurrentState curState = _properties.get(propertyKey);
 
         Map<String, Long> partitionUpdateEndTimes = null;
         for (String partition : curState.getPartitionStateMap().keySet()) {
           long newEndTime = curState.getEndTime(partition);
-          if (prevState == null || prevState.getEndTime(partition) < newEndTime) {
+          if (prevState == null
+              || prevState.getEndTime(partition) < newEndTime && prevState.getEndTime(partition) != -1) {
             if (partitionUpdateEndTimes == null) {
               partitionUpdateEndTimes = new HashMap<>();
             }
             partitionUpdateEndTimes.put(partition, newEndTime);
+          } else if (prevState != null && prevState.getEndTime(partition) > newEndTime) {
+            // This can happen due to clock drift.
+            // updatedStateKeys is the path to resource in an instance config.
+            // Thus, the space of inner loop is Sigma{replica(i) * partition(i)}; i over all resources in the cluster
+            // This space can be large. In order not to print two many lines, we print first warning for the first case.
+            // If clock drift turns out to be common, we can consider print out more logs, or expose an metric.
+            if (driftCnt < 1) {
 
 Review comment:
   Sorry, I'm still not convinced. My earlier question was: if we debug with 
this log and confirm there is clock drift, what can we do? Most likely fix NTP, 
right? If so, why not monitor NTP status and alert on that? That seems more 
direct and efficient.
   
   Moreover, with this check, what if the drift is in the other direction? We 
would just take the result and record it, right? I don't think we can do this 
well enough here. This concern (if it really exists) should be addressed as a 
separate initiative.
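
   To illustrate the "other direction" point, here is a minimal sketch, not 
the Helix code itself. `DriftCheckSketch`, `Outcome`, and `classify` are 
hypothetical names, and the method only mirrors the shape of the two 
timestamp comparisons in the diff (the `prevState == null` case is left out). 
It suggests that an end time skewed backward is flagged, while one skewed 
forward by the same amount is recorded as a normal update.

```java
public class DriftCheckSketch {
    /** Hypothetical outcome labels for illustration; not part of Helix. */
    enum Outcome { RECORDED, FLAGGED_AS_DRIFT, IGNORED }

    // Mirrors the comparisons in the diff: a newer end time is recorded
    // (unless the previous value is -1), an older one is flagged as possible
    // clock drift, and anything else is ignored.
    static Outcome classify(long prevEndTime, long newEndTime) {
        if (prevEndTime < newEndTime && prevEndTime != -1) {
            return Outcome.RECORDED;
        } else if (prevEndTime > newEndTime) {
            return Outcome.FLAGGED_AS_DRIFT;
        }
        return Outcome.IGNORED;
    }

    public static void main(String[] args) {
        long prevEndTime = 1_000L;  // end time seen in the previous snapshot
        long trueEndTime = 1_050L;  // what a correct clock would have written
        long skew = 200L;           // assumed clock error on the writer

        // Writer clock is behind: the new end time appears older than the previous one.
        System.out.println(classify(prevEndTime, trueEndTime - skew));
        // Writer clock is ahead: the new end time looks like an ordinary update.
        System.out.println(classify(prevEndTime, trueEndTime + skew));
    }
}
```

   Under these assumptions the forward-skewed value passes the check 
unnoticed, so the latency computed from it would still be off by the skew.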
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
