kaisun2000 commented on a change in pull request #365: Fix RoutingTableProvider statePropagationLatency metric reporting bug
URL: https://github.com/apache/helix/pull/365#discussion_r308896237
##########
File path: helix-core/src/main/java/org/apache/helix/common/caches/CurrentStateSnapshot.java
##########

```diff
@@ -32,18 +37,32 @@ public CurrentStateSnapshot(final Map<PropertyKey, CurrentState> currentStateMap
     if (_updatedStateKeys != null && _prevStateMap != null) {
       // Note if the prev state map is empty, this is the first time refresh.
       // So the update is not considered as a "recent" change.
+      int driftCnt = 0; // clock drift count for comparing timestamps
       for (PropertyKey propertyKey : _updatedStateKeys) {
         CurrentState prevState = _prevStateMap.get(propertyKey);
         CurrentState curState = _properties.get(propertyKey);
         Map<String, Long> partitionUpdateEndTimes = null;
         for (String partition : curState.getPartitionStateMap().keySet()) {
           long newEndTime = curState.getEndTime(partition);
-          if (prevState == null || prevState.getEndTime(partition) < newEndTime) {
+          if (prevState == null
+              || prevState.getEndTime(partition) < newEndTime && prevState.getEndTime(partition) != -1) {
             if (partitionUpdateEndTimes == null) {
               partitionUpdateEndTimes = new HashMap<>();
             }
             partitionUpdateEndTimes.put(partition, newEndTime);
+          } else if (prevState != null && prevState.getEndTime(partition) > newEndTime) {
+            // This can happen due to clock drift.
+            // updatedStateKeys is the path to a resource in an instance config.
+            // Thus, the size of the inner loop is Sigma{replica(i) * partition(i)}, with i over all resources in the cluster.
+            // This space can be large. In order not to print too many lines, we print a warning only for the first case.
+            // If clock drift turns out to be common, we can consider printing more logs, or exposing a metric.
+            if (driftCnt < 1) {
```

Review comment:

1/ A boolean is enough, but what if you later decide to log once for every 100 such events? In that case we can easily change the check to `if (driftCnt % 100 == 0)`.

2/ 3/ Currently, production has an NTP (Network Time Protocol) daemon running, which syncs clocks to within roughly 10 ms inside the same data center.
That is fine for us. I think your concern about the log being "scary" to customers is valid. On the other hand, if we don't log this warning and there are "huge spikes" in the statePropagationDelay metric due to clock drift (say, an NTP config that got messed up), we would still need to troubleshoot, but this time without any clue. In fact, the database folks actually asked about the "huge spikes" in statePropagationDelay that turned out to be caused by the "-1" issue fixed here. Weighing the pros and cons on both sides, it seems that adding this log is still the lesser "evil"? Let me know what your take is here.
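As a minimal sketch of the throttled-logging idea from point 1/ (the class name, method, and the interval of 100 are illustrative assumptions, not the actual Helix code), the counter-based check logs the first drift event and then every 100th one, rather than only the first:

```java
// Hypothetical sketch of rate-limited clock-drift logging.
// Not the Helix implementation; names and the interval are assumed for illustration.
public class DriftLogThrottle {
  private static final int LOG_INTERVAL = 100; // assumed: log every 100th drift event
  private int driftCnt = 0;

  /**
   * Records one observed clock-drift event.
   * Returns true when this event should be logged (the 1st, 101st, 201st, ...).
   */
  public boolean recordDrift() {
    boolean shouldLog = (driftCnt % LOG_INTERVAL == 0);
    driftCnt++;
    return shouldLog;
  }

  /** Total drift events seen so far; could back a metric if drift turns out to be common. */
  public int driftCount() {
    return driftCnt;
  }
}
```

The advantage over a plain boolean flag is exactly what the reviewer notes: switching from "log once" to "log every Nth" is a one-line change to the modulus check, and the counter itself is already a candidate value to expose as a metric.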