[GitHub] [helix] jiajunwang commented on a change in pull request #365: Fix RoutingTableProvider statePropagationLatency metric reporting bug

GitBox Mon, 29 Jul 2019 23:30:28 -0700

jiajunwang commented on a change in pull request #365: Fix RoutingTableProvider 
statePropagationLatency metric reporting bug
URL: https://github.com/apache/helix/pull/365#discussion_r308552201


 ##########
 File path: helix-core/src/main/java/org/apache/helix/common/caches/CurrentStateSnapshot.java
 ##########
 @@ -32,18 +37,32 @@ public CurrentStateSnapshot(final Map<PropertyKey, CurrentState> currentStateMap
     if (_updatedStateKeys != null && _prevStateMap != null) {
       // Note if the prev state map is empty, this is the first time refresh.
       // So the update is not considered as "recent" change.
+      int driftCnt = 0; // clock drift count for comparing timestamp
       for (PropertyKey propertyKey : _updatedStateKeys) {
         CurrentState prevState = _prevStateMap.get(propertyKey);
         CurrentState curState = _properties.get(propertyKey);
 
         Map<String, Long> partitionUpdateEndTimes = null;
         for (String partition : curState.getPartitionStateMap().keySet()) {
           long newEndTime = curState.getEndTime(partition);
-          if (prevState == null || prevState.getEndTime(partition) < newEndTime) {
+          if (prevState == null
+              || prevState.getEndTime(partition) < newEndTime && prevState.getEndTime(partition) != -1) {
             if (partitionUpdateEndTimes == null) {
               partitionUpdateEndTimes = new HashMap<>();
             }
             partitionUpdateEndTimes.put(partition, newEndTime);
+          } else if (prevState != null && prevState.getEndTime(partition) > newEndTime) {
+            // This can happen due to clock drift.
+            // updatedStateKeys is the path to resource in an instance config.
+            // Thus, the space of inner loop is Sigma{replica(i) * partition(i)}; i over all resources in the cluster
+            // This space can be large. In order not to print two many lines, we print first warning for the first case.
+            // If clock drift turns out to be common, we can consider print out more logs, or expose an metric.
+            if (driftCnt < 1) {
 
 Review comment:
   Several comments here.
   1. Wouldn't a boolean be enough for driftCnt?
   2. Drift here means that two participants have different clocks, right? In
that case, I strongly prefer writing a separate tool to measure it instead of
piggybacking on the Helix participant.
   Customers may ask us about any warning message. Since this one might look
very scary to some of them, if we add it we should be prepared to explain what
the message means and why we need it.
   Moreover, this only samples the time drift situation. It will not give us
the overall picture.
   3. In a distributed system, a certain amount of time drift is expected, so
ideally we should be able to tolerate it. If we do find this warning in our
log, what should we do? If there is no next step, this log won't provide much
value.
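
   To make point 1 concrete, here is a minimal sketch of the same comparison
with a boolean flag in place of the counter. It is not the PR's code: the class
DriftCheckSketch, the Result holder, and the plain Map<String, Long> inputs
(partition name to end time, with -1 meaning "no end time recorded") are
hypothetical stand-ins for CurrentState, and System.err stands in for the real
logger.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: warn once per refresh when an end time moves backwards. */
public class DriftCheckSketch {

  /** Hypothetical holder for the outcome of one comparison pass. */
  public static final class Result {
    public final Map<String, Long> updatedEndTimes = new HashMap<>();
    public boolean driftSeen = false; // boolean flag instead of an int counter
    public int warningsLogged = 0;    // kept only so the sketch can be checked
  }

  /**
   * Compares previous and current end times per partition.
   * prevEndTimes may be null, which stands for "no previous state".
   */
  public static Result compare(Map<String, Long> prevEndTimes,
                               Map<String, Long> curEndTimes) {
    Result result = new Result();
    for (Map.Entry<String, Long> entry : curEndTimes.entrySet()) {
      String partition = entry.getKey();
      long newEndTime = entry.getValue();
      // Assumption: a missing previous end time is represented as -1.
      long prevEndTime =
          prevEndTimes == null ? -1L : prevEndTimes.getOrDefault(partition, -1L);

      if (prevEndTimes == null || (prevEndTime != -1L && prevEndTime < newEndTime)) {
        // End time advanced: record it as a recent update.
        result.updatedEndTimes.put(partition, newEndTime);
      } else if (prevEndTime > newEndTime) {
        // End time went backwards, possibly due to clock drift.
        if (!result.driftSeen) {
          result.driftSeen = true;
          result.warningsLogged++;
          System.err.println("End time of partition " + partition
              + " moved backwards: " + prevEndTime + " -> " + newEndTime);
        }
      }
    }
    return result;
  }
}
```

   With this shape the warning is printed at most once per call no matter how
many partitions regress, which is all the counter was used for. Whether the
warning should exist at all is the question raised in points 2 and 3.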

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
