jiajunwang commented on a change in pull request #1413:
URL: https://github.com/apache/helix/pull/1413#discussion_r500491326



##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -428,13 +471,30 @@ public synchronized void setIdealStates(List<IdealState> 
idealStates) {
   }
 
   /**
-   * Returns the LiveInstances for each of the instances that are curretnly up 
and running
+   * Returns the LiveInstances for each of the instances that are currently up 
and running
    * @return
    */
   public Map<String, LiveInstance> getLiveInstances() {
+    return getLiveInstances(true);
+  }
+
+  /**
+   * Returns the LiveInstances for each of the instances that are currently up 
and running
+   * @param excludeTimeoutInstances - Only effective during maintenance mode. 
If true, filter out
+   *                                instances that are timed-out during 
maintenance mode; instances
+   *                                are timed-out if they have been offline 
for a while before going
+   *                                live during maintenance mode.
+   * @return
+   */
+  public Map<String, LiveInstance> getLiveInstances(boolean 
excludeTimeoutInstances) {

Review comment:
       I suggest making it explicit.
   One method getAllLiveInstances(), another one something like 
getActiveLiveInstance().
   
   In the future, the later one can be extended to return different lists 
according to the conditions (maintenance mode is the first example).

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance 
is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor 
accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        
accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            
ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));

Review comment:
       Just put the logic into a non-static method getOnlineTimestamp()? Which 
directly returns List<Long>.

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -253,6 +257,10 @@ private void refreshIdealState(final HelixDataAccessor 
accessor,
   private void refreshLiveInstances(final HelixDataAccessor accessor,
       Set<HelixConstants.ChangeType> refreshedType) {
     if 
(_propertyDataChangedMap.get(HelixConstants.ChangeType.LIVE_INSTANCE).getAndSet(false))
 {
+      // Keep a snapshot of old live instances for maintenance mode
+      if (isMaintenanceModeEnabled()) {

Review comment:
       Please be careful about the cache update order. refreshLiveInstances is 
called before updateMaintenanceInfo. So I guess this map never works as 
expected. As I suggested, just remove this map to reduce the potential risk. 
ROI is low.

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor 
accessor) {
     // The following flag is to guarantee that there's only one update per 
pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) 
{
+    // If maintenance mode is enabled and timeout window is specified, filter 
'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {
+      for (String instance : _liveInstanceCache.getPropertyMap().keySet()) {
+        // 1. Check timed-out cache and don't do repeated work;
+        // 2. Check for nodes that didn't exist in the last iteration, because 
it has been checked;

Review comment:
       I guess this one is corresponding to the 
_liveInstanceSnapshotForMaintenance check. I prefer not to add this.
   1. If the cache update fails due to ZK issue, then we might have the 
_liveInstanceSnapshotForMaintenance updated but 
_timedOutInstanceDuringMaintenance is not fully calculated.
   2. The performance enhancement of using this map is not significant. Let's 
ensure it works fine first.

##########
File path: 
helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +183,73 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /**
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  public static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new 
SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + 
dateString);
+      return -1;
+    }
+  }
+
+  /**
+   * Parses a history date in millisecond to string.
+   */
+  public static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }
+
+  /**
+   * Parses the session entry map that has been converted to string back to a 
map.
+   * NOTE TO CALLER: This assumes the divider between entries is ", " and the 
divider between
+   * key/value is "="; if the string is malformed, parsing correctness is not 
guaranteed. Always
+   * check if a key is contained before using the key.
+   */
+  public static Map<String, String> parseSessionHistoryStringToMap(String 
sessionHistoryString) {
+    sessionHistoryString = sessionHistoryString.substring(1, 
sessionHistoryString.length() - 1);
+    Map<String, String> sessionHistoryMap = new HashMap<>();
+
+    for (String sessionHistoryKeyValuePair : sessionHistoryString.split(", ")) 
{
+      String[] keyValuePair = sessionHistoryKeyValuePair.split("=");
+      if (keyValuePair.length < 2) {
+        continue;

Review comment:
       nit, warning in these abnormal case for debugging.

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +782,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance 
is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor 
accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        
accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            
ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > 
_maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out 
since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            
ParticipantHistory.parseHistoryDateStringToLong(historyOfflineList.get(i));
+        // Ignore bad format case
+        if (tailOfflineTime != -1) {
+          if (tailOfflineTime <= onlineTimestamps.get(0)) {
+            break;
+          }
+          offlineTimestamps.add(0, tailOfflineTime);
+        }
+      }
+    }
+
+    // At this point, onlineTimestamps contains at least 1 timestamp that's 
before maintenance mode;
+    // offlineTimestamps contains 0+ timestamp that's > the first online 
timestamp.

Review comment:
       As I mentioned above, the current logic is too complicated than it needs 
to be. For a newly added node, the offline time is timestamp 0. So it is 
definitely timeout.

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance 
is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor 
accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        
accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            
ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > 
_maintenanceSignal.getTimestamp()) {

Review comment:
       I agree these nodes shall be excluded. But we are actually doing this 
for a different reason.
   Note that the logic is a node is offline for a long time and back online 
during the M mode. In this case, the new node previous offline time is 0. So it 
is definitely timed out. So we don't need this logic here. The later logic 
shall be able to cover this case.

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance 
is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor 
accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        
accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            
ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts

Review comment:
       nit, I think you need all the online timestamp before the 
_maintenanceSignal create time. Otherwise, you don't need this list. So the 
comment is not accurate.

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -299,6 +307,40 @@ private void updateMaintenanceInfo(final HelixDataAccessor 
accessor) {
     // The following flag is to guarantee that there's only one update per 
pineline run because we
     // check for whether maintenance recovery could happen twice every pipeline
     _hasMaintenanceSignalChanged = false;
+
+    // If maintenance mode has exited, clear cached timed-out nodes
+    if (!_isMaintenanceModeEnabled) {
+      _timedOutInstanceDuringMaintenance.clear();
+      _liveInstanceSnapshotForMaintenance.clear();
+    }
+  }
+
+  private void timeoutNodesDuringMaintenance(final HelixDataAccessor accessor) 
{
+    // If maintenance mode is enabled and timeout window is specified, filter 
'new' live nodes
+    // for timed-out nodes
+    long timeOutWindow = -1;
+    if (_clusterConfig != null) {
+      timeOutWindow = _clusterConfig.getOfflineNodeTimeOutForMaintenanceMode();
+    }
+    if (timeOutWindow >= 0 && isMaintenanceModeEnabled()) {

Review comment:
       To reduce the implicit dependencies between the refresh/update methods, 
please input the isMaintenanceModeEnabled as a parameter. So the caller is more 
likely to pass the refreshed value.
   
   Overall, we shall reduce referring to the private field directly to decouple 
methods and prevent potential bugs.

##########
File path: 
helix-core/src/main/java/org/apache/helix/model/ParticipantHistory.java
##########
@@ -153,16 +183,73 @@ private void updateOfflineHistory(long time) {
     if (list.size() == HISTORY_SIZE) {
       list.remove(0);
     }
-
-    DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss:SSS");
-    df.setTimeZone(TimeZone.getTimeZone("UTC"));
-    String dateTime = df.format(new Date(time));
-
-    list.add(dateTime);
+    list.add(historyDateLongToString(time));
   }
 
   @Override
   public boolean isValid() {
     return true;
   }
+
+  /**
+   * Parses a history date in string format to its millisecond representation.
+   * Returns -1 if parsing fails.
+   */
+  public static long historyDateStringToLong(String dateString) {
+    SimpleDateFormat simpleDateFormat = new 
SimpleDateFormat(HISTORY_DATE_FORMAT);
+    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
+    try {
+      Date date = simpleDateFormat.parse(dateString);
+      return date.getTime();
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse participant history date string: " + 
dateString);
+      return -1;
+    }
+  }
+
+  /**
+   * Parses a history date in millisecond to string.
+   */
+  public static String historyDateLongToString(long dateLong) {
+    DateFormat df = new SimpleDateFormat(HISTORY_DATE_FORMAT);
+    df.setTimeZone(TimeZone.getTimeZone("UTC"));
+    return df.format(new Date(dateLong));
+  }

Review comment:
       As commented above, these 2 methods can be private if we directly return 
the lists of Long

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance 
is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor 
accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        
accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {

Review comment:
       nit, the logic might be clearer for the readers if you just subtract the 
list and then read the last one. Otherwise, you have multiple lists with 
different orders. And get(0) in the following code could be a mystery for 
others.

##########
File path: 
helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -728,6 +788,100 @@ private void updateDisabledInstances() {
     }
   }
 
+  /*
+   * Check if the instance is timed-out during maintenance mode. An instance 
is timed-out if it has
+   * been offline for longer than the user defined timeout window.
+   * @param timeOutWindow - the timeout window; guaranteed to be non-negative
+   */
+  private boolean isInstanceTimedOutDuringMaintenance(HelixDataAccessor 
accessor, String instance,
+      long timeOutWindow) {
+    ParticipantHistory history =
+        
accessor.getProperty(accessor.keyBuilder().participantHistory(instance));
+    long currentTime = System.currentTimeMillis();
+
+    // Parse online timestamps from online list
+    List<Long> onlineTimestamps = new ArrayList<>();
+    List<String> historyOnlineList = history.getHistory();
+    if (historyOnlineList != null) {
+      long tailOnlineTime;
+      for (int i = historyOnlineList.size() - 1; i >= 0; i--) {
+        tailOnlineTime =
+            
ParticipantHistory.extractTimeFromSessionHistoryString(historyOnlineList.get(i));
+        // Ignore bad format case
+        if (tailOnlineTime != -1) {
+          onlineTimestamps.add(0, tailOnlineTime);
+          // We only need one online timestamp before maintenance mode starts
+          if (tailOnlineTime <= _maintenanceSignal.getTimestamp()) {
+            break;
+          }
+        }
+      }
+    }
+    if (onlineTimestamps.isEmpty() || onlineTimestamps.get(0) > 
_maintenanceSignal.getTimestamp()) {
+      // Node is a new node in this maintenance mode, no need to time out 
since no partitions are
+      // assigned to it
+      return false;
+    }
+
+    // Parse offline timestamps from offline list
+    List<Long> offlineTimestamps = new ArrayList<>();
+    List<String> historyOfflineList = history.getOffline();
+    if (historyOfflineList != null) {
+      long tailOfflineTime;
+      for (int i = historyOfflineList.size() - 1; i >= 0; i--) {
+        tailOfflineTime =
+            
ParticipantHistory.historyDateStringToLong(historyOfflineList.get(i));

Review comment:
       Same here, getOfflineTimestamp




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to