d-c-manning commented on a change in pull request #2762:
URL: https://github.com/apache/hbase/pull/2762#discussion_r541445488



##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs

Review comment:
       500 seems high to me. Would 20 be enough? Should this be a constant in 
the file, even if not configurable?

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + 
state.getState().name()
+        );
+      }
+      counter += 1;
     }
     if (this.metricsAssignmentManager != null) {
       this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
       this.metricsAssignmentManager.updateRITCount(totalRITs);
       
this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+      LOG.debug("Oldest RIT hashes and states: " + 
oldestRITHashesAndStates.toString());
+      long time = EnvironmentEdgeManager.currentTime();
+      if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) {

Review comment:
       Why do we update the metrics hashes only if it's been more than 
`ritThreshold / 2` time since last update? If we do the work to find the 
oldest, it seems like we should update the metrics always. Then any query to 
get metrics will always have the most recent results (~3 seconds old at max, 
with default `hbase.regionserver.msginterval`)
   
   Though I do think it's a good idea for limiting the logging. Should the 
`LOG.debug` statement be in here?
   
   Then this statement may deserve a comment.
   ```suggestion
         // Only log if it has been long enough since the last update (default 
30 seconds)
         if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) {
   ```

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3538,9 +3541,11 @@ public void updateRegionsInTransitionMetrics() {
     int totalRITs = 0;
     int totalRITsOverThreshold = 0;
     long oldestRITTime = 0;
+    Set<String> oldestRITHashesAndStates = Sets.newHashSet(); // set of <rit 
hash>:<rit state>
     int ritThreshold = this.server.getConfiguration().
       getInt(HConstants.METRICS_RIT_STUCK_WARNING_THRESHOLD, 60000);
-    for (RegionState state: regionStates.getRegionsInTransition()) {
+    int counter = 0;
+    for (RegionState state: 
regionStates.getRegionsInTransitionOrderedByDuration()) {

Review comment:
       since this method uses the `state.getStamp()` to determine whether the 
RIT is over the threshold or the oldest time... should we just use 
`getRegionsInTransitionOrderedByTimestamp` instead?
   
   This new method `getRegionsInTransitionOrderedByDuration` is tracking total 
duration instead of duration since last state transition, so the longest RITs 
may not necessarily be those that are reported over threshold. I think it's 
probably good to be consistent... I'm not sure which way is better to be 
consistent, but since these metrics in this method are reporting time since 
last change instead of total duration, I'd favor that approach.

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + 
state.getState().name()
+        );
+      }
+      counter += 1;
     }
     if (this.metricsAssignmentManager != null) {
       this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
       this.metricsAssignmentManager.updateRITCount(totalRITs);
       
this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+      LOG.debug("Oldest RIT hashes and states: " + 
oldestRITHashesAndStates.toString());

Review comment:
       we should only log if `oldestRITHashesAndStates` is non-empty.

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + 
state.getState().name()
+        );
+      }
+      counter += 1;
     }
     if (this.metricsAssignmentManager != null) {
       this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
       this.metricsAssignmentManager.updateRITCount(totalRITs);
       
this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+      LOG.debug("Oldest RIT hashes and states: " + 
oldestRITHashesAndStates.toString());
+      long time = EnvironmentEdgeManager.currentTime();

Review comment:
       can we just reuse `currentTime` here from the beginning of the method?

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + 
state.getState().name()

Review comment:
       Do we have to use `name()` as I see other references use 
`state.getState()` directly.
   
   region names could be quite long - when you refer to hash, did you want the 
md5 encoded name?
   ```suggestion
             state.getRegion().getEncodedName() + ":" + state.getState()
   ```
   Or should the entire region name be available?

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(

Review comment:
       should we also filter these to only those RITs that are over the 
threshold? Or do we want to display every RIT, even if it's only been 
transitioning for 100ms, if it's the longest/only RIT in the cluster?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to