d-c-manning commented on a change in pull request #2762:
URL: https://github.com/apache/hbase/pull/2762#discussion_r541445488
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
if (oldestRITTime < ritTime) {
oldestRITTime = ritTime;
}
+ if (counter < 500) { // Record 500 oldest RITs
Review comment:
500 seems high to me. Would 20 be enough? Should this be a constant in
the file, even if not configurable?
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
if (oldestRITTime < ritTime) {
oldestRITTime = ritTime;
}
+ if (counter < 500) { // Record 500 oldest RITs
+ oldestRITHashesAndStates.add(
+ state.getRegion().getRegionNameAsString() + ":" +
state.getState().name()
+ );
+ }
+ counter += 1;
}
if (this.metricsAssignmentManager != null) {
this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
this.metricsAssignmentManager.updateRITCount(totalRITs);
this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+ LOG.debug("Oldest RIT hashes and states: " +
oldestRITHashesAndStates.toString());
+ long time = EnvironmentEdgeManager.currentTime();
+ if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) {
Review comment:
Why do we update the metrics hashes only if it's been more than
`ritThreshold / 2` time since last update? If we do the work to find the
oldest, it seems like we should update the metrics always. Then any query to
get metrics will always have the most recent results (~3 seconds old at max,
with default `hbase.regionserver.msginterval`)
Though I do think it's a good idea for limiting the logging. Should the
`LOG.debug` statement be in here?
Then this statement may deserve a comment.
```suggestion
// Only log if it has been long enough since the last update (default
30 seconds)
if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) {
```
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3538,9 +3541,11 @@ public void updateRegionsInTransitionMetrics() {
int totalRITs = 0;
int totalRITsOverThreshold = 0;
long oldestRITTime = 0;
+ Set<String> oldestRITHashesAndStates = Sets.newHashSet(); // set of <rit
hash>:<rit state>
int ritThreshold = this.server.getConfiguration().
getInt(HConstants.METRICS_RIT_STUCK_WARNING_THRESHOLD, 60000);
- for (RegionState state: regionStates.getRegionsInTransition()) {
+ int counter = 0;
+ for (RegionState state:
regionStates.getRegionsInTransitionOrderedByDuration()) {
Review comment:
since this method uses the `state.getStamp()` to determine whether the
RIT is over the threshold or the oldest time... should we just use
`getRegionsInTransitionOrderedByTimestamp` instead?
This new method `getRegionsInTransitionOrderedByDuration` is tracking total
duration instead of duration since last state transition, so the longest RITs
may not necessarily be those that are reported over threshold. I think it's
probably good to be consistent... I'm not sure which way is better to be
consistent, but since these metrics in this method are reporting time since
last change instead of total duration, I'd favor that approach.
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
if (oldestRITTime < ritTime) {
oldestRITTime = ritTime;
}
+ if (counter < 500) { // Record 500 oldest RITs
+ oldestRITHashesAndStates.add(
+ state.getRegion().getRegionNameAsString() + ":" +
state.getState().name()
+ );
+ }
+ counter += 1;
}
if (this.metricsAssignmentManager != null) {
this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
this.metricsAssignmentManager.updateRITCount(totalRITs);
this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+ LOG.debug("Oldest RIT hashes and states: " +
oldestRITHashesAndStates.toString());
Review comment:
we should only log if `oldestRITHashesAndStates` is non-empty.
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
if (oldestRITTime < ritTime) {
oldestRITTime = ritTime;
}
+ if (counter < 500) { // Record 500 oldest RITs
+ oldestRITHashesAndStates.add(
+ state.getRegion().getRegionNameAsString() + ":" +
state.getState().name()
+ );
+ }
+ counter += 1;
}
if (this.metricsAssignmentManager != null) {
this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
this.metricsAssignmentManager.updateRITCount(totalRITs);
this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+ LOG.debug("Oldest RIT hashes and states: " +
oldestRITHashesAndStates.toString());
+ long time = EnvironmentEdgeManager.currentTime();
Review comment:
can we just reuse `currentTime` here from the beginning of the method?
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
if (oldestRITTime < ritTime) {
oldestRITTime = ritTime;
}
+ if (counter < 500) { // Record 500 oldest RITs
+ oldestRITHashesAndStates.add(
+ state.getRegion().getRegionNameAsString() + ":" +
state.getState().name()
Review comment:
Do we have to use `name()` as I see other references use
`state.getState()` directly.
region names could be quite long - when you refer to hash, did you want the
md5 encoded name?
```suggestion
state.getRegion().getEncodedName() + ":" + state.getState()
```
Or should the entire region name be available?
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
if (oldestRITTime < ritTime) {
oldestRITTime = ritTime;
}
+ if (counter < 500) { // Record 500 oldest RITs
+ oldestRITHashesAndStates.add(
Review comment:
should we also filter these to only those RITs that are over the
threshold? Or do we want to display every RIT, even if it's only been
transitioning for 100ms, if it's the longest/only RIT in the cluster?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]