vli02 commented on code in PR #5081:
URL: https://github.com/apache/hbase/pull/5081#discussion_r1127118350
##########
hbase-client/src/main/java/org/apache/hadoop/hbase/client/MetricsConnection.java:
##########
@@ -443,6 +447,15 @@ protected Ratio getRatio() {
this.nsLookups = registry.counter(name(this.getClass(), NS_LOOKUPS,
scope));
this.nsLookupsFailed = registry.counter(name(this.getClass(),
NS_LOOKUPS_FAILED, scope));
+ this.userRegionLockTimeoutCount =
+ registry.counter(name(this.getClass(), "userRegionLockTimeoutCount",
scope));
+ this.userRegionLockWaitingTimer =
+ registry.timer(name(this.getClass(), "userRegionLockWaitingTimer",
scope));
+ this.userRegionLockHeldTimer =
+ registry.timer(name(this.getClass(), "userRegionLockHeldTimer", scope));
Review Comment:
Looks like all timer metrics having name with `DurationMs` in the end, maybe
we better do the same.
##########
hbase-client/src/main/java/org/apache/hadoop/hbase/client/MetricsConnection.java:
##########
@@ -598,6 +611,24 @@ public void incrementServerOverloadedBackoffTime(long
time, TimeUnit timeUnit) {
overloadedBackoffTimer.update(time, timeUnit);
}
+ /** incr */
Review Comment:
Yes, I have seen this test, too. Let me add the unit test soon.
##########
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java:
##########
@@ -1123,12 +1127,26 @@ rpcControllerFactory, getMetaLookupPool(),
connectionConfig.getMetaReadRpcTimeou
}
}
- void takeUserRegionLock() throws IOException {
+ private long heldStartTime;
+
+ void takeUserRegionLock(int tries) throws IOException {
try {
long waitTime = connectionConfig.getMetaOperationTimeout();
+ long waitStartTime = 0;
+ if (metrics != null) {
+ metrics.updateUserRegionLockQueue(userRegionLock.getQueueLength());
+ waitStartTime = EnvironmentEdgeManager.currentTime();
+ }
if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) {
+ if (metrics != null) {
+ metrics.incrUserRegionLockTimeout();
+ }
throw new LockTimeoutException("Failed to get user region lock in" +
waitTime + " ms. "
+ " for accessing meta region server.");
+ } else if (metrics != null) {
+ // successfully grabbed the lock, start timer of holding the lock
+ heldStartTime = EnvironmentEdgeManager.currentTime();
+ metrics.updateUserRegionLockWaiting(tries * waitTime + heldStartTime -
waitStartTime);
Review Comment:
You are right. The retries are actually coming from a higher layer, the
inside retry loop actually grab and release the lock every times. It seems not
straightforward for us to get the total time it has been taken for successfully
grabbing the lock for a rpc including the retry.
##########
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java:
##########
@@ -1123,12 +1127,26 @@ rpcControllerFactory, getMetaLookupPool(),
connectionConfig.getMetaReadRpcTimeou
}
}
- void takeUserRegionLock() throws IOException {
+ private long heldStartTime;
+
+ void takeUserRegionLock(int tries) throws IOException {
try {
long waitTime = connectionConfig.getMetaOperationTimeout();
+ long waitStartTime = 0;
+ if (metrics != null) {
+ metrics.updateUserRegionLockQueue(userRegionLock.getQueueLength());
+ waitStartTime = EnvironmentEdgeManager.currentTime();
+ }
if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) {
+ if (metrics != null) {
+ metrics.incrUserRegionLockTimeout();
+ }
throw new LockTimeoutException("Failed to get user region lock in" +
waitTime + " ms. "
+ " for accessing meta region server.");
+ } else if (metrics != null) {
+ // successfully grabbed the lock, start timer of holding the lock
+ heldStartTime = EnvironmentEdgeManager.currentTime();
+ metrics.updateUserRegionLockWaiting(tries * waitTime + heldStartTime -
waitStartTime);
Review Comment:
Make sense, let's just record the time for a single success of grabbing the
lock.
##########
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java:
##########
@@ -1123,12 +1127,26 @@ rpcControllerFactory, getMetaLookupPool(),
connectionConfig.getMetaReadRpcTimeou
}
}
- void takeUserRegionLock() throws IOException {
+ private long heldStartTime;
+
+ void takeUserRegionLock(int tries) throws IOException {
try {
long waitTime = connectionConfig.getMetaOperationTimeout();
+ long waitStartTime = 0;
+ if (metrics != null) {
+ metrics.updateUserRegionLockQueue(userRegionLock.getQueueLength());
+ waitStartTime = EnvironmentEdgeManager.currentTime();
+ }
if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) {
+ if (metrics != null) {
+ metrics.incrUserRegionLockTimeout();
+ }
throw new LockTimeoutException("Failed to get user region lock in" +
waitTime + " ms. "
+ " for accessing meta region server.");
+ } else if (metrics != null) {
+ // successfully grabbed the lock, start timer of holding the lock
+ heldStartTime = EnvironmentEdgeManager.currentTime();
+ metrics.updateUserRegionLockWaiting(tries * waitTime + heldStartTime -
waitStartTime);
Review Comment:
We have the metrics for number of failed/timeout grabbing the lock, this
would be helpful in the aspect of understanding the lock behavior.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]