[
https://issues.apache.org/jira/browse/HDFS-16040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351312#comment-17351312
]
Simbarashe Dzinamarira commented on HDFS-16040:
-----------------------------------------------
In Server.java we have
{noformat}
public void run() {
while (running) {
...
boolean connDropped = true;
try {
call = callQueue.take(); // pop the queue; maybe blocked here
...
if (alignmentContext != null && call.isCallCoordinated() &&
call.getClientStateId() > alignmentContext.getLastSeenStateId()) {
...
requeueCall(call);
continue;
}
...
} catch (InterruptedException e) {
...
} catch (Exception e) {
...
} finally {
...
if (call != null) {
updateMetrics(call, startTimeNanos, connDropped);
...
}
}
}
}
void updateMetrics(Call call, long startTime, boolean connDropped) {
...
rpcMetrics.addRpcQueueTime(queueTime);
if (call.isResponseDeferred() || connDropped) {
// call was skipped; don't include it in processing metrics
return;
}
...
rpcMetrics.addRpcProcessingTime(processingTime);
...
}{noformat}
*connDropped* is still true when updateMetrics is called due after the
*continue* statement so processing time is not updated while RpcQueueTime is.
A simple fix is to set *call = null* after the re-queue. We'll need a unit test
for this as well.
> RpcQueueTime metric counts requeued calls as unique events.
> -----------------------------------------------------------
>
> Key: HDFS-16040
> URL: https://issues.apache.org/jira/browse/HDFS-16040
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.10.0, 3.3.0
> Reporter: Simbarashe Dzinamarira
> Assignee: Simbarashe Dzinamarira
> Priority: Major
>
> The RpcQueueTime metric is updated every time a call is re-queued while
> waiting for the server state to reach the call's client's state ID. This is
> in contrast to RpcProcessingTime which is only updated when the call if
> finally processed.
> On the Observer NameNode this can result in RpcQueueTimeNumOps being much
> larger than RpcProcessingTimeNumOps. The re-queueing is an internal
> optimization to avoid blocking and shouldn't result in an inflated metric.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]