[ 
https://issues.apache.org/jira/browse/HDDS-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654520#comment-17654520
 ] 

Stephen O'Donnell commented on HDDS-7695:
-----------------------------------------

I think the code is working as intended, but it is confusing.

We have "sent" metrics for both EcReplicationCmdsSentTotal and 
EcReconstructionCmdsSentTotal. However, on completion or timeout we only have 
EcReplicationCmdsCompletedTotal and EcReplicationCmdsTimeoutTotal - there is 
no reconstruction completed / timeout metric. This is because we track 
completion in ContainerReplicaPendingOps, and all it sees is a replica that has 
been scheduled to be created. It doesn't know whether it's a simple copy or a 
reconstruction that is going to create it.

That can explain why "EcReplicationCmdsSentTotal=0" and 
"EcReplicationCmdsTimeoutTotal=765" - likely all these scheduled commands were 
actually reconstructions, as we have 571 of those sent.

Why then do we have more EC replications completed and timed out than commands 
scheduled (51 + 765 = 816, against 571 reconstructions sent)? An EC 
reconstruction can create multiple new replicas in a single command. It is 
counted as a single command when sent, but when it is completed in pending ops, 
it is counted once per replica. So we can schedule a reconstruction to create 
2 new replicas, and we will end up with 1 command sent and 2 in 
EcReplicationCmdsCompletedTotal.
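
To make the counting mismatch concrete, here is a toy model of it. This is a 
sketch under assumptions, not the actual Ozone code: ToyReplicationTracker and 
PendingAdd are hypothetical names, and only the metric names are taken from 
the issue. It shows one reconstruction command that creates 2 replicas being 
counted once on send and twice on completion.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PendingAdd:
    """Hypothetical pending-op record: a replica scheduled to be created.

    There is deliberately no field saying whether a plain copy or a
    reconstruction scheduled it, mirroring the limitation described above.
    """
    container_id: int
    replica_index: int


@dataclass
class ToyReplicationTracker:
    """Toy stand-in for the send path plus pending-op tracking (assumed shape)."""
    metrics: Counter = field(default_factory=Counter)
    pending: set = field(default_factory=set)

    def send_reconstruction(self, container_id, missing_indexes):
        # Counted once per command, however many replicas it will create.
        self.metrics["EcReconstructionCmdsSentTotal"] += 1
        for index in missing_indexes:
            self.pending.add(PendingAdd(container_id, index))

    def send_replication(self, container_id, replica_index):
        # A simple copy: one command, one replica.
        self.metrics["EcReplicationCmdsSentTotal"] += 1
        self.pending.add(PendingAdd(container_id, replica_index))

    def complete(self, container_id, replica_index):
        # Counted once per replica; the originating command type is unknown here.
        self.pending.remove(PendingAdd(container_id, replica_index))
        self.metrics["EcReplicationCmdsCompletedTotal"] += 1

    def expire(self, container_id, replica_index):
        # Timeouts are likewise counted once per pending replica.
        self.pending.remove(PendingAdd(container_id, replica_index))
        self.metrics["EcReplicationCmdsTimeoutTotal"] += 1


tracker = ToyReplicationTracker()
tracker.send_reconstruction(container_id=1, missing_indexes=[2, 5])
tracker.complete(container_id=1, replica_index=2)
tracker.complete(container_id=1, replica_index=5)

for name in ("EcReplicationCmdsSentTotal",
             "EcReconstructionCmdsSentTotal",
             "EcReplicationCmdsCompletedTotal"):
    print(name, tracker.metrics[name])
```

In this model EcReplicationCmdsSentTotal stays at 0 while 
EcReplicationCmdsCompletedTotal reaches 2 for a single reconstruction sent, 
which is the same shape as the numbers in the report.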

I don't see an easy way to fix this, as we don't track the sent commands, we 
just track the pending replicas. Perhaps the EcReplicationCmdsCompletedTotal 
metric could be changed to something like ECReplicationReplicasRecoveredTotal 
and similar for the timeout metric?

> EC metrics related to replication commands don't add up
> -------------------------------------------------------
>
>                 Key: HDDS-7695
>                 URL: https://issues.apache.org/jira/browse/HDDS-7695
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: ECOfflineRecovery, SCM
>    Affects Versions: 1.3.0
>            Reporter: Siddhant Sangwan
>            Priority: Major
>
> {code}
>     "EcReplicationCmdsSentTotal" : 0,
>     "EcDeletionCmdsSentTotal" : 259,
>     "EcReplicationCmdsCompletedTotal" : 51,
>     "EcDeletionCmdsCompletedTotal" : 51,
>     "EcReconstructionCmdsSentTotal" : 571,
>     "EcReplicationCmdsTimeoutTotal" : 765,
>     "EcDeletionCmdsTimeoutTotal" : 204
> {code}
> Total replication commands sent are 0, while timed out are 765.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
