sodonnel opened a new pull request #2975:
URL: https://github.com/apache/ozone/pull/2975
## What changes were proposed in this pull request?
Replication Manager periodically processes all the containers in SCM, and so
has a view of the health of the system. However, it does not count up the
replicas in each state, and hence that overview of the system health is not
easily visible.
This Jira adds a ReplicationManagerReport object, which ReplicationManager
can populate by incrementing various counters as it processes the containers.
The report allows the number of containers in each lifecycle state to be
counted, while also counting the number of containers in various health states,
e.g. under replicated, over replicated etc. The report also allows a sample of
the container IDs in each state to be stored (max of 100 per state),
and these can be extracted later for debugging (a later Jira will provide this
extra feature, probably via an Ozone Admin command).
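
To make the counting and sampling behaviour concrete, here is a minimal sketch of how such a report could work. It is not the class added in this PR: the class name `ReplicationReportSketch`, the `HealthState` enum values (inferred from the metric names below), the method names and the use of `long` for container IDs are all assumptions made for illustration. Only the limit of 100 sampled IDs per state comes from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch only; the real ReplicationManagerReport may differ.
class ReplicationReportSketch {

  // Assumed health states, inferred from the new metric names in this PR.
  enum HealthState {
    UNDER_REPLICATED, MIS_REPLICATED, OVER_REPLICATED, MISSING,
    UNHEALTHY, EMPTY, OPEN_UNHEALTHY, QUASI_CLOSED_STUCK
  }

  // Maximum number of container IDs kept per state (from the PR text).
  static final int SAMPLE_LIMIT = 100;

  private final Map<HealthState, LongAdder> counts =
      new EnumMap<>(HealthState.class);
  private final Map<HealthState, List<Long>> samples =
      new EnumMap<>(HealthState.class);

  ReplicationReportSketch() {
    // Pre-populate every state so lookups never return null.
    for (HealthState state : HealthState.values()) {
      counts.put(state, new LongAdder());
      samples.put(state, new ArrayList<>());
    }
  }

  // Always count the container; keep its ID only while the sample has room.
  void incrementAndSample(HealthState state, long containerId) {
    counts.get(state).increment();
    List<Long> sample = samples.get(state);
    synchronized (sample) {
      if (sample.size() < SAMPLE_LIMIT) {
        sample.add(containerId);
      }
    }
  }

  long getCount(HealthState state) {
    return counts.get(state).sum();
  }

  // Return a read-only copy so callers cannot modify the stored sample.
  List<Long> getSample(HealthState state) {
    List<Long> sample = samples.get(state);
    synchronized (sample) {
      return Collections.unmodifiableList(new ArrayList<>(sample));
    }
  }
}
```

In this sketch, calling `incrementAndSample(HealthState.UNDER_REPLICATED, id)` for 150 distinct containers would leave the count at 150 while the sample holds only the first 100 IDs. Lifecycle state counts could be tracked the same way with a second enum.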
The report is integrated with the ReplicationManagerMetrics class, and will
add metrics like the following:
```
{
"name" : "Hadoop:service=StorageContainerManager,name=ReplicationManagerMetrics",
"modelerType" : "ReplicationManagerMetrics",
"tag.Hostname" : "3212ea4aecc5",
"InflightReplication" : 0,
"InflightDeletion" : 0,
"InflightMove" : 0,
## New metrics from here
"NumOpenContainers" : 1,
"NumClosingContainers" : 0,
"NumQuasiClosedContainers" : 2,
"NumClosedContainers" : 0,
"NumDeletingContainers" : 0,
"NumDeletedContainers" : 0,
"NumUnderReplicatedContainers" : 0,
"NumMisReplicatedContainers" : 0,
"NumOverReplicatedContainers" : 0,
"NumMissingContainers" : 0,
"NumUnhealthyContainers" : 0,
"NumEmptyContainers" : 0,
"NumOpenUnhealthyContainers" : 0,
"NumStuckQuasiClosedContainers" : 0,
## end of new metrics
"NumReplicationCmdsSent" : 0,
"NumReplicationCmdsCompleted" : 0,
"NumReplicationCmdsTimeout" : 0,
"NumDeletionCmdsSent" : 0,
"NumDeletionCmdsCompleted" : 0,
"NumDeletionCmdsTimeout" : 0,
"NumReplicationBytesTotal" : 0,
"NumReplicationBytesCompleted" : 0,
"NumDeletionBytesTotal" : 0,
"NumDeletionBytesCompleted" : 0,
"ReplicationTimeNumOps" : 0,
"ReplicationTimeAvgTime" : 0.0,
"DeletionTimeNumOps" : 0,
"DeletionTimeAvgTime" : 0.0
}
```
Note that some of the above metrics (the ones for container LifeCycle state)
duplicate others available from a different source in SCM:
```
{
"name" : "Hadoop:service=StorageContainerManager,name=SCMContainerMetrics",
"modelerType" : "SCMContainerMetrics",
"tag.Hostname" : "92503dfbd5db",
"OpenContainers" : 0,
"ClosingContainers" : 0,
"QuasiClosedContainers" : 0,
"ClosedContainers" : 0,
"DeletingContainers" : 0,
"DeletedContainers" : 0,
"TotalContainers" : 0
}
```
I am open to removing these duplicates from the new ReplicationManager
metrics, but it may be useful to keep them: the ReplicationManager counts
are captured at a point in time and are calculated differently, and hence
may be helpful in debugging some problems. For that reason, it is still good to
capture them in the Report object, but it is debatable whether they should
be in the metrics or not.
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6170
## How was this patch tested?
New unit tests and validated in docker-compose environment.