[
https://issues.apache.org/jira/browse/HDDS-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-6307:
---------------------------------
Labels: pull-request-available (was: )
> Improve processing and memory efficiency of container reports
> -------------------------------------------------------------
>
> Key: HDDS-6307
> URL: https://issues.apache.org/jira/browse/HDDS-6307
> Project: Apache Ozone
> Issue Type: Improvement
> Components: SCM
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Labels: pull-request-available
>
> The container report handing code has some issues which this Jira intends to
> address.
> When handling full container reports, we make several copies of large sets to
> identify containers in SCM, but not reported by the DN (they have somehow got
> lost on the DNs).
> Looking at the current code:
> {code}
> synchronized (datanodeDetails) {
> final List<ContainerReplicaProto> replicas =
> containerReport.getReportsList();
> final Set<ContainerID> containersInSCM =
> nodeManager.getContainers(datanodeDetails);
> final Set<ContainerID> containersInDn = replicas.parallelStream()
> .map(ContainerReplicaProto::getContainerID)
> .map(ContainerID::valueOf).collect(Collectors.toSet());
> final Set<ContainerID> missingReplicas = new
> HashSet<>(containersInSCM);
> missingReplicas.removeAll(containersInDn);
> processContainerReplicas(datanodeDetails, replicas, publisher);
> processMissingReplicas(datanodeDetails, missingReplicas);
> /*
> * Update the latest set of containers for this datanode in
> * NodeManager
> */
> nodeManager.setContainers(datanodeDetails, containersInDn);
> containerManager.notifyContainerReportProcessing(true, true);
> }
> {code}
> The Set "containersInSCM" comes from NodeStateMap:
> {code}
> public Set<ContainerID> getContainers(UUID uuid)
> throws NodeNotFoundException {
> lock.readLock().lock();
> try {
> checkIfNodeExist(uuid);
> return Collections
> .unmodifiableSet(new HashSet<>(nodeToContainer.get(uuid)));
> } finally {
> lock.readLock().unlock();
> }
> }
> {code}
> This returns a new copy of the set, so there is no need to wrap it
> UnModifiable. This means we can avoid copying it again in the report Handler.
> The current code ends up with 3 copies of this potentially large set.
> Next we take the FCR and stream it into a set of ContainerID. This is used
> for two purposes - 1, to subract from "containersInSCM" to yield the missing
> containers. 2, to replace the list already in nodeMnager.
> We can avoid this second large set, by simply adding each ContainerID to
> nodeManager if it is not already there and then remove any "missing" from
> nodeManager at the end. This also avoids replacing the entire set of
> ContainersIDs and all the tenured objects associated with it with new
> effectively identical object instances.
> Finally we can process the replicas at the same time, and re-use the
> ContainerID object, which we currently form 2 times. We store one copy in the
> nodeManager, and then another distinct copy in each ContainerReplica.
> I checked how many ContainerID objects are present in SCM, by loading 10
> closed containers with 3 replicas, and capturing a heap histogram with jmap:
> {code}
> bash-4.2$ jmap -histo:live 8 | grep ContainerID
> 262: 81 1944
> org.apache.hadoop.hdds.scm.container.ContainerID
> 301: 30 1440
> org.apache.hadoop.hdds.scm.container.ContainerReplica
> 419: 10 800
> org.apache.hadoop.hdds.scm.container.ContainerInfo
> {code}
> There are 10 Containers as expected, 30 replicas, but 81 ContainerID objects.
> Ideally there should be 10, to match the number of Containers, but some of
> these may be created from other code paths, such as pipeline creation.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]