[ 
https://issues.apache.org/jira/browse/HDDS-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell resolved HDDS-6314.
-------------------------------------
    Fix Version/s: 1.3.0
       Resolution: Fixed

> ConcurrentModificationException getting SCM Container Metrics
> -------------------------------------------------------------
>
>                 Key: HDDS-6314
>                 URL: https://issues.apache.org/jira/browse/HDDS-6314
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Stephen O'Donnell
>            Assignee: Aswin Shakil Balasubramanian
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.3.0
>
>
> Benchmarking the ContainerReports, I noticed the following 
> ConcurrentModificationException in the SCM logs. As I understand it, this 
> will happen anytime the metrics are built and a new container is added at the 
> same time:
> {code:java}
> scm_1       | 2022-02-13 21:32:40,505 [Timer for 'StorageContainerManager' 
> metrics system] ERROR impl.MetricsSourceAdapter: Error getting metrics from 
> source SCMContainerMetrics
> scm_1       | java.util.ConcurrentModificationException
> scm_1       |     at 
> java.base/java.util.TreeMap$KeySpliterator.forEachRemaining(TreeMap.java:2750)
> scm_1       |     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> scm_1       |     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> scm_1       |     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> scm_1       |     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> scm_1       |     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> scm_1       |     at 
> org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getContainers(ContainerManagerImpl.java:172)
> scm_1       |     at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.getContainerStateCount(StorageContainerManager.java:1764)
> scm_1       |     at 
> org.apache.hadoop.hdds.scm.server.SCMContainerMetrics.getMetrics(SCMContainerMetrics.java:68)
> scm_1       |     at 
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
> scm_1       |     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.snapshotMetrics(MetricsSystemImpl.java:423)
> scm_1       |     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.sampleMetrics(MetricsSystemImpl.java:410)
> scm_1       |     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.onTimerEvent(MetricsSystemImpl.java:385)
> scm_1       |     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4.run(MetricsSystemImpl.java:372)
> scm_1       |     at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
> scm_1       |     at java.base/java.util.TimerThread.run(Timer.java:506)
> {code}
> The metrics code calls into ContainerManagerImpl.getContainers:
> {code:java}
>   @Override
>   public List<ContainerInfo> getContainers(final LifeCycleState state) {
>     return containerStateManager.getContainerIDs(state).stream()
>         .map(ContainerID::getProtobuf)
>         .map(containerStateManager::getContainer)
>         .filter(Objects::nonNull).collect(Collectors.toList());
>   }
> {code}
> This is quite an expensive operation for any containers where there are a 
> large number in a state (eg CLOSED containers), as we take a TreeSet stored 
> in ContainerStateManager, map the ContainerID to proto, only for it to be 
> mapped back to a ContainerID again inside 
> containerStateManager::getContainer. Then it checks if the container exists 
> and pulls the results into a new list. There the metrics manager just calls 
> size on it.
> This work does not seem necessary, as ContainerStateMap maintains the 
> structures for a container under a lock and hence if the container is in the 
> state map, it must be in the container map too.
> To avoid this seemingly expensive operation on a large cluster with millions 
> of containers, we should probably provide a `getContainerStateCount(final 
> LifeCycleState state)` API into ContainerManager where it ultimately calls 
> `size` on the relevant map inside ContainerStateMap.
> However the two getContainers(...) methods in ContainerManagerImpl are not 
> threadsafe as they stand, and are used in other places (eg from the CLI to 
> list containers), so they need fixed too.
> Additionally, in SCMClientProtocolServer.listContainer, the code does not 
> seem to use the fact that the "container lists by state" are already sorted 
> and in TreeMaps, so they could perform the paging much more efficiently. I 
> fear we are potentially forming some very large lists twice per call, only to 
> return a small slice of the list each time. I have raised HDDS-6315 for this 
> issue, to keep the issues separate.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to