[
https://issues.apache.org/jira/browse/HDDS-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen O'Donnell resolved HDDS-6314.
-------------------------------------
Fix Version/s: 1.3.0
Resolution: Fixed
> ConcurrentModificationException getting SCM Container Metrics
> -------------------------------------------------------------
>
> Key: HDDS-6314
> URL: https://issues.apache.org/jira/browse/HDDS-6314
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Stephen O'Donnell
> Assignee: Aswin Shakil Balasubramanian
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.3.0
>
>
> Benchmarking the ContainerReports, I noticed the following
> ConcurrentModificationException in the SCM logs. As I understand it, this
> can happen any time the metrics are built while a new container is being
> added at the same time:
> {code:java}
> scm_1 | 2022-02-13 21:32:40,505 [Timer for 'StorageContainerManager'
> metrics system] ERROR impl.MetricsSourceAdapter: Error getting metrics from
> source SCMContainerMetrics
> scm_1 | java.util.ConcurrentModificationException
> scm_1 | at
> java.base/java.util.TreeMap$KeySpliterator.forEachRemaining(TreeMap.java:2750)
> scm_1 | at
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> scm_1 | at
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> scm_1 | at
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> scm_1 | at
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> scm_1 | at
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> scm_1 | at
> org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getContainers(ContainerManagerImpl.java:172)
> scm_1 | at
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.getContainerStateCount(StorageContainerManager.java:1764)
> scm_1 | at
> org.apache.hadoop.hdds.scm.server.SCMContainerMetrics.getMetrics(SCMContainerMetrics.java:68)
> scm_1 | at
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
> scm_1 | at
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.snapshotMetrics(MetricsSystemImpl.java:423)
> scm_1 | at
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.sampleMetrics(MetricsSystemImpl.java:410)
> scm_1 | at
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.onTimerEvent(MetricsSystemImpl.java:385)
> scm_1 | at
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4.run(MetricsSystemImpl.java:372)
> scm_1 | at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
> scm_1 | at java.base/java.util.TimerThread.run(Timer.java:506)
> {code}
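> For illustration, here is a minimal standalone snippet (not Ozone code; the
> class and values are made up for the demo) showing the same failure mode:
> streaming a TreeMap's key set while another thread inserts into the map can
> throw ConcurrentModificationException, just as in the stack trace above.
> {code:java}
> import java.util.TreeMap;
> import java.util.stream.Collectors;
>
> public class TreeMapCmeDemo {
>   public static void main(String[] args) throws Exception {
>     TreeMap<Long, String> containers = new TreeMap<>();
>     for (long i = 0; i < 1_000_000; i++) {
>       containers.put(i, "CLOSED");
>     }
>     // Writer thread adds a new "container" while the main thread streams.
>     Thread writer = new Thread(() -> containers.put(-1L, "OPEN"));
>     writer.start();
>     // Likely (not guaranteed) to throw ConcurrentModificationException if
>     // the put lands while the stream is still iterating the key set.
>     containers.keySet().stream().collect(Collectors.toList());
>     writer.join();
>   }
> }
> {code}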
> The metrics code calls into ContainerManagerImpl.getContainers:
> {code:java}
> @Override
> public List<ContainerInfo> getContainers(final LifeCycleState state) {
>   return containerStateManager.getContainerIDs(state).stream()
>       .map(ContainerID::getProtobuf)
>       .map(containerStateManager::getContainer)
>       .filter(Objects::nonNull)
>       .collect(Collectors.toList());
> }
> {code}
> This is quite an expensive operation for any state with a large number of
> containers (e.g. CLOSED containers): we take a TreeSet stored in
> ContainerStateManager, map each ContainerID to its proto, only for it to be
> mapped back to a ContainerID again inside containerStateManager::getContainer.
> It then checks that each container exists and collects the results into a new
> list, on which the metrics code simply calls size.
> This work does not seem necessary, as ContainerStateMap maintains the
> structures for a container under a lock and hence if the container is in the
> state map, it must be in the container map too.
> To avoid this seemingly expensive operation on a large cluster with millions
> of containers, we should probably add a `getContainerStateCount(final
> LifeCycleState state)` API to ContainerManager which ultimately calls
> `size` on the relevant map inside ContainerStateMap.
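> A minimal sketch of what that could look like (the method name and the exact
> accessor are assumptions, not the final API):
> {code:java}
> // Hypothetical addition to ContainerManager / ContainerManagerImpl.
> @Override
> public int getContainerStateCount(final LifeCycleState state) {
>   // getContainerIDs(state) is the existing per-state set accessor shown
>   // above; calling size() avoids materialising ContainerInfo objects and
>   // protobuf conversions just to count the containers in a state.
>   return containerStateManager.getContainerIDs(state).size();
> }
> {code}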
> However, the two getContainers(...) methods in ContainerManagerImpl are not
> threadsafe as they stand, and are used in other places (e.g. from the CLI to
> list containers), so they need to be fixed too.
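> One possible shape for that fix, sketched below, is to copy the per-state ID
> set under a read lock before streaming it; the "lock" field is a placeholder,
> as the actual guarding mechanism in ContainerManagerImpl may differ:
> {code:java}
> @Override
> public List<ContainerInfo> getContainers(final LifeCycleState state) {
>   lock.readLock().lock();
>   try {
>     // Copy the ID set while holding the lock so a concurrent container
>     // addition cannot invalidate the iteration below.
>     final Set<ContainerID> ids =
>         new HashSet<>(containerStateManager.getContainerIDs(state));
>     return ids.stream()
>         .map(ContainerID::getProtobuf)
>         .map(containerStateManager::getContainer)
>         .filter(Objects::nonNull)
>         .collect(Collectors.toList());
>   } finally {
>     lock.readLock().unlock();
>   }
> }
> {code}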
> Additionally, in SCMClientProtocolServer.listContainer, the code does not
> seem to use the fact that the "container lists by state" are already sorted
> and held in TreeMaps, so the paging could be performed much more efficiently.
> I fear we are potentially forming some very large lists twice per call, only
> to return a small slice of the list each time. I have raised HDDS-6315 for
> this issue, to keep the issues separate.
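> For reference, a rough sketch of what more efficient paging over a sorted ID
> set could look like (plain Java collections rather than the actual Ozone
> types, purely illustrative of the HDDS-6315 idea):
> {code:java}
> // Assume the per-state container IDs are kept in a sorted NavigableSet.
> NavigableSet<Long> ids = new TreeSet<>();
> long startId = 100L;  // first container ID of the requested page
> int count = 10;       // page size
> // tailSet() gives a view starting at startId without copying the whole set,
> // and limit() stops after one page, so no large intermediate list is built.
> List<Long> page = ids.tailSet(startId, true).stream()
>     .limit(count)
>     .collect(Collectors.toList());
> {code}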
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]