Stephen O'Donnell created HDDS-6314:
---------------------------------------
Summary: ConcurrentModificationException getting SCM Container
Metrics
Key: HDDS-6314
URL: https://issues.apache.org/jira/browse/HDDS-6314
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Reporter: Stephen O'Donnell
Benchmarking the ContainerReports, I noticed the following
ConcurrentModificationException in the SCM logs. As I understand it, this will
happen anytime the metrics are built and a new container is added at the same
time:
{code}
scm_1 | 2022-02-13 21:32:40,505 [Timer for 'StorageContainerManager'
metrics system] ERROR impl.MetricsSourceAdapter: Error getting metrics from
source SCMContainerMetrics
scm_1 | java.util.ConcurrentModificationException
scm_1 | at
java.base/java.util.TreeMap$KeySpliterator.forEachRemaining(TreeMap.java:2750)
scm_1 | at
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
scm_1 | at
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
scm_1 | at
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
scm_1 | at
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
scm_1 | at
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
scm_1 | at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getContainers(ContainerManagerImpl.java:172)
scm_1 | at
org.apache.hadoop.hdds.scm.server.StorageContainerManager.getContainerStateCount(StorageContainerManager.java:1764)
scm_1 | at
org.apache.hadoop.hdds.scm.server.SCMContainerMetrics.getMetrics(SCMContainerMetrics.java:68)
scm_1 | at
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
scm_1 | at
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.snapshotMetrics(MetricsSystemImpl.java:423)
scm_1 | at
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.sampleMetrics(MetricsSystemImpl.java:410)
scm_1 | at
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.onTimerEvent(MetricsSystemImpl.java:385)
scm_1 | at
org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4.run(MetricsSystemImpl.java:372)
scm_1 | at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
scm_1 | at java.base/java.util.TimerThread.run(Timer.java:506)
{code}
The metrics code calls into ContainerManagerImpl.getContainers:
{code}
@Override
public List<ContainerInfo> getContainers(final LifeCycleState state) {
return containerStateManager.getContainerIDs(state).stream()
.map(ContainerID::getProtobuf)
.map(containerStateManager::getContainer)
.filter(Objects::nonNull).collect(Collectors.toList());
}
{code}
This is quite an expensive operation for any containers where there are a large
number in a state (eg CLOSED containers), as we take a TreeSet stored in
ContainerStateManager, map the ContainerID to proto, only for it to be mapped
back to a ContainerID again inside containerStateManager::getContainer. Then it
checks if the container exists and pulls the results into a new list. There the
metrics manager just calls size on it.
This work does not seem necessary, as ContainerStateMap maintains the
structures for a container under a lock and hence if the container is in the
state map, it must be in the container map too.
To avoid this seemingly expensive operation on a large cluster with millions of
containers, we should probably provide a `getContainerStateCount(final
LifeCycleState state)` API into ContainerManager where it ultimately calls
`size` on the relevant map inside ContainerStateMap.
However the two getContainers(...) methods in ContainerManagerImpl are not
threadsafe as they stand, and are used in other places (eg from the CLI to list
containers), so they need fixed too.
Additionally, in SCMClientProtocolServer.listContainer, the code does not seem
to use the fact that the "container lists by state" are already sorted and in
TreeMaps, so they could perform the paging much more efficiently. I fear we are
potentially forming some very large lists twice per call, only to return a
small slice of the list each time.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]