Jason O'Sullivan created HDDS-14730:
---------------------------------------
Summary: SCM listContainer API has performance issues at scale (200K+ containers)
Key: HDDS-14730
URL: https://issues.apache.org/jira/browse/HDDS-14730
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Recon, SCM
Affects Versions: 1.4.0
Environment: * Production cluster with 200K CLOSED containers
* SCM with default configuration
* Recon sync interval: 60 seconds
Reporter: Jason O'Sullivan
The StorageContainerLocationProtocol.listContainer() API exhibits severe
performance degradation at scale: response times increase from ~20ms at
20K containers to 1-2 seconds at 200K containers, roughly a 50-100x slowdown
for a 10x increase in container count. This primarily affects Recon sync
operations but also impacts any client that lists large numbers of containers.
h4. Symptoms
* Recon sync operations taking minutes to complete
* High CPU usage on SCM during container listing
* RPC latency spike from <50ms to 1-2 seconds
* Customers forced to increase the Recon sync interval to 30 days as a workaround
h4. Root Causes Identified
* Per-container lock acquisition
** Acquires an individual striped read lock for each container lookup
** At 200K containers: 200K lock acquisitions per RPC
** Observed latency variance: 12-107ms for the same dataset
* Excessive RPC payload
** Returns full ContainerInfo objects
** Recon only needs ContainerIDs for sync
** At 200K containers: ~100MB payload (full ContainerInfo objects) vs ~1.6MB (ContainerIDs only)
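The two root causes above can be illustrated with a minimal, self-contained Java sketch. This is not SCM code: the class ListContainerSketch, its methods, the stripe count of 512, and the in-memory map are all hypothetical stand-ins. The single-lock method is included only as a point of comparison for counting lock acquisitions, not as a fix proposed in this report. The payload estimate assumes each ContainerID serializes as an 8-byte long and reads "~100MB" as 100,000,000 bytes.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model for illustration only; none of these names are Ozone classes.
class ListContainerSketch {
  static final int STRIPES = 512;               // assumed stripe count
  static final long BYTES_PER_ID = Long.BYTES;  // assumes IDs serialize as 8-byte longs

  private final ReentrantReadWriteLock[] stripes = new ReentrantReadWriteLock[STRIPES];
  private final ReentrantReadWriteLock global = new ReentrantReadWriteLock();
  private final ConcurrentSkipListMap<Long, String> containers = new ConcurrentSkipListMap<>();
  long lockAcquisitions = 0; // read-lock acquisitions made by the list methods

  ListContainerSketch(int count) {
    for (int i = 0; i < STRIPES; i++) {
      stripes[i] = new ReentrantReadWriteLock();
    }
    for (long id = 1; id <= count; id++) {
      containers.put(id, "CLOSED");
    }
  }

  // Pattern described in the report: one striped read lock per container lookup.
  List<Long> listWithPerContainerLocks() {
    List<Long> ids = new ArrayList<>();
    for (Long id : containers.keySet()) {
      ReentrantReadWriteLock lock = stripes[(int) (id % STRIPES)];
      lock.readLock().lock();
      lockAcquisitions++;
      try {
        if (containers.get(id) != null) {
          ids.add(id);
        }
      } finally {
        lock.readLock().unlock();
      }
    }
    return ids;
  }

  // Comparison only: a single read lock held for the whole listing.
  List<Long> listWithSingleLock() {
    global.readLock().lock();
    lockAcquisitions++;
    try {
      return new ArrayList<>(containers.keySet());
    } finally {
      global.readLock().unlock();
    }
  }

  // Payload size if only container IDs were returned.
  static long idOnlyPayloadBytes(long containerCount) {
    return containerCount * BYTES_PER_ID;
  }

  public static void main(String[] args) {
    ListContainerSketch scm = new ListContainerSketch(200_000);
    scm.listWithPerContainerLocks();
    System.out.println("per-container lock acquisitions: " + scm.lockAcquisitions);
    scm.lockAcquisitions = 0;
    scm.listWithSingleLock();
    System.out.println("single-lock acquisitions: " + scm.lockAcquisitions);
    System.out.println("ID-only payload bytes: " + idOnlyPayloadBytes(200_000));
    // Implied average size of one ContainerInfo if the full payload is ~100MB.
    System.out.println("implied bytes per ContainerInfo: " + 100_000_000L / 200_000);
  }
}
{code}

Under these assumptions, the ID-only payload for 200K containers is 1,600,000 bytes, which matches the ~1.6MB figure above, and a ~100MB full payload implies roughly 500 bytes per ContainerInfo. The sketch only counts lock acquisitions; it does not reproduce the contention or latency seen on a real SCM.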
h4. Steps to Reproduce
# Create a test cluster with 200K+ closed containers
** Configure small containers to make testing easier
{code}
ozone.scm.container.size=2MB
ozone.scm.block.size=1MB
{code}
** Generate load (400,000 keys of 1 MiB each; with 2MB containers this should produce roughly 200K containers)
{code}ozone freon ockg -n=400000 --size=1048576 --thread=100{code}
# Enable Recon and observe sync logs
{code}grep "Got list of containers from SCM" /var/log/hadoop-ozone/ozone-recon.log{code}