[jira] [Created] (HDDS-14730) SCM listContainer API has performance issues at scale (200K+ containers)

Jason O'Sullivan (Jira) Thu, 26 Feb 2026 07:34:06 -0800

Jason O'Sullivan created HDDS-14730:
---------------------------------------


             Summary: SCM listContainer API has performance issues at scale 
(200K+ containers) 
                 Key: HDDS-14730
                 URL: https://issues.apache.org/jira/browse/HDDS-14730
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Recon, SCM
    Affects Versions: 1.4.0
         Environment: * Production cluster with 200K CLOSED containers

 * SCM with default configuration

 * Recon sync interval: 60 seconds
            Reporter: Jason O'Sullivan


The StorageContainerLocationProtocol.listContainer() API exhibits severe 
performance degradation at scale: response times increase from ~20ms at 
20K containers to 1-2 seconds at 200K containers, a 50-100x slowdown for a 10x 
increase in container count. This primarily affects Recon sync operations but 
also impacts any client that lists large numbers of containers.
h4. Symptoms
 * Recon sync operations taking minutes to complete

 * High CPU usage on SCM during container listing

 * RPC latency spike from <50ms to 1-2 seconds

 * Customers forced to increase the Recon sync interval to 30 days as a workaround

h4. Root Causes Identified
 * Per-container lock acquisition
 ** Acquires individual striped read lock for each container lookup
 ** At 200K containers: 200K lock acquisitions per RPC
 ** Observed latency variance: 12-107ms for same dataset
 * Excessive RPC payload
 ** Returns full ContainerInfo objects
 ** Recon only needs ContainerIDs for sync
 ** At 200K containers: ~100MB payload with full ContainerInfo objects vs ~1.6MB with ContainerIDs only
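
The two root causes above can be illustrated with a minimal, self-contained sketch. It is not the SCM implementation: the class name, the stripe count, the in-memory map, and the single-lock alternative are all hypothetical, chosen only to show how the number of lock acquisitions scales with the container count. It also assumes a container ID is a 64-bit value, which appears consistent with the ~1.6MB figure (200K x 8 bytes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only: none of these names come from the Ozone code base.
// The sketch is single-threaded; it counts lock acquisitions, it does not model contention.
final class ListContainerSketch {
  private static final int STRIPES = 512; // assumed stripe count

  private final ReadWriteLock[] stripes = new ReadWriteLock[STRIPES];
  private final ReadWriteLock global = new ReentrantReadWriteLock();
  private final TreeMap<Long, String> containers = new TreeMap<>();
  private final AtomicLong lockAcquisitions = new AtomicLong();

  ListContainerSketch(int containerCount) {
    for (int i = 0; i < STRIPES; i++) {
      stripes[i] = new ReentrantReadWriteLock();
    }
    for (long id = 1; id <= containerCount; id++) {
      containers.put(id, "CLOSED");
    }
  }

  // Pattern described in the issue: one striped read lock per container lookup.
  List<Long> listWithPerContainerLocks() {
    List<Long> ids = new ArrayList<>(containers.size());
    for (Long id : containers.keySet()) {
      ReadWriteLock lock = stripes[(int) (id % STRIPES)];
      lock.readLock().lock();
      lockAcquisitions.incrementAndGet();
      try {
        if (containers.get(id) != null) {
          ids.add(id);
        }
      } finally {
        lock.readLock().unlock();
      }
    }
    return ids;
  }

  // One possible alternative (an assumption, not a fix proposed in the issue):
  // take a single read lock and copy only the IDs.
  List<Long> listWithSingleLock() {
    global.readLock().lock();
    lockAcquisitions.incrementAndGet();
    try {
      return new ArrayList<>(containers.keySet());
    } finally {
      global.readLock().unlock();
    }
  }

  long lockAcquisitions() {
    return lockAcquisitions.get();
  }

  // Raw size of an ID-only response, assuming 8 bytes per ID and no encoding overhead.
  static long idOnlyPayloadBytes(int containerCount) {
    return (long) containerCount * Long.BYTES;
  }

  public static void main(String[] args) {
    int n = 200_000;
    ListContainerSketch perContainer = new ListContainerSketch(n);
    perContainer.listWithPerContainerLocks();
    ListContainerSketch single = new ListContainerSketch(n);
    single.listWithSingleLock();
    System.out.println("per-container lock acquisitions: " + perContainer.lockAcquisitions());
    System.out.println("single-lock acquisitions: " + single.lockAcquisitions());
    System.out.println("ID-only payload bytes: " + idOnlyPayloadBytes(n));
  }
}
```

With n = 200,000 the first listing takes 200,000 lock acquisitions and the second takes one, while the ID-only payload comes to 1,600,000 bytes. By the same arithmetic, the reported ~100MB works out to roughly 500 bytes per ContainerInfo.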

h4. Steps to Reproduce
 # Create a test cluster with 200K+ closed containers
 ** Configure small containers to make testing easier
{code}
ozone.scm.container.size=2MB
ozone.scm.block.size=1MB
{code}
 ** Generate load
{code}ozone freon ockg -n=400000 --size=1048576 --thread=100{code}
 # Enable Recon and observe sync logs
{code}grep "Got list of containers from SCM" /var/log/hadoop-ozone/ozone-recon.log{code}




