[jira] [Comment Edited] (HDDS-14730) SCM listContainer API has performance issues at scale (200K+ containers)

Jason O'Sullivan (Jira) Fri, 27 Feb 2026 08:17:36 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061665#comment-18061665
 ]


Jason O'Sullivan edited comment on HDDS-14730 at 2/27/26 4:15 PM:
------------------------------------------------------------------

* Per-container lock acquisition
 ** Acquires individual striped read lock for each container lookup
 ** At 200K containers: 200K lock acquisitions per RPC
 ** Observed latency variance: 12-107ms for same dataset

The lock acquisition side of this issue has already been addressed by HDDS-12555.
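
For readers unfamiliar with the pattern in the bullets above, here is a minimal sketch of why per-container locking scales with the container count. It is not SCM code and makes no claim about how HDDS-12555 changed it: the class name, the stripe count of 512, and the in-memory ID array are all assumptions made for illustration.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: counts lock acquisitions for two listing strategies. */
public class LockAcquisitionSketch {
    // Hypothetical stripe count; not taken from SCM.
    private static final int STRIPES = 512;

    private final ReadWriteLock[] stripes = new ReadWriteLock[STRIPES];
    private final ReadWriteLock global = new ReentrantReadWriteLock();
    private final long[] containerIds;
    private long acquisitions;

    public LockAcquisitionSketch(int containerCount) {
        for (int i = 0; i < STRIPES; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
        containerIds = new long[containerCount];
        for (int i = 0; i < containerCount; i++) {
            containerIds[i] = i + 1L;
        }
    }

    /** Takes one striped read lock per container: N acquisitions per call. */
    public long listPerContainerLock() {
        acquisitions = 0;
        long checksum = 0;
        for (long id : containerIds) {
            ReadWriteLock lock = stripes[(int) (id % STRIPES)];
            lock.readLock().lock();
            acquisitions++;
            try {
                checksum += id; // stand-in for the per-container lookup
            } finally {
                lock.readLock().unlock();
            }
        }
        return checksum;
    }

    /** Takes a single read lock around the whole scan: 1 acquisition per call. */
    public long listSingleLock() {
        acquisitions = 0;
        long checksum = 0;
        global.readLock().lock();
        acquisitions++;
        try {
            for (long id : containerIds) {
                checksum += id;
            }
        } finally {
            global.readLock().unlock();
        }
        return checksum;
    }

    public long acquisitions() {
        return acquisitions;
    }

    public static void main(String[] args) {
        LockAcquisitionSketch sketch = new LockAcquisitionSketch(200_000);
        sketch.listPerContainerLock();
        System.out.println("per-container lock acquisitions: " + sketch.acquisitions());
        sketch.listSingleLock();
        System.out.println("single lock acquisitions: " + sketch.acquisitions());
    }
}
```

With 200K containers the first method reports 200000 acquisitions and the second reports 1. The sketch only counts acquisitions; it does not attempt to reproduce the 12-107ms latency variance.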


was (Author: JIRAUSER310528):
* Per-container lock acquisition
 ** Acquires individual striped read lock for each container lookup
 ** At 200K containers: 200K lock acquisitions per RPC
 ** Observed latency variance: 12-107ms for same dataset

This has already been addressed by HDDS-12555

> SCM listContainer API has performance issues at scale (200K+ containers) 
> -------------------------------------------------------------------------
>
>                 Key: HDDS-14730
>                 URL: https://issues.apache.org/jira/browse/HDDS-14730
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Recon, SCM
>    Affects Versions: 1.4.0
>         Environment: * Production cluster with 200K CLOSED containers
>  * SCM with default configuration
>  * Recon sync interval: 60 seconds
>            Reporter: Jason O'Sullivan
>            Assignee: Jason O'Sullivan
>            Priority: Major
>
> The StorageContainerLocationProtocol.listContainer() API exhibits severe 
> performance degradation at scale, with response times increasing from ~20ms 
> at 20K containers to 1-2 seconds at 200K containers. This primarily affects 
> Recon sync operations but impacts any client listing large numbers of 
> containers.
> h4. Symptoms
>  * Recon sync operations taking minutes to complete
>  * High CPU usage on SCM during container listing
>  * RPC latency spike from <50ms to 1-2 seconds
>  * Customers forced to increase Recon sync interval to 30 days as a workaround
> h4. Root Causes Identified
>  * Per-container lock acquisition
>  ** Acquires individual striped read lock for each container lookup
>  ** At 200K containers: 200K lock acquisitions per RPC
>  ** Observed latency variance: 12-107ms for same dataset
>  * Excessive RPC payload
>  ** Returns full ContainerInfo objects
>  ** Recon only needs ContainerIDs for sync
>  ** At 200K containers: ~100MB payload with full ContainerInfo objects vs 1.6MB with ContainerIDs only
> h4. Steps to Reproduce
>  # Create a test cluster with 200K+ closed containers
>  ** Configure small containers to make testing easier
> {code}
> ozone.scm.container.size=2MB
> ozone.scm.block.size=1MB
> {code}
>  ** Generate load
> {code}ozone freon ockg -n=400000 --size=1048576 --thread=100{code}
>  # Enable Recon and observe sync logs
>  {code}grep "Got list of containers from SCM" /var/log/hadoop-ozone/ozone-recon.log{code}
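
The payload figures in the description above can be sanity-checked with some quick arithmetic. The sketch below assumes each ContainerID is sent as a fixed 8-byte value with no protobuf framing overhead, and that 1MB means 10^6 bytes; the per-ContainerInfo size is not measured, only derived from the ~100MB figure quoted in the description. The class and method names are hypothetical.

```java
/** Back-of-the-envelope check of the payload sizes quoted in the description. */
public class PayloadEstimate {
    static final long CONTAINERS = 200_000L;
    // ~100MB full-ContainerInfo payload, taken from the description (1MB = 10^6 bytes assumed).
    static final long FULL_PAYLOAD_BYTES = 100_000_000L;
    // Assumption: one ID is a fixed 64-bit value, ignoring any encoding overhead.
    static final long BYTES_PER_ID = Long.BYTES;

    /** Size of a response that carries only container IDs. */
    static long idOnlyPayloadBytes() {
        return CONTAINERS * BYTES_PER_ID;
    }

    /** Average ContainerInfo size implied by the ~100MB figure. */
    static long impliedBytesPerContainerInfo() {
        return FULL_PAYLOAD_BYTES / CONTAINERS;
    }

    /** How many times smaller the ID-only response would be. */
    static double reductionFactor() {
        return (double) FULL_PAYLOAD_BYTES / idOnlyPayloadBytes();
    }

    public static void main(String[] args) {
        System.out.println("ID-only payload (bytes): " + idOnlyPayloadBytes());
        System.out.println("Implied bytes per ContainerInfo: " + impliedBytesPerContainerInfo());
        System.out.println("Reduction factor: " + reductionFactor());
    }
}
```

Under these assumptions the ID-only payload comes to 1,600,000 bytes, which matches the 1.6MB in the description, and the ~100MB figure implies roughly 500 bytes per ContainerInfo, about a 62.5x difference.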



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

