[ 
https://issues.apache.org/jira/browse/HDDS-7936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754937#comment-17754937
 ] 

Sumit Agrawal commented on HDDS-7936:
-------------------------------------

[~duongnguyen]
 * FCR is already handled, 1 FCR per DN is kept in queue.
 * ICR can not be removed for certain cases like delete container can not be 
truncated,
 ** as FCR not having delete container but present at SCM can show as missing 
container.

FCR was the major contribution for OOM of SCM, as it includes all closed 
container report which will be high over period of time.

 

IMO, this is not required to be handled and this can be closed. Once we face 
some problem, this can be looked...

[~duongnguyen] share your suggestion ...

> Better control of SCM container report queue size
> -------------------------------------------------
>
>                 Key: HDDS-7936
>                 URL: https://issues.apache.org/jira/browse/HDDS-7936
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 1.3.0
>            Reporter: Duong
>            Assignee: Sumit Agrawal
>            Priority: Major
>
> Today, by default SCM creates 10 in-memory blocking queues for the container 
> report handlers (each queue reserved for a handler thread). The queues are 
> shared for Full Container Report (FCR) and Incremental Container Report 
> (ICR).  By default, the max_capacity is set to *100,000* per queue, 
> [ref|https://github.com/apache/ozone/blob/master/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/ScmConfigKeys.java#L499-L499].
> {code:java}
> public static List<BlockingQueue<ContainerReport>> initContainerReportQueue(
>     OzoneConfiguration configuration) {
>   int threadPoolSize = configuration.getInt(getContainerReportConfPrefix()
>           + ".thread.pool.size",
>       OZONE_SCM_EVENT_THREAD_POOL_SIZE_DEFAULT);
>   int queueSize = configuration.getInt(getContainerReportConfPrefix()
>           + ".queue.size",
>       OZONE_SCM_EVENT_CONTAINER_REPORT_QUEUE_SIZE_DEFAULT);
>   List<BlockingQueue<ContainerReport>> queues = new ArrayList<>();
>   for (int i = 0; i < threadPoolSize; ++i) {
>     queues.add(new ContainerReportQueue(queueSize));
>   }
>   return queues;
> } {code}
> That indicates a few potential problems:
>  # The number of possible queued messages can reach {*}1M{*}, which is 
> {*}HUGE{*}. Assuming each ICR message is 10KB, 1M ICR messages are 10GB. FCRs 
> are much bigger as each many container reports, and usually is a matter of 
> {*}MBs{*}. 10K FCR messages are enough to swallow all SCM heap memory.
> Such a default configuration indeed doesn't limit any boundary for the 
> container report queues. And that opens a good chance for SCM to choke and 
> crash if it slows down the consumption of reports at any point, or in any 
> event that many datanodes send ICR at the same time.
>  # As mentioned above, ICR and FCR are very different in size (and probably 
> in the ability of SCM to process). It's not good to use the same queue for 
> both types. Separation is needed to differentiate queue size control for them.
> To better protect:
>  # Separate ICR and FCR queues.
>  # Set the right default boundary for each queue type. *5000* (in total) for 
> {*}FCR{*}, and *20,000* for *ICR* can be a good start.
>  # Do we really need multiple queues per message type? The BlockingQueue is 
> designed for the scene of using a single queue to sever multiple 
> producers/consumers. If there's no particular reason to do so, I suggest 
> simplifying the model to use a single queue for a message handler.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to