[
https://issues.apache.org/jira/browse/HDDS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-15330:
-------------------------------
Description:
We have previous instances where a new bootstrapped SCM becomes OOM (FYI the
OOM has 96GB heap size).
See the attached MAT tool ParseHeapDump.sh output. The possible issue is that
FCRs are currently stored in both the ContainerReportQueue and buffered in the
RPC handlers (ozone.scm.datanode.handler.count.key default to 100) and also at
most 1 FCR per DN is buffered in ContainerReportQueue. If a lot of DN sends FCR
at the same time (which should happen for newly restarted / bootstrapped SCM),
this will trigger bursty FCR reports from all DNs, which can trigger memory
issue.
HDFS implements a full block reports rate limit in HDFS-7923 to reduce the
concurrent block reports residing in SCM using BlockReportLeaseManager. Ozone
should also implement similar mechanism to prevent FCR storms.
A possible design is that we register DN first, but don't include the full FCR
immediately. SCM grants only N datanodes permission to send FCRs at once,
similar to HDFS implementation.
Another possibility to reduce the single FCR size to to split the FCR to one
FCR per volume (can be considered in the future).
One tradeoff of the rate-limiting is that new SCM might delay the SafeMode
exit. However, this is better than SCM OOM. Another tradeoff is that FCR might
be delayed for large cluster (we need to think about this).
was:
We have previous instances where a new bootstrapped SCM becomes OOM (FYI the
OOM has 96GB heap size).
See the attached MAT tool ParseHeapDump.sh output. The possible issue is that
FCRs are currently stored in both the ContainerReportQueue and buffered in the
RPC handlers (ozone.scm.datanode.handler.count.key default to 100) and also at
most 1 FCR per DN is buffered in ContainerReportQueue. If a lot of DN sends FCR
at the same time (which should happen for newly restarted / bootstrapped SCM),
this will trigger bursty FCR reports from all DNs, which can trigger issue.
HDFS implements a full block reports rate limit in HDFS-7923 to reduce the
concurrent block reports residing in SCM using BlockReportLeaseManager. Ozone
should also implement similar mechanism to prevent FCR storms.
A possible design is that we register DN first, but don't include the full FCR
immediately. SCM grants only N datanodes permission to send FCRs at once,
similar to HDFS implementation.
Another possibility to reduce the single FCR size to to split the FCR to one
FCR per volume (can be considered in the future).
One tradeoff of the rate-limiting is that new SCM might delay the SafeMode
exit. However, this is better than SCM OOM. Another tradeoff is that FCR might
be delayed for large cluster (we need to think about this).
> Implement SCM FCR rate limit
> ----------------------------
>
> Key: HDDS-15330
> URL: https://issues.apache.org/jira/browse/HDDS-15330
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: scm_Leak_Suspects_oom.zip, scm_System_Overview_oom.zip,
> scm_Top_Components_oom.zip
>
>
> We have previous instances where a new bootstrapped SCM becomes OOM (FYI the
> OOM has 96GB heap size).
> See the attached MAT tool ParseHeapDump.sh output. The possible issue is that
> FCRs are currently stored in both the ContainerReportQueue and buffered in
> the RPC handlers (ozone.scm.datanode.handler.count.key default to 100) and
> also at most 1 FCR per DN is buffered in ContainerReportQueue. If a lot of DN
> sends FCR at the same time (which should happen for newly restarted /
> bootstrapped SCM), this will trigger bursty FCR reports from all DNs, which
> can trigger memory issue.
> HDFS implements a full block reports rate limit in HDFS-7923 to reduce the
> concurrent block reports residing in SCM using BlockReportLeaseManager. Ozone
> should also implement similar mechanism to prevent FCR storms.
> A possible design is that we register DN first, but don't include the full
> FCR immediately. SCM grants only N datanodes permission to send FCRs at once,
> similar to HDFS implementation.
> Another possibility to reduce the single FCR size to to split the FCR to one
> FCR per volume (can be considered in the future).
> One tradeoff of the rate-limiting is that new SCM might delay the SafeMode
> exit. However, this is better than SCM OOM. Another tradeoff is that FCR
> might be delayed for large cluster (we need to think about this).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]