[ 
https://issues.apache.org/jira/browse/HDDS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-15330:
-------------------------------
    Description: 
We have previous instances where a new bootstrapped SCM encountered OOM (FYI 
the OOM has 96GB heap size). 

See the attached MAT tool ParseHeapDump.sh output. The possible issue is that 
FCRs are currently stored in 
1. ContainerReportQueue (at most 1, this accounts to 15.2 GB) 
2. Hadoop RPC handlers / Server$Handler(ozone.scm.datanode.handler.count.key 
default to 100 and this accounts to 39.7 GB)
3. Hadoop RPC server call queue / ProtobufRpcEngine$Server (this accounts for 
9.4 GB)
4. Hadoop RPC connection / Server$Connection  (this accounts for 6.0 GB)

If a lot of DN sends FCR at the same time (which should happen for newly 
restarted / bootstrapped SCM), this will trigger bursty FCR reports from all 
DNs, which can trigger memory issue.

HDFS implements a full block reports rate limit in HDFS-7923 to reduce the 
concurrent block reports residing in SCM using BlockReportLeaseManager. Ozone 
should also implement similar mechanism to prevent FCR storms. A possible 
design is that we register DN first, but don't include the full FCR 
immediately. SCM grants only N datanodes permission to send FCRs at once, 
similar to HDFS implementation. This way, we can ensure that N FCR can exists 
at one time in SCM heap memory.

Another possibility to reduce the single FCR size to to split the FCR to one 
FCR per volume (can be considered in the future). 

One tradeoff of the rate-limiting is that new SCM might delay the SafeMode 
exit. However, this is better than SCM OOM. Another tradeoff is that FCR might 
be delayed for large cluster (we need to think about this). Another issue 
mentioned in HDDS-7923 is that there might be some DNs that might take a long 
time to send its FCR to the SCM since it never gets the lease, but seems HDFS 
has not handled it as well.

  was:
We have previous instances where a new bootstrapped SCM encountered OOM (FYI 
the OOM has 96GB heap size). 

See the attached MAT tool ParseHeapDump.sh output. The possible issue is that 
FCRs are currently stored in 
1. ContainerReportQueue (at most 1, this accounts to 15.2 GB) 
2. Hadoop RPC handlers / Server$Handler(ozone.scm.datanode.handler.count.key 
default to 100 and this accounts to 39.7 GB)
3. Hadoop RPC server call queue / ProtobufRpcEngine$Server (this accounts for 
9.4 GB)
4. Hadoop RPC connection / Server$Connection  (this accounts for 6.0 GB)

If a lot of DN sends FCR at the same time (which should happen for newly 
restarted / bootstrapped SCM), this will trigger bursty FCR reports from all 
DNs, which can trigger memory issue.

HDFS implements a full block reports rate limit in HDFS-7923 to reduce the 
concurrent block reports residing in SCM using BlockReportLeaseManager. Ozone 
should also implement similar mechanism to prevent FCR storms. A possible 
design is that we register DN first, but don't include the full FCR 
immediately. SCM grants only N datanodes permission to send FCRs at once, 
similar to HDFS implementation. This way, we can ensure that N FCR can exists 
at one time in SCM heap memory.

Another possibility to reduce the single FCR size to to split the FCR to one 
FCR per volume (can be considered in the future). 

One tradeoff of the rate-limiting is that new SCM might delay the SafeMode 
exit. However, this is better than SCM OOM. Another tradeoff is that FCR might 
be delayed for large cluster (we need to think about this).


> Implement SCM FCR rate limit
> ----------------------------
>
>                 Key: HDDS-15330
>                 URL: https://issues.apache.org/jira/browse/HDDS-15330
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>         Attachments: scm_Leak_Suspects_oom.zip, scm_System_Overview_oom.zip, 
> scm_Top_Components_oom.zip
>
>
> We have previous instances where a new bootstrapped SCM encountered OOM (FYI 
> the OOM has 96GB heap size). 
> See the attached MAT tool ParseHeapDump.sh output. The possible issue is that 
> FCRs are currently stored in 
> 1. ContainerReportQueue (at most 1, this accounts to 15.2 GB) 
> 2. Hadoop RPC handlers / Server$Handler(ozone.scm.datanode.handler.count.key 
> default to 100 and this accounts to 39.7 GB)
> 3. Hadoop RPC server call queue / ProtobufRpcEngine$Server (this accounts for 
> 9.4 GB)
> 4. Hadoop RPC connection / Server$Connection  (this accounts for 6.0 GB)
> If a lot of DN sends FCR at the same time (which should happen for newly 
> restarted / bootstrapped SCM), this will trigger bursty FCR reports from all 
> DNs, which can trigger memory issue.
> HDFS implements a full block reports rate limit in HDFS-7923 to reduce the 
> concurrent block reports residing in SCM using BlockReportLeaseManager. Ozone 
> should also implement similar mechanism to prevent FCR storms. A possible 
> design is that we register DN first, but don't include the full FCR 
> immediately. SCM grants only N datanodes permission to send FCRs at once, 
> similar to HDFS implementation. This way, we can ensure that N FCR can exists 
> at one time in SCM heap memory.
> Another possibility to reduce the single FCR size to to split the FCR to one 
> FCR per volume (can be considered in the future). 
> One tradeoff of the rate-limiting is that new SCM might delay the SafeMode 
> exit. However, this is better than SCM OOM. Another tradeoff is that FCR 
> might be delayed for large cluster (we need to think about this). Another 
> issue mentioned in HDDS-7923 is that there might be some DNs that might take 
> a long time to send its FCR to the SCM since it never gets the lease, but 
> seems HDFS has not handled it as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to