[
https://issues.apache.org/jira/browse/HDDS-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497762#comment-17497762
]
Shawn commented on HDDS-6345:
-----------------------------
Still seeing the same issue: the OM was killed due to OOM. When the OM is in
this state, the WAL file can grow to tens of GB, which does not seem to be
constrained by the {{max_total_wal_size}} config.
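
For anyone who wants to check the WAL size on their own OM, here is a minimal
sketch, not an Ozone tool. It assumes the RocksDB WAL files sit directly in the
directory you pass in and use RocksDB's usual numbered *.log naming. The
function names and the path in the usage note are hypothetical.

```python
import os


def total_wal_bytes(wal_dir):
    """Sum the sizes of files ending in ".log" directly inside wal_dir.

    Assumption: WAL files are named like 000123.log. RocksDB's info log is
    named LOG (no ".log" suffix), so it is not counted.
    """
    total = 0
    for entry in os.scandir(wal_dir):
        if entry.is_file() and entry.name.endswith(".log"):
            total += entry.stat().st_size
    return total


def exceeds_limit(wal_dir, max_total_wal_size):
    """Return True if the on-disk WAL total is above the configured limit (bytes)."""
    return total_wal_bytes(wal_dir) > max_total_wal_size
```

Usage would be something like total_wal_bytes("/path/to/om.db") with the real
OM DB directory substituted, then comparing the result against the configured
{{max_total_wal_size}} in bytes.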
> OM always runs OOM in Kubernetes
> ---------------------------------
>
> Key: HDDS-6345
> URL: https://issues.apache.org/jira/browse/HDDS-6345
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Shawn
> Priority: Major
>
> I deployed Ozone 1.21 to Kubernetes with security enabled and with OM HA and
> SCM HA. However, one of the OMs always gets restarted by Kubernetes because
> of OOM. Even after I assigned 300GB of memory, the OM still keeps restarting
> due to OOM.
>
> After analysis, we found that the OOM was caused by RocksDB. When the OM is
> restarted, it first tries to open RocksDB. During this time, RocksDB tries
> to run compaction, which eventually runs out of memory. So there are three
> questions:
>
> 1. Why did the OM get into this state?
> 2. Why does RocksDB need so much memory to do the compaction?
> 3. How can this be resolved?
>
> Some info that may be useful: we deployed OM HA directly, rather than
> migrating from a single OM to OM HA. The OM that has issues is a follower,
> not the leader. The underlying PVC is backed by SSD. Our traffic is mostly
> large objects, hundreds of GB in size.