[ 
https://issues.apache.org/jira/browse/HDDS-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495780#comment-17495780
 ] 

Shawn commented on HDDS-6345:
-----------------------------

And the last few lines of the LOG file are as follows. Recovery was stuck on 
log #197184, which is 160 GB.


{code:java}
2022/02/17-14:12:18.398446 70000251a000 EVENT_LOG_v1 {"time_micros": 
1645135938398430, "job": 1, "event": "recovery_started", "wal_files": [197133, 
197136, 197139, 197141, 197143, 197146, 197149, 197151, 197153, 197156, 197159, 
197161, 197163, 197166, 197169, 197171, 197173, 197176, 197180, 197182, 197184]}
2022/02/17-14:12:18.398453 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197133 mode 2
2022/02/17-14:12:18.571055 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197136 mode 2
2022/02/17-14:12:18.731164 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197139 mode 2
2022/02/17-14:12:18.880858 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197141 mode 2
2022/02/17-14:12:19.052716 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197143 mode 2
2022/02/17-14:12:19.228465 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197146 mode 2
2022/02/17-14:12:19.399539 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197149 mode 2
2022/02/17-14:12:19.586655 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197151 mode 2
2022/02/17-14:12:19.756244 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197153 mode 2
2022/02/17-14:12:19.905050 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197156 mode 2
2022/02/17-14:12:20.072435 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197159 mode 2
2022/02/17-14:12:20.258739 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197161 mode 2
2022/02/17-14:12:20.430191 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197163 mode 2
2022/02/17-14:12:20.590025 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197166 mode 2
2022/02/17-14:12:20.754653 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197169 mode 2
2022/02/17-14:12:20.924973 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197171 mode 2
2022/02/17-14:12:21.160045 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197173 mode 2
2022/02/17-14:12:21.447427 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197176 mode 2
2022/02/17-14:12:22.195050 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197180 mode 2
2022/02/17-14:12:29.191599 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197182 mode 2
2022/02/17-14:15:36.604367 70000251a000 [/db_impl/db_impl_open.cc:870] 
Recovering log #197184 mode 2
{code}
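As a sanity check, the replay volume can be estimated from the WAL files themselves: RocksDB keeps its write-ahead logs as {{<number>.log}} files in the DB directory, and what it must replay on open is roughly their combined size. A minimal, stdlib-only sketch (the default path is just a placeholder; point it at the OM's RocksDB directory):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class WalSize {
    // Sum the sizes of RocksDB WAL files (named "<number>.log") in a DB directory.
    static long totalWalBytes(Path dbDir) throws IOException {
        try (Stream<Path> files = Files.list(dbDir)) {
            return files
                .filter(p -> p.getFileName().toString().matches("\\d+\\.log"))
                .mapToLong(p -> p.toFile().length())
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder: pass the OM's RocksDB directory (e.g. .../om.db) as an argument.
        Path dbDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("WAL bytes to replay: " + totalWalBytes(dbDir));
    }
}
{code}

Comparing this number before restart against the heap limit would show whether the replay volume alone explains the OOM.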

> OM always runs OOM in Kubernetes 
> ---------------------------------
>
>                 Key: HDDS-6345
>                 URL: https://issues.apache.org/jira/browse/HDDS-6345
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Shawn
>            Priority: Major
>
> I deployed Ozone 1.21 to Kubernetes with security enabled and with OM HA and 
> SCM HA. However, one of the OMs always gets restarted by Kubernetes because of 
> OOM. Even after assigning 300 GB of memory, the OM still keeps restarting due 
> to OOM.
>  
> After analysis, we found the OOM was caused by RocksDB. When the OM gets 
> restarted, it first tries to open RocksDB, and during this time RocksDB 
> tries to do compaction, which eventually runs out of memory. So there are 
> three questions:
>  
> 1. Why did the OM get into this state?
> 2. Why does RocksDB need so much memory for the compaction?
> 3. How can this be resolved?
> Some info that may be useful: we deployed OM HA directly, rather than 
> migrating from a single OM to HA OM. The OM that has the issue is a follower, 
> not a leader. The underlying PVC uses SSD. Our traffic is mostly large 
> objects, hundreds of GB in size.
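On question 3: if WALs were allowed to pile up to 160 GB on the follower, capping total live WAL size would bound how much must be replayed at the next open. The sketch below shows the relevant RocksDB (RocksJava) option; whether Ozone exposes it through its own configuration, and what value is safe for OM, are assumptions that would need verifying. The DB path is a placeholder.

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class BoundedWalOpen {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options()
                .setCreateIfMissing(true)
                // Once live WAL files exceed ~1 GB, RocksDB flushes the
                // memtables that pin them so old WALs can be deleted,
                // instead of accumulating tens of GB to replay after a crash.
                .setMaxTotalWalSize(1L << 30);
             RocksDB db = RocksDB.open(opts, "/data/om/om.db")) { // placeholder path
            // ... normal operation ...
        }
    }
}
{code}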



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
