[jira] [Commented] (HDDS-10402) OM unstable with long jvm pauses

Ethan Rose (Jira) Mon, 26 Feb 2024 11:52:46 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820836#comment-17820836
 ]


Ethan Rose commented on HDDS-10402:
-----------------------------------

[~sri9] does the issue still occur with Ozone 1.4.0? I believe there are some 
memory fixes in the OM in that release that 1.3.0 does not have.

> OM unstable with long jvm pauses
> --------------------------------
>
>                 Key: HDDS-10402
>                 URL: https://issues.apache.org/jira/browse/HDDS-10402
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: 1.3, Ozone Manager
>    Affects Versions: 1.3.0
>         Environment: *Any*
>            Reporter: sri
>            Assignee: Sadanand Shenoy
>            Priority: Major
>             Fix For: 1.3.0
>
>
> When we restart Ozone Manager (OM), we notice considerable degradation of 
> Ozone performance. Specifically, read/write operations are slower than 
> normal. We also see the following repeated errors in the OM logs.
>  
> +*>> Error Log: (OM)*+
> 2024-02-15 11:36:05,949 INFO 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Received 
> Configuration change notification from Ratis. New Peer list:
> [id: "om1"
> address: "xyz:9872"
> startupRole: LEADER  ----> A (New) --> Token auth failing om user
> , id: "om3"
> address: "B:9872"
> startupRole: FOLLOWER
> , id: "om2"
> address: "C:9872"
> startupRole: FOLLOWER  ----> C (New) --> Token auth failing om user
> ]
> 2024-02-15 14:02:20,852 WARN SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth failed for abc:45174:null (DIGEST-MD5: IO error acquiring password) with 
> true cause: (om1 is Leader but not ready to process request yet.)
> 2024-02-15 14:02:20,852 WARN SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth failed for xyz:42414:null (DIGEST-MD5: IO error acquiring password) with 
> true cause: (om1 is Leader but not ready to process request yet.)
>  
> +*>> Long and persistent JVM pause cycles during the leader election process:*+
> 2024-02-15 11:36:05,892 INFO org.apache.ratis.server.impl.RoleInfo: om1: 
> *shutdown om1@group-3B1F193E2D90-LeaderStateImpl*
> 2024-02-15 11:36:05,893 WARN org.apache.ratis.util.JvmPauseMonitor: 
> JvmPauseMonitor-om1: *Detected pause in JVM or host machine (eg GC): pause of 
> approximately 19274374277ns.*
>  
> +*>> Recon Log:*+
> 2024-02-15 23:21:09,029 ERROR 
> org.apache.hadoop.ozone.recon.tasks.OMDBUpdatesHandler: Exception when 
> reading key :
> java.io.IOException: Rocks Database is closed
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.assertClose(RocksDatabase.java:407)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.get(RocksDatabase.java:641)
>         at org.apache.hadoop.hdds.utils.db.RDBTable.get(RDBTable.java:110)
>         at org.apache.hadoop.hdds.utils.db.RDBTable.get(RDBTable.java:40)
>         at 
> org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(TypedTable.java:255)
>         at 
> org.apache.hadoop.hdds.utils.db.TypedTable.getSkipCache(TypedTable.java:195)
>         at 
> org.apache.hadoop.ozone.recon.tasks.OMDBUpdatesHandler.processEvent(OMDBUpdatesHandler.java:128)
>         at 
> org.apache.hadoop.ozone.recon.tasks.OMDBUpdatesHandler.put(OMDBUpdatesHandler.java:67)
>         at org.rocksdb.WriteBatch.iterate(Native Method)
>         at org.rocksdb.WriteBatch.iterate(WriteBatch.java:63)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
