[ 
https://issues.apache.org/jira/browse/HDDS-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707181#comment-17707181
 ] 

Sadanand Shenoy edited comment on HDDS-8271 at 3/31/23 8:13 AM:
----------------------------------------------------------------

I tried to reproduce this by flushing the DB before calling getUpdatesSince. The 
result was that the WAL file got cleaned up and getUpdatesSince returned 0 
updates instead of the expected value. This behaviour matches the javadoc for 
the API, which states that a high WAL retention time must be set for the API to 
return a valid iterator. However, in no case could I reproduce the crash.
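For reference, the repro attempt above can be sketched with the RocksDB Java API. This is a minimal, hypothetical sketch (the DB path, key, and TTL value are arbitrary, not from the original report) and requires rocksdbjni on the classpath; with a WAL TTL of 0 the flushed WAL file is deleted immediately and getUpdatesSince sees nothing, while a high TTL keeps it readable:
{code:java}
import org.rocksdb.FlushOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.TransactionLogIterator;

public class WalUpdatesRepro {

    /** Counts write batches visible via getUpdatesSince after a flush. */
    static int countUpdatesAfterFlush(String dbPath, long walTtlSeconds)
            throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                 .setCreateIfMissing(true)
                 // High WAL retention: flushed WAL files are archived, not
                 // deleted, so getUpdatesSince can still read them.
                 .setWalTtlSeconds(walTtlSeconds);
             RocksDB db = RocksDB.open(options, dbPath)) {

            long startSeq = db.getLatestSequenceNumber();
            db.put("key".getBytes(), "value".getBytes());

            // Flushing moves data out of the memtable; RocksDB may then archive
            // or delete the WAL file, depending on the retention settings.
            try (FlushOptions fo = new FlushOptions().setWaitForFlush(true)) {
                db.flush(fo);
            }

            int updates = 0;
            try (TransactionLogIterator it = db.getUpdatesSince(startSeq)) {
                while (it.isValid()) {
                    updates++;
                    it.next();
                }
            }
            return updates;
        }
    }

    public static void main(String[] args) throws RocksDBException {
        System.out.println("updates seen: "
            + countUpdatesAfterFlush("/tmp/hdds8271-wal-repro", 3600));
    }
}
{code}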

Also, the crash occurs when calling getLatestSequenceNumber in the RocksDB code, 
not in getUpdatesSince:
{code:java}
Stack: [0x00007fb6b8883000,0x00007fb6b8984000],  sp=0x00007fb6b89817f8,  free 
space=1017k
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.rocksdb.RocksDB.getLatestSequenceNumber(J)J+0
j  org.rocksdb.RocksDB.getLatestSequenceNumber()J+5
j  org.apache.hadoop.hdds.utils.db.RocksDatabase.getLatestSequenceNumber()J+18 
{code}
It looks like a race condition in the RocksDB code between WAL deletion and 
reading from the WAL. I see a similar [issue reported 
here|https://github.com/facebook/rocksdb/pull/8573] for getSortedWAL.
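One way to see why such a race crashes the JVM rather than throwing: a closed-check and the subsequent JNI call are two separate steps, so a concurrent close can free the native handle in between, and the native call then touches freed memory. A toy sketch (hypothetical GuardedHandle class, not actual Ozone or RocksDB code) of making the check and the call atomic with a read-write lock:
{code:java}
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Toy model of a native-handle wrapper: readers hold the read lock around the
 * closed-check AND the native call, so close() cannot free the handle mid-call.
 */
class GuardedHandle implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean closed = false;

    long getLatestSequenceNumber() throws IOException {
        lock.readLock().lock();
        try {
            if (closed) {
                // Fail with an exception instead of a native crash.
                throw new IOException("Rocks Database is closed");
            }
            return 42L; // stand-in for the JNI call into native RocksDB
        } finally {
            lock.readLock().unlock();
        }
    }

    @Override
    public void close() {
        lock.writeLock().lock();
        try {
            closed = true; // safe: no reader can be inside the "native" call now
        } finally {
            lock.writeLock().unlock();
        }
    }
}
{code}
With only an unsynchronized boolean check, a close() landing between the check and the JNI call would leave the reader dereferencing freed native memory, which is consistent with the hs_err-style crash in the report rather than the IOException path.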



> Ozone Manager crashed caused by rocksdb closed
> ----------------------------------------------
>
>                 Key: HDDS-8271
>                 URL: https://issues.apache.org/jira/browse/HDDS-8271
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM
>    Affects Versions: 1.3.0
>            Reporter: Hongbing Wang
>            Assignee: Sadanand Shenoy
>            Priority: Critical
>
> An OM crashed, its log has:
> {noformat}
> 2023-03-23 20:32:06,079 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: Rolled log segment from 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61913753
>  to 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_61913753-61920134
> 2023-03-23 20:32:06,087 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: created new log segment 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61920135
> 2023-03-23 20:32:13,820 [om3-client-thread180] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: Rolling segment 
> log-61920135_61926451 to index:61926451
> 2023-03-23 20:32:13,821 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: Rolled log segment from 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61920135
>  to 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_61920135-61926451
> 2023-03-23 20:32:13,827 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: created new log segment 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61926452
> 2023-03-23 20:32:15,699 [Finalizer] WARN 
> org.apache.hadoop.hdds.utils.db.managed.ManagedRocksObjectUtils: 
> RocksIterator is not closed properly
> 2023-03-23 20:32:15,699 [Finalizer] WARN 
> org.apache.hadoop.hdds.utils.db.managed.ManagedRocksObjectUtils: 
> RocksIterator is not closed properly
> 2023-03-23 20:32:21,281 [om3-client-thread183] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: Rolling segment 
> log-61926452_61932758 to index:61932758
> 2023-03-23 20:32:21,281 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: Rolled log segment from 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61926452
>  to 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_61926452-61932758
> 2023-03-23 20:32:21,288 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: created new log segment 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61932759
> 2023-03-23 20:32:28,387 [om3-client-thread160] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: Rolling segment 
> log-61932759_61939170 to index:61939170
> 2023-03-23 20:32:28,388 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: Rolled log segment from 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61932759
>  to 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_61932759-61939170
> 2023-03-23 20:32:28,395 [om3@group-197E298202B9-SegmentedRaftLogWorker] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om3@group-197E298202B9-SegmentedRaftLogWorker: created new log segment 
> /data/ozonedata/om/ratis/e961d78e-fe7e-32f2-a53f-197e298202b9/current/log_inprogress_61939171
> 2023-03-23 20:32:29,003 [OM StateMachine ApplyTransaction Thread - 0] ERROR 
> org.apache.hadoop.ozone.om.KeyManagerImpl: CheckAccess operation failed for 
> key:vol-XXX/prod-XXX/FILE_2022-03-28-13-38-36_10003/_2022-03-28-13-38-45_0.FILE
> java.io.IOException: Rocks Database is closed
>       at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.assertClose(RocksDatabase.java:395)
>       at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.get(RocksDatabase.java:517)
>       at org.apache.hadoop.hdds.utils.db.RDBTable.get(RDBTable.java:110)
>       at org.apache.hadoop.hdds.utils.db.RDBTable.get(RDBTable.java:40)
>       at 
> org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(TypedTable.java:255)
>       at org.apache.hadoop.hdds.utils.db.TypedTable.get(TypedTable.java:181)
>       at 
> org.apache.hadoop.ozone.om.KeyManagerImpl.checkAccess(KeyManagerImpl.java:1037)
>       at 
> org.apache.hadoop.ozone.security.acl.OzoneNativeAuthorizer.checkAccess(OzoneNativeAuthorizer.java:164)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.checkAcls(OzoneManager.java:2584)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.checkAcls(OzoneManager.java:2570)
>       at 
> org.apache.hadoop.ozone.om.OzoneAclUtils.checkAllAcls(OzoneAclUtils.java:128)
>       at 
> org.apache.hadoop.ozone.om.request.OMClientRequest.checkAcls(OMClientRequest.java:360)
>       at 
> org.apache.hadoop.ozone.om.request.OMClientRequest.checkAcls(OMClientRequest.java:215)
>       at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.checkKeyAcls(OMKeyRequest.java:339)
>       at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:138)
>       at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:533)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:324)
>       at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> 2023-03-23 20:32:29,004 [OM StateMachine ApplyTransaction Thread - 0] ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for 
> Key:FILE_2022-03-28-13-38-36_10003/_2022-03-28-13-38-45_0.FILE in 
> Volume/Bucket vol-XXX/prod-XXX
> INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Check 
> access operation failed for 
> key:FILE_2022-03-28-13-38-36_10003/_2022-03-28-13-38-45_0.FILE
>       at 
> org.apache.hadoop.ozone.om.KeyManagerImpl.checkAccess(KeyManagerImpl.java:1083)
>       at 
> org.apache.hadoop.ozone.security.acl.OzoneNativeAuthorizer.checkAccess(OzoneNativeAuthorizer.java:164)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.checkAcls(OzoneManager.java:2584)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.checkAcls(OzoneManager.java:2570)
>       at 
> org.apache.hadoop.ozone.om.OzoneAclUtils.checkAllAcls(OzoneAclUtils.java:128)
>       at 
> org.apache.hadoop.ozone.om.request.OMClientRequest.checkAcls(OMClientRequest.java:360)
>       at 
> org.apache.hadoop.ozone.om.request.OMClientRequest.checkAcls(OMClientRequest.java:215)
>       at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.checkKeyAcls(OMKeyRequest.java:339)
>       at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:138)
>       at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:533)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:324)
>       at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Rocks Database is closed
>       at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.assertClose(RocksDatabase.java:395)
>       at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.get(RocksDatabase.java:517)
>       at org.apache.hadoop.hdds.utils.db.RDBTable.get(RDBTable.java:110)
>       at org.apache.hadoop.hdds.utils.db.RDBTable.get(RDBTable.java:40)
>       at 
> org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(TypedTable.java:255)
>       at org.apache.hadoop.hdds.utils.db.TypedTable.get(TypedTable.java:181)
>       at 
> org.apache.hadoop.ozone.om.KeyManagerImpl.checkAccess(KeyManagerImpl.java:1037)
>       ... 15 more
> 2023-03-23 20:32:29,012 [OMDoubleBufferFlushThread] ERROR 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with 
> exit status 1: During flush to DB encountered error in OMDoubleBuffer flush 
> thread OMDoubleBufferFlushThread
> java.io.IOException: Rocks Database is closed
>       at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase$ColumnFamily.assertClosed(RocksDatabase.java:310)
>       at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase$ColumnFamily.batchPut(RocksDatabase.java:293)
>       at 
> org.apache.hadoop.hdds.utils.db.RDBBatchOperation.put(RDBBatchOperation.java:62)
>       at 
> org.apache.hadoop.hdds.utils.db.RDBTable.putWithBatch(RDBTable.java:77)
>       at 
> org.apache.hadoop.hdds.utils.db.RDBTable.putWithBatch(RDBTable.java:40)
>       at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:132)
>       at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:149)
>       at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.checkAndUpdateDB(S3MultipartUploadCommitPartResponse.java:117)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$2(OzoneManagerDoubleBuffer.java:283)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:228)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$1(OzoneManagerDoubleBuffer.java:281)
>       at java.util.Iterator.forEachRemaining(Iterator.java:116)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:277)
>       at java.lang.Thread.run(Thread.java:745)
> 2023-03-23 20:32:29,016 [shutdown-hook-0] INFO 
> org.apache.hadoop.ozone.om.OzoneManagerStarter: SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down OzoneManager at bigdata-hostxxx.com/10.xx.xx.xx
> ************************************************************/
> {noformat}
> more OM log see: [OM crashed detail log 
> |https://github.com/whbing/issue_logs/blob/main/ozone/om/om_crash.20230323.log]
> rocksdb log: LOG.old.1679574802031612
> {noformat}
> 2023/03/23-20:32:26.278546 140423117969152 [/db_impl/db_impl_files.cc:445] 
> [JOB 4050] Try to delete WAL files size 42545044203, prev total WAL file size 
> 42958842785, number of live WAL files 326.
> ...
> 2023/03/23-20:32:28.853876 140423224928000 [/version_set.cc:3800] More 
> existing levels in DB than needed. max_bytes_for_level_multiplier may not be 
> guaranteed.
> ...
> 2023/03/23-20:32:28.854153 140423224928000 (Original Log Time 
> 2023/03/23-20:32:28.854112) [/db_impl/db_impl_compaction_flush.cc:284] 
> [multipartInfoTable] Level summary: base level 6 level multiplier 10.00 max 
> bytes base 268435456 files[2 0 0 0 0 0 3] max score 0.50
> ...
> 2023/03/23-20:32:28.998765 140423117969152 [le/delete_scheduler.cc:77] 
> Deleted file /data/ozonedata/om/db/om.db/098324.log immediately, 
> rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
> 2023/03/23-20:32:29.009957 140422757758720 [/db_impl/db_impl.cc:480] 
> Shutdown: canceling all background work
> 2023/03/23-20:32:29.020213 140423117969152 [le/delete_scheduler.cc:77] 
> Deleted file /data/ozonedata/om/db/om.db/098322.log immediately, 
> rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
> ...
> 2023/03/23-20:32:32.829353 140423117969152 [le/delete_scheduler.cc:77] 
> Deleted file /data/ozonedata/om/db/om.db/097921.log immediately, 
> rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
> 2023/03/23-20:32:32.883735 140422757758720 [/db_impl/db_impl.cc:710] Shutdown 
> complete
> {noformat}
> more see : [detail rocks 
> log|https://github.com/whbing/issue_logs/blob/main/ozone/om/LOG.old.1679574802031612]
> core file: hs_err_pid1017701.log :
> {noformat}
> Stack: [0x00007fb6b8883000,0x00007fb6b8984000],  sp=0x00007fb6b89817f8,  free 
> space=1017k
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j  org.rocksdb.RocksDB.getLatestSequenceNumber(J)J+0
> j  org.rocksdb.RocksDB.getLatestSequenceNumber()J+5
> j  org.apache.hadoop.hdds.utils.db.RocksDatabase.getLatestSequenceNumber()J+18
> j  
> org.apache.hadoop.hdds.utils.db.RDBStore.getUpdatesSince(JJ)Lorg/apache/hadoop/hdds/utils/db/DBUpdatesWrapper;+395
> j  
> org.apache.hadoop.ozone.om.OzoneManager.getDBUpdates(Lorg/apache/hadoop/ozone/protocol/proto/OzoneManagerProtocolProtos$DBUpdatesRequest;)Lorg/apache/hadoop/ozone/om/helpers/DBUpdates;+30
> j  
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOMDBUpdates(Lorg/apache/hadoop/ozone/protocol/proto/OzoneManagerProtocolProtos$DBUpdatesRequest;)Lorg/apache/hadoop/ozone/p
> rotocol/proto/OzoneManagerProtocolProtos$DBUpdatesResponse;+9
> J 19759 C2 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(Lorg/apache/hadoop/ozone/protocol/proto/OzoneManagerProtocolProtos$OMRequest;)Lorg/apache/hadoop/ozo
> ne/protocol/proto/OzoneManagerProtocolProtos$OMResponse; (840 bytes) @ 
> 0x00007fc99793523c [0x00007fc997933ec0+0x137c]
> J 19799 C2 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/ozone/protocol/proto/OzoneManagerP
> rotocolProtos$OMRequest;)Lorg/apache/hadoop/ozone/protocol/proto/OzoneManagerProtocolProtos$OMResponse;
>  (81 bytes) @ 0x00007fc997dec564 [0x00007fc997deb460+0x1104]
> J 17937 C2 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(Lcom/google/protobuf/Descriptors$MethodDescriptor;Lcom/google/protobuf/Rp
> cController;Lcom/google/protobuf/Message;)Lcom/google/protobuf/Message; (69 
> bytes) @ 0x00007fc995ec62a0 [0x00007fc995ec6140+0x160]
> J 17503 C2 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Lorg/apache/hadoop/ipc/RPC$Server;Ljava/lang/String;Lorg/apache/hadoop/io/Writable;J)Lorg/apache/hadoop/io/Wr
> itable; (830 bytes) @ 0x00007fc99714df78 [0x00007fc99714c9c0+0x15b8]
> J 17497 C2 org.apache.hadoop.ipc.Server$RpcCall.run()Ljava/lang/Object; (5 
> bytes) @ 0x00007fc997237f70 [0x00007fc997237c60+0x310]
> v  ~StubRoutines::call_stub
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
