[ 
https://issues.apache.org/jira/browse/HDDS-7477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631961#comment-17631961
 ] 

Wei-Chiu Chuang commented on HDDS-7477:
---------------------------------------

I suspect the problem still exists with SchemaV3; it's just that there are far
fewer containers, and all of them fit in the cache, so the faulty code path
isn't exercised as often.
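
If the hypothesis is right, the failure mode is a cached native DB handle being closed (e.g. by cache eviction) while another thread still reads through it, leaving a stale JNI pointer — which would match the SIGSEGV inside librocksdbjni in the attached hs_err logs. A minimal plain-Java sketch of the reference-counting guard that prevents this (hypothetical class and method names, not the actual Ozone ContainerCache/ReferenceCountedDB code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a ref-counted wrapper around a native DB handle.
// The native close is deferred until the last reader releases, so a
// concurrent cache eviction cannot invalidate a handle still in use.
class RefCountedHandle implements AutoCloseable {
    private final AtomicInteger refs = new AtomicInteger(1); // cache's own ref
    private volatile boolean nativeClosed = false;           // stands in for the JNI handle

    // A reader must acquire before use; returns false if already fully closed.
    boolean acquire() {
        while (true) {
            int n = refs.get();
            if (n == 0) return false;                // handle already released
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    // Release one reference; the native handle is freed only at zero.
    void release() {
        if (refs.decrementAndGet() == 0) {
            nativeClosed = true;                     // would call the real RocksDB close() here
        }
    }

    @Override public void close() { release(); }     // cache-eviction path

    byte[] get(byte[] key) {
        if (nativeClosed) {
            // Without ref counting this would dereference a stale JNI
            // pointer and SIGSEGV in librocksdbjni instead of throwing.
            throw new IllegalStateException("use after close");
        }
        return key; // placeholder for the real RocksDB get
    }
}

public class Demo {
    public static void main(String[] args) {
        RefCountedHandle h = new RefCountedHandle();
        boolean acquired = h.acquire();      // reader takes a reference
        h.close();                           // cache evicts concurrently; not freed yet
        byte[] v = h.get(new byte[]{1});     // still safe: the reader's ref kept it open
        h.release();                         // now the native handle is actually freed
        System.out.println(acquired + " " + (v != null) + " " + !h.acquire());
        // → true true true
    }
}
```

Without the `acquire()` before the read (or with a code path that bypasses it), the eviction thread would free the handle first and the subsequent `get` would hit freed native memory — the shape of both attached crashes.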

> Possible SchemaV2 regression leading to DN crashes
> --------------------------------------------------
>
>                 Key: HDDS-7477
>                 URL: https://issues.apache.org/jira/browse/HDDS-7477
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Siyao Meng
>            Priority: Critical
>         Attachments: hs_err_pid17904.log, hs_err_pid18777.log
>
>
> While running freon against SchemaV2 DNs (based on a month-old master 
> branch plus some snapshot commits, but the issue is on the DN side, so it is 
> unrelated to snapshot features), I encountered DN crashes on two different 
> cluster deployments. This is with RocksDB JNI 7.4.5. It appears to be a DB 
> handle / column family handle issue.
> The freon command that triggers the crash after 10k~600k keys are generated:
> {code:title=freon command}
> ozone freon rk --num-of-volumes=2 --num-of-buckets=4 --num-of-keys=1000000 
> --num-of-threads=20 --key-size=1 --factor=THREE --type=RATIS --validate-writes
> {code}
> {code:title=DN2 crash}
> Thu Nov  3 02:50:10 UTC 2022: Starting Ozone Datanode...
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007fe000000001, pid=18777, tid=0x00007fe01b9a8700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # C  0x00007fe000000001
> #
> # Core dump written. Default location: 
> /run/cloudera-scm-agent/process/35-ozone-OZONE_DATANODE/core or core.18777
> #
> # An error report file with more information is saved as:
> # /run/cloudera-scm-agent/process/35-ozone-OZONE_DATANODE/hs_err_pid18777.log
> Compiled method (nm) 51318506 12809     n 0       org.rocksdb.RocksDB::get 
> (native)
>  total in heap  [0x00007fe045b92d90,0x00007fe045b93128] = 920
>  relocation     [0x00007fe045b92eb8,0x00007fe045b92f00] = 72
>  main code      [0x00007fe045b92f00,0x00007fe045b93120] = 544
>  oops           [0x00007fe045b93120,0x00007fe045b93128] = 8
> [thread 140600626910976 also had an error]
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {code}
> Crash seems to be coming from 
> {{replication.DownloadAndImportReplicator.importContainer}}:
> {code:title=hs_err_pid18777.log}
> Stack: [0x00007fe01b8a8000,0x00007fe01b9a9000],  sp=0x00007fe01b9a6cf8,  free 
> space=1019k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C  0x00007fe000000001
> C  [librocksdbjni8641814166209258633.so+0x3f8ede]  
> rocksdb::DBImplReadOnly::Get(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, 
> std::string*)+0x86e
> C  [librocksdbjni8641814166209258633.so+0x3f4a27]  
> rocksdb::DBImplReadOnly::Get(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> rocksdb::PinnableSlice*)+0x17
> C  [librocksdbjni8641814166209258633.so+0x2a78a6]  
> rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, 
> rocksdb::Slice const&, std::string*)+0x146
> C  [librocksdbjni8641814166209258633.so+0x29ed6a]  
> rocksdb_get_helper(JNIEnv_*, rocksdb::DB*, rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, _jbyteArray*, int, int)+0xda
> C  [librocksdbjni8641814166209258633.so+0x29efc2]  
> Java_org_rocksdb_RocksDB_get__J_3BIIJ+0x62
> J 12809  org.rocksdb.RocksDB.get(J[BIIJ)[B (0 bytes) @ 0x00007fe045b92fcd 
> [0x00007fe045b92f00+0xcd]
> J 18274 C2 
> org.apache.hadoop.hdds.utils.db.TypedTable.get(Ljava/lang/Object;)Ljava/lang/Object;
>  (65 bytes) @ 0x00007fe046accd54 [0x00007fe046acc960+0x3f4]
> J 18350 C2 
> org.apache.hadoop.ozone.container.metadata.DatanodeTable.get(Ljava/lang/Object;)Ljava/lang/Object;
>  (11 bytes) @ 0x00007fe04734d7e8 [0x00007fe04734d7a0+0x48]
> j  
> org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerUtil.populateContainerMetadata(Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainerData;Lorg/apache/hadoop/ozone/container/metadata/DatanodeStore;)V+14
> j  
> org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerUtil.parseKVContainerData(Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainerData;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)V+232
> j  
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.importContainerData(Ljava/io/InputStream;Lorg/apache/hadoop/ozone/container/common/interfaces/ContainerPacker;)V+187
> j  
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.importContainer(Lorg/apache/hadoop/ozone/container/common/impl/ContainerData;Ljava/io/InputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)Lorg/apache/hadoop/ozone/container/common/interfaces/Container;+48
> j  
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.importContainer(Lorg/apache/hadoop/ozone/container/common/impl/ContainerData;Ljava/io/InputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)Lorg/apache/hadoop/ozone/container/common/interfaces/Container;+19
> j  
> org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator.importContainer(JLjava/nio/file/Path;)V+153
> j  
> org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator.replicate(Lorg/apache/hadoop/ozone/container/replication/ReplicationTask;)V+92
> j  
> org.apache.hadoop.ozone.container.replication.MeasuredReplicator.replicate(Lorg/apache/hadoop/ozone/container/replication/ReplicationTask;)V+33
> j  
> org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run()V+149
> J 18361 C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x00007fe045724880 [0x00007fe0457246a0+0x1e0]
> J 12588 C1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 
> 0x00007fe044e9e404 [0x00007fe044e9e300+0x104]
> {code}
> DN3 is crashing in {{GrpcReplicationService.download}} / 
> {{ContainerController.exportContainer}}:
> {code:title=hs_err_pid17904.log}
> Stack: [0x00007face3e61000,0x00007face3f62000],  sp=0x00007face3f5fe48,  free 
> space=1019k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C  0x00007fad0901d5f0
> j  org.rocksdb.ColumnFamilyHandle.getName(J)[B+0
> j  org.rocksdb.ColumnFamilyHandle.getName()[B+33
> j  
> org.apache.hadoop.hdds.utils.db.RocksDatabase$ColumnFamily.<init>(Lorg/rocksdb/ColumnFamilyHandle;)V+6
> j  
> org.apache.hadoop.hdds.utils.db.RocksDatabase.open(Ljava/io/File;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedDBOptions;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedWriteOptions;Ljava/util/Set;Z)Lorg/apache/hadoop/hdds/utils/db/RocksDatabase;+147
> j  
> org.apache.hadoop.hdds.utils.db.RDBStore.<init>(Ljava/io/File;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedDBOptions;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedWriteOptions;Ljava/util/Set;Lorg/apache/hadoop/hdds/utils/db/CodecRegistry;ZILjava/lang/String;Z)V+131
> j  
> org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build()Lorg/apache/hadoop/hdds/utils/db/DBStore;+134
> j  
> org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.start(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)V+304
> j  
> org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;Lorg/apache/hadoop/ozone/container/metadata/AbstractDatanodeDBDefinition;Z)V+47
> j  
> org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaTwoImpl.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;Ljava/lang/String;Z)V+12
> j  
> org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getUncachedDatanodeStore(Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;Z)Lorg/apache/hadoop/ozone/container/metadata/DatanodeStore;+40
> j  
> org.apache.hadoop.ozone.container.common.utils.ContainerCache.getDB(JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/utils/ReferenceCountedDB;+173
> J 17809 C2 
> org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getDB(Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainerData;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/interfaces/DBHandle;
>  (129 bytes) @ 0x00007fad0fc43940 [0x00007fad0fc43760+0x1e0]
> j  org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.compactDB()V+8
> j  
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.exportContainerData(Ljava/io/OutputStream;Lorg/apache/hadoop/ozone/container/common/interfaces/ContainerPacker;)V+84
> j  
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.exportContainer(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Ljava/io/OutputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)V+10
> j  
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.exportContainer(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerType;JLjava/io/OutputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)V+25
> j  
> org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(JLjava/io/OutputStream;)V+67
> j  
> org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$CopyContainerRequestProto;Lorg/apache/ratis/thirdparty/io/grpc/stub/StreamObserver;)V+39
> j  
> org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(Ljava/lang/Object;Lorg/apache/ratis/thirdparty/io/grpc/stub/StreamObserver;)V+33
> j  
> org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose()V+53
> J 18253 C2 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext()V
>  (73 bytes) @ 0x00007fad0e728d98 [0x00007fad0e728a40+0x358]
> J 13760 C2 
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run()V (35 
> bytes) @ 0x00007fad0f005f64 [0x00007fad0f005820+0x744]
> J 16944 C2 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run()V (114 
> bytes) @ 0x00007fad0fa953f0 [0x00007fad0fa95300+0xf0]
> J 18310 C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x00007fad0f5da554 [0x00007fad0f5da320+0x234]
> J 12800 C1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 
> 0x00007fad0e3a9684 [0x00007fad0e3a9580+0x104]
> J 12653 C1 java.lang.Thread.run()V (17 bytes) @ 0x00007fad0db96fc4 
> [0x00007fad0db96e80+0x144]
> {code}
> Note that once I switched to SchemaV3, I no longer see the crash. Over 10 
> million keys were generated correctly.
> The 1.3.0 branch could be affected as well, hence setting the affected 
> version to 1.3.0. Pending investigation.
> cc [~captainzmc] [~erose] [~ritesh] [~weichiu] [[email protected]] 
> [~duongnguyen]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
