Siyao Meng created HDDS-7477:
--------------------------------

             Summary: Possible SchemaV2 regression leading to DN crashes
                 Key: HDDS-7477
                 URL: https://issues.apache.org/jira/browse/HDDS-7477
             Project: Apache Ozone
          Issue Type: Bug
    Affects Versions: 1.3.0
            Reporter: Siyao Meng
         Attachments: hs_err_pid17904.log, hs_err_pid18777.log

While running freon against SchemaV2 DNs (built from a 1-month-old master branch plus some snapshot commits; the issue is on the DN side, so it is unrelated to the snapshot features), I hit DN crashes on two different cluster deployments. This is with RocksDB JNI 7.4.5. It appears to be a DB handle / column family handle issue.
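
For illustration only (a hypothetical standalone reproducer of the suspected failure class, not Ozone code; the path and column family name are made up): using a {{ColumnFamilyHandle}} after the owning {{RocksDB}} has been closed makes the JNI layer dereference a dangling native pointer, which typically shows up as a SIGSEGV inside librocksdbjni, as in the crash logs below.

{code:title=Hypothetical use-after-close sketch (not Ozone code)}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class UseAfterCloseSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    List<ColumnFamilyDescriptor> cfds = Arrays.asList(
        new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
        new ColumnFamilyDescriptor("block_data".getBytes()));  // made-up CF name
    List<ColumnFamilyHandle> handles = new ArrayList<>();
    try (DBOptions opts = new DBOptions()
        .setCreateIfMissing(true)
        .setCreateMissingColumnFamilies(true)) {
      RocksDB db = RocksDB.open(opts, "/tmp/cf-crash-demo", cfds, handles);
      ColumnFamilyHandle blockData = handles.get(1);
      db.close();  // the native DB object behind the handles is destroyed here
      // Any JNI call that still dereferences the stale pointer may segfault
      // (or throw, depending on RocksDB version and assertion settings):
      db.get(blockData, "someKey".getBytes());
    }
  }
}
{code}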

The following freon command triggers the crash after roughly 10k~600k keys have been generated:

{code:title=freon command}
ozone freon rk --num-of-volumes=2 --num-of-buckets=4 --num-of-keys=1000000 
--num-of-threads=20 --key-size=1 --factor=THREE --type=RATIS --validate-writes
{code}

{code:title=DN2 crash}
Thu Nov  3 02:50:10 UTC 2022: Starting Ozone Datanode...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe000000001, pid=18777, tid=0x00007fe01b9a8700
#
# JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# C  0x00007fe000000001
#
# Core dump written. Default location: 
/run/cloudera-scm-agent/process/35-ozone-OZONE_DATANODE/core or core.18777
#
# An error report file with more information is saved as:
# /run/cloudera-scm-agent/process/35-ozone-OZONE_DATANODE/hs_err_pid18777.log
Compiled method (nm) 51318506 12809     n 0       org.rocksdb.RocksDB::get 
(native)
 total in heap  [0x00007fe045b92d90,0x00007fe045b93128] = 920
 relocation     [0x00007fe045b92eb8,0x00007fe045b92f00] = 72
 main code      [0x00007fe045b92f00,0x00007fe045b93120] = 544
 oops           [0x00007fe045b93120,0x00007fe045b93128] = 8
[thread 140600626910976 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
{code}

The DN2 crash appears to originate from 
{{replication.DownloadAndImportReplicator.importContainer}}:

{code:title=hs_err_pid18777.log}
Stack: [0x00007fe01b8a8000,0x00007fe01b9a9000],  sp=0x00007fe01b9a6cf8,  free 
space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x00007fe000000001
C  [librocksdbjni8641814166209258633.so+0x3f8ede]  
rocksdb::DBImplReadOnly::Get(rocksdb::ReadOptions const&, 
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, 
std::string*)+0x86e
C  [librocksdbjni8641814166209258633.so+0x3f4a27]  
rocksdb::DBImplReadOnly::Get(rocksdb::ReadOptions const&, 
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
rocksdb::PinnableSlice*)+0x17
C  [librocksdbjni8641814166209258633.so+0x2a78a6]  
rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, 
rocksdb::Slice const&, std::string*)+0x146
C  [librocksdbjni8641814166209258633.so+0x29ed6a]  rocksdb_get_helper(JNIEnv_*, 
rocksdb::DB*, rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, 
_jbyteArray*, int, int)+0xda
C  [librocksdbjni8641814166209258633.so+0x29efc2]  
Java_org_rocksdb_RocksDB_get__J_3BIIJ+0x62
J 12809  org.rocksdb.RocksDB.get(J[BIIJ)[B (0 bytes) @ 0x00007fe045b92fcd 
[0x00007fe045b92f00+0xcd]
J 18274 C2 
org.apache.hadoop.hdds.utils.db.TypedTable.get(Ljava/lang/Object;)Ljava/lang/Object;
 (65 bytes) @ 0x00007fe046accd54 [0x00007fe046acc960+0x3f4]
J 18350 C2 
org.apache.hadoop.ozone.container.metadata.DatanodeTable.get(Ljava/lang/Object;)Ljava/lang/Object;
 (11 bytes) @ 0x00007fe04734d7e8 [0x00007fe04734d7a0+0x48]
j  
org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerUtil.populateContainerMetadata(Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainerData;Lorg/apache/hadoop/ozone/container/metadata/DatanodeStore;)V+14
j  
org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerUtil.parseKVContainerData(Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainerData;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)V+232
j  
org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.importContainerData(Ljava/io/InputStream;Lorg/apache/hadoop/ozone/container/common/interfaces/ContainerPacker;)V+187
j  
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.importContainer(Lorg/apache/hadoop/ozone/container/common/impl/ContainerData;Ljava/io/InputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)Lorg/apache/hadoop/ozone/container/common/interfaces/Container;+48
j  
org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.importContainer(Lorg/apache/hadoop/ozone/container/common/impl/ContainerData;Ljava/io/InputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)Lorg/apache/hadoop/ozone/container/common/interfaces/Container;+19
j  
org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator.importContainer(JLjava/nio/file/Path;)V+153
j  
org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator.replicate(Lorg/apache/hadoop/ozone/container/replication/ReplicationTask;)V+92
j  
org.apache.hadoop.ozone.container.replication.MeasuredReplicator.replicate(Lorg/apache/hadoop/ozone/container/replication/ReplicationTask;)V+33
j  
org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run()V+149
J 18361 C2 
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
 (225 bytes) @ 0x00007fe045724880 [0x00007fe0457246a0+0x1e0]
J 12588 C1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 
0x00007fe044e9e404 [0x00007fe044e9e300+0x104]
{code}


DN3 is crashing in {{GrpcReplicationService.download}} / 
{{ContainerController.exportContainer}}:

{code:title=hs_err_pid17904.log}
Stack: [0x00007face3e61000,0x00007face3f62000],  sp=0x00007face3f5fe48,  free 
space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x00007fad0901d5f0
j  org.rocksdb.ColumnFamilyHandle.getName(J)[B+0
j  org.rocksdb.ColumnFamilyHandle.getName()[B+33
j  
org.apache.hadoop.hdds.utils.db.RocksDatabase$ColumnFamily.<init>(Lorg/rocksdb/ColumnFamilyHandle;)V+6
j  
org.apache.hadoop.hdds.utils.db.RocksDatabase.open(Ljava/io/File;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedDBOptions;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedWriteOptions;Ljava/util/Set;Z)Lorg/apache/hadoop/hdds/utils/db/RocksDatabase;+147
j  
org.apache.hadoop.hdds.utils.db.RDBStore.<init>(Ljava/io/File;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedDBOptions;Lorg/apache/hadoop/hdds/utils/db/managed/ManagedWriteOptions;Ljava/util/Set;Lorg/apache/hadoop/hdds/utils/db/CodecRegistry;ZILjava/lang/String;Z)V+131
j  
org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build()Lorg/apache/hadoop/hdds/utils/db/DBStore;+134
j  
org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.start(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)V+304
j  
org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;Lorg/apache/hadoop/ozone/container/metadata/AbstractDatanodeDBDefinition;Z)V+47
j  
org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaTwoImpl.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;Ljava/lang/String;Z)V+12
j  
org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getUncachedDatanodeStore(Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;Z)Lorg/apache/hadoop/ozone/container/metadata/DatanodeStore;+40
j  
org.apache.hadoop.ozone.container.common.utils.ContainerCache.getDB(JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/utils/ReferenceCountedDB;+173
J 17809 C2 
org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getDB(Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainerData;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/interfaces/DBHandle;
 (129 bytes) @ 0x00007fad0fc43940 [0x00007fad0fc43760+0x1e0]
j  org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.compactDB()V+8
j  
org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.exportContainerData(Ljava/io/OutputStream;Lorg/apache/hadoop/ozone/container/common/interfaces/ContainerPacker;)V+84
j  
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.exportContainer(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Ljava/io/OutputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)V+10
j  
org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.exportContainer(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerType;JLjava/io/OutputStream;Lorg/apache/hadoop/ozone/container/keyvalue/TarContainerPacker;)V+25
j  
org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(JLjava/io/OutputStream;)V+67
j  
org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$CopyContainerRequestProto;Lorg/apache/ratis/thirdparty/io/grpc/stub/StreamObserver;)V+39
j  
org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(Ljava/lang/Object;Lorg/apache/ratis/thirdparty/io/grpc/stub/StreamObserver;)V+33
j  
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose()V+53
J 18253 C2 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext()V
 (73 bytes) @ 0x00007fad0e728d98 [0x00007fad0e728a40+0x358]
J 13760 C2 org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run()V 
(35 bytes) @ 0x00007fad0f005f64 [0x00007fad0f005820+0x744]
J 16944 C2 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run()V (114 
bytes) @ 0x00007fad0fa953f0 [0x00007fad0fa95300+0xf0]
J 18310 C2 
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
 (225 bytes) @ 0x00007fad0f5da554 [0x00007fad0f5da320+0x234]
J 12800 C1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 
0x00007fad0e3a9684 [0x00007fad0e3a9580+0x104]
J 12653 C1 java.lang.Thread.run()V (17 bytes) @ 0x00007fad0db96fc4 
[0x00007fad0db96e80+0x144]
{code}
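
Both stacks reach the datanode RocksDB through the container DB cache ({{ContainerCache.getDB}} / {{ReferenceCountedDB}} in the DN3 trace), so one plausible shape of the bug is a cached store being closed while another thread still holds its handles. As a rough sketch only (hypothetical class names, not the actual Ozone implementation), the reference-counting discipline that has to hold for such a shared DB looks like this:

{code:title=Hypothetical reference-counting sketch (not the Ozone implementation)}
import java.util.concurrent.atomic.AtomicInteger;

import org.rocksdb.RocksDB;

/** Closes the RocksDB only after the last user has released it. */
final class CountedDb {
  private final RocksDB db;
  // Starts at 1 for the cache's own reference; dropped on eviction.
  private final AtomicInteger refs = new AtomicInteger(1);

  CountedDb(RocksDB db) {
    this.db = db;
  }

  /** A reader must acquire before touching any ColumnFamilyHandle of this DB. */
  CountedDb acquire() {
    if (refs.getAndIncrement() <= 0) {
      refs.decrementAndGet();
      throw new IllegalStateException("DB already closed");
    }
    return this;
  }

  /** Called by readers when done, and by the cache on eviction. */
  void release() {
    if (refs.decrementAndGet() == 0) {
      db.close();  // safe: no thread can still be using this DB's handles
    }
  }

  RocksDB get() {
    return db;
  }
}
{code}

If an eviction or close path shuts the store down without waiting for the count to drop, a concurrent import/export thread is left with exactly the kind of dangling handle seen in the native frames above.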

Note that once I switched to SchemaV3, I no longer see the crash. Over 10 million keys were correctly generated.

The 1.3.0 branch could be affected as well, hence setting the affected version to 1.3.0. Pending investigation.

cc [~captainzmc] [~erose] [~ritesh] [~weichiu] [[email protected]] 
[~duongnguyen]


