[
https://issues.apache.org/jira/browse/HDDS-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Rose updated HDDS-6667:
-----------------------------
Affects Version/s: (was: 1.2.0)
> Recon can crash if processing a container report after installing an OM
> snapshot
> --------------------------------------------------------------------------------
>
> Key: HDDS-6667
> URL: https://issues.apache.org/jira/browse/HDDS-6667
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Recon
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Major
> Fix For: 1.2.0
>
>
> There are two threads that access Recon's RocksDB instance: one applies
> updates based on the OM DB state (ContainerKeyMapperTask), the other applies
> updates based on container reports (ReconContainerReportHandler). When
> ContainerKeyMapperTask is updating from a snapshot, it needs to account for
> keys that may have been deleted. The snapshot alone does not provide
> this information, so the task clears its existing container -> key
> mappings and rebuilds them from scratch. It does this by calling
> ContainerDBServiceProvider#initNewContainerDB, which deletes the whole Recon
> DB from the disk and creates a new one. This leads to the following problem:
> 1. ContainerKeyMapperTask#reprocess is called to do a snapshot based update
> from OM.
> 2. ContainerKeyMapperTask deletes and recreates the Recon DB.
> 3. Recon receives and processes a container report. When it needs to update
> the DB, it may use a stale handle to the old DB, or it may try to access the
> DB after it has been deleted but before the new one has been created.
> This scenario caused a RocksDB crash on Recon, as shown in the following dump.
> {code}
> C [librocksdbjni4235643658444878552.so+0x242ea2]
> Java_org_rocksdb_RocksDB_get__J_3BIIJ+0x62
> J 7320 org.rocksdb.RocksDB.get(J[BIIJ)[B (0 bytes) @ 0x00007f461e4ff36d
> [0x00007f461e4ff280+0xed]
> J 13283 C2
> org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(Ljava/lang/Object;)Ljava/lang/Object;
> (36 bytes) @ 0x00007f461f32b730 [0x00007f461f32b420+0x310]
> J 13545 C2
> org.apache.hadoop.ozone.recon.spi.impl.ContainerDBServiceProviderImpl.getContainerReplicaHistory(Ljava/lang/Long;)Ljava/util/Map;
> (90 bytes) @ 0x00007f461e8d77ac [0x00007f461e8d7440+0x36c]
> J 8503 C2
> org.apache.hadoop.ozone.recon.scm.ReconContainerManager.upsertContainerHistory(JLjava/util/UUID;JJ)V
> (111 bytes) @ 0x00007f461e8bf3c4 [0x00007f461e8bf2e0+0xe4]
> J 11064 C2
> org.apache.hadoop.ozone.recon.scm.ReconContainerManager.removeContainerReplica(Lorg/apache/hadoop/hdds/scm/container/ContainerID;Lorg/apache/hadoop/hdds/scm/container/ContainerReplica;)V
> (97 bytes) @ 0x00007f461ef420f8 [0x00007f461ef41a40+0x6b8]
> J 13568 C2
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processMissingReplicas(Lorg/apache/hadoop/hdds/protocol/DatanodeDetails;Ljava/util/Set;)V
> (93 bytes) @ 0x00007f461e6afa68 [0x00007f461e6aeec0+0xba8]
> J 16028 C2
> org.apache.hadoop.ozone.recon.scm.ReconContainerReportHandler.onMessage(Ljava/lang/Object;Lorg/apache/hadoop/hdds/server/events/EventPublisher;)V
> (10 bytes) @ 0x00007f461f936188 [0x00007f461f9348c0+0x18c8]
> J 13493 C2
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor$$Lambda$313.run()V
> (20 bytes) @ 0x00007f461e390f5c [0x00007f461e390ec0+0x9c]
> J 17137% C2
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
> (225 bytes) @ 0x00007f461fc093e4 [0x00007f461fc090e0+0x304]
> {code}
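The race described above can be illustrated with a small, self-contained sketch. It is not the Ozone code and not necessarily the fix applied for this issue: GuardedContainerStore and its methods are hypothetical names, and a ConcurrentHashMap stands in for the RocksDB handle. The idea is that every use of the handle takes a read lock, while the delete-and-recreate step takes the write lock, so a report handler can never observe a stale or missing handle.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical stand-in for the Recon container DB (not the Ozone API).
 * The map plays the role of the RocksDB handle.
 */
class GuardedContainerStore {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Replaced wholesale on reinitialize(), like deleting and recreating the DB.
  private Map<Long, Long> handle = new ConcurrentHashMap<>();

  /** Report-handler path: read through the current handle. */
  Long getReplicaCount(long containerId) {
    lock.readLock().lock();
    try {
      return handle.get(containerId);
    } finally {
      lock.readLock().unlock();
    }
  }

  /**
   * Report-handler path: write through the current handle. The read lock is
   * enough here because it only protects the handle reference; the map
   * itself tolerates concurrent writers.
   */
  void putReplicaCount(long containerId, long count) {
    lock.readLock().lock();
    try {
      handle.put(containerId, count);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Snapshot path: swap the handle while no reader or writer is using it. */
  void reinitialize() {
    lock.writeLock().lock();
    try {
      handle = new ConcurrentHashMap<>();
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    GuardedContainerStore store = new GuardedContainerStore();
    store.putReplicaCount(1L, 3L);
    System.out.println("before reinit: " + store.getReplicaCount(1L));
    store.reinitialize();
    System.out.println("after reinit: " + store.getReplicaCount(1L));
  }
}
```

An alternative that avoids the lock would be to clear only the container -> key mappings instead of deleting the whole DB, so that existing handles stay valid. Which approach fits Recon better is left open here.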