[ 
https://issues.apache.org/jira/browse/HDDS-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Rose updated HDDS-6667:
-----------------------------
    Affects Version/s:     (was: 1.2.0)

> Recon can crash if processing a container report after installing an OM 
> snapshot
> --------------------------------------------------------------------------------
>
>                 Key: HDDS-6667
>                 URL: https://issues.apache.org/jira/browse/HDDS-6667
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Recon
>            Reporter: Ethan Rose
>            Assignee: Ethan Rose
>            Priority: Major
>             Fix For: 1.2.0
>
>
> Two threads access Recon's RocksDB instance: one applies updates based on 
> the OM DB state (ContainerKeyMapperTask), and the other applies updates based 
> on container reports (ReconContainerReportHandler). When 
> ContainerKeyMapperTask updates from a snapshot, it must account for keys that 
> may have been deleted. The snapshot alone does not provide this information, 
> so the task clears its existing container -> key mappings and rebuilds them 
> from scratch. It does this by calling 
> ContainerDBServiceProvider#initNewContainerDB, which deletes the whole Recon 
> DB from disk and creates a new one. This leads to the current problem:
> 1. ContainerKeyMapperTask#reprocess is called to do a snapshot-based update 
> from OM.
> 2. ContainerKeyMapperTask deletes and recreates the Recon DB.
> 3. Recon receives and processes a container report. When it needs to update 
> the DB, it may use a stale handle from the old DB, or it may try to access 
> the DB between its deletion and its re-creation.
> This scenario caused a RocksDB crash on Recon, shown in this dump.
> {code}
> C  [librocksdbjni4235643658444878552.so+0x242ea2]  Java_org_rocksdb_RocksDB_get__J_3BIIJ+0x62
> J 7320  org.rocksdb.RocksDB.get(J[BIIJ)[B (0 bytes) @ 0x00007f461e4ff36d [0x00007f461e4ff280+0xed]
> J 13283 C2 org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(Ljava/lang/Object;)Ljava/lang/Object; (36 bytes) @ 0x00007f461f32b730 [0x00007f461f32b420+0x310]
> J 13545 C2 org.apache.hadoop.ozone.recon.spi.impl.ContainerDBServiceProviderImpl.getContainerReplicaHistory(Ljava/lang/Long;)Ljava/util/Map; (90 bytes) @ 0x00007f461e8d77ac [0x00007f461e8d7440+0x36c]
> J 8503 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerManager.upsertContainerHistory(JLjava/util/UUID;JJ)V (111 bytes) @ 0x00007f461e8bf3c4 [0x00007f461e8bf2e0+0xe4]
> J 11064 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerManager.removeContainerReplica(Lorg/apache/hadoop/hdds/scm/container/ContainerID;Lorg/apache/hadoop/hdds/scm/container/ContainerReplica;)V (97 bytes) @ 0x00007f461ef420f8 [0x00007f461ef41a40+0x6b8]
> J 13568 C2 org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processMissingReplicas(Lorg/apache/hadoop/hdds/protocol/DatanodeDetails;Ljava/util/Set;)V (93 bytes) @ 0x00007f461e6afa68 [0x00007f461e6aeec0+0xba8]
> J 16028 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerReportHandler.onMessage(Ljava/lang/Object;Lorg/apache/hadoop/hdds/server/events/EventPublisher;)V (10 bytes) @ 0x00007f461f936188 [0x00007f461f9348c0+0x18c8]
> J 13493 C2 org.apache.hadoop.hdds.server.events.SingleThreadExecutor$$Lambda$313.run()V (20 bytes) @ 0x00007f461e390f5c [0x00007f461e390ec0+0x9c]
> J 17137% C2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f461fc093e4 [0x00007f461fc090e0+0x304]
> {code}
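
The race in steps 1-3 can be illustrated with a minimal Java sketch. Everything here is hypothetical: FakeDb and GuardedDbHolder are stand-ins invented for illustration, not Ozone or RocksDB classes, and the read-write lock shown is only one common way to avoid this kind of race, not necessarily the fix adopted for this issue. The sketch assumes all DB access goes through a single holder, so readers take a shared lock and the re-initializing thread takes an exclusive lock while it closes the old handle and opens the new one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a RocksDB-backed store (not an Ozone class). */
class FakeDb {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    private volatile boolean closed = false;

    String get(String key) {
        // A stale handle fails here with an exception. A native store
        // may instead crash the whole JVM, as in the dump above.
        if (closed) {
            throw new IllegalStateException("DB handle is closed");
        }
        return data.get(key);
    }

    void put(String key, String value) {
        if (closed) {
            throw new IllegalStateException("DB handle is closed");
        }
        data.put(key, value);
    }

    void close() {
        closed = true;
    }
}

/** Hypothetical holder that only swaps the DB while no reader is active. */
class GuardedDbHolder {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private FakeDb db = new FakeDb();

    /** Used by the report-handling thread: shared lock, current handle. */
    String get(String key) {
        lock.readLock().lock();
        try {
            return db.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Shared lock is enough: many writers may use the same live handle. */
    void put(String key, String value) {
        lock.readLock().lock();
        try {
            db.put(key, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Used by the snapshot thread: exclusive lock while the DB is replaced. */
    void reinitialize() {
        lock.writeLock().lock();
        try {
            db.close();        // old handle becomes invalid
            db = new FakeDb(); // readers see only the new handle afterwards
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Because callers never keep their own reference to the underlying FakeDb, no thread can hold a stale handle across reinitialize(), and no thread can observe the window between the old DB being closed and the new one being created.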


