[https://issues.apache.org/jira/browse/HDDS-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
Ethan Rose updated HDDS-6667:
-----------------------------
Attachment: hs_err_pid46101.log.txt
> Recon can crash if processing a container report after installing an OM
> snapshot
> --------------------------------------------------------------------------------
>
> Key: HDDS-6667
> URL: https://issues.apache.org/jira/browse/HDDS-6667
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Recon
> Affects Versions: 1.1.0, 1.2.0, 1.2.1
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Major
> Attachments: hs_err_pid46101.log.txt
>
>
> There are two threads that access Recon's RocksDB instance: one applies
> updates based on the OM DB state (ContainerKeyMapperTask), and the other
> applies updates based on container reports (ReconContainerReportHandler).
> When ContainerKeyMapperTask updates from a snapshot, it must account for keys
> that may have been deleted. The snapshot alone does not provide this
> information, so the task clears its existing container -> key mappings and
> rebuilds them from scratch. It does this by calling
> ContainerDBServiceProvider#initNewContainerDB, which deletes the whole Recon
> DB from disk and creates a new one. This leads to the following race:
> 1. ContainerKeyMapperTask#reprocess is called to do a snapshot based update
> from OM.
> 2. ContainerKeyMapperTask deletes and recreates the Recon DB.
> 3. Recon receives and processes a container report. When it updates the DB,
> it may use a stale handle to the old DB, or it may try to access the DB
> after it has been deleted but before the new one has been created.
> This scenario caused a RocksDB crash on Recon, shown in the attached crash
> dump.
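A minimal Java sketch of one way to close this race, assuming a single read-write lock guards the DB handle: the snapshot path swaps the DB under the write lock, and the container-report path reads the handle under the read lock, so it can never observe a closed or half-replaced DB. `ContainerDb`, `GuardedContainerDb`, and their methods are hypothetical stand-ins (an in-memory map instead of RocksDB). They are not the ContainerDBServiceProvider API, and this is not necessarily how HDDS-6667 was resolved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for Recon's container DB (not the real Ozone API).
final class ContainerDb implements AutoCloseable {
  private final Map<Long, Long> keyCounts = new ConcurrentHashMap<>();
  private volatile boolean closed = false;

  void put(long containerId, long keyCount) {
    // Using a closed handle models the stale-handle access described above.
    if (closed) throw new IllegalStateException("stale handle: DB closed");
    keyCounts.put(containerId, keyCount);
  }

  Long get(long containerId) {
    if (closed) throw new IllegalStateException("stale handle: DB closed");
    return keyCounts.get(containerId);
  }

  @Override
  public void close() { closed = true; }
}

// Hypothetical holder: every access to the handle goes through the lock.
final class GuardedContainerDb {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private ContainerDb db = new ContainerDb();

  // Snapshot path: delete-and-recreate under the exclusive write lock.
  void reinitialize() {
    lock.writeLock().lock();
    try {
      db.close();
      db = new ContainerDb();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Container-report path: reads the current handle under the shared read lock.
  void recordKeyCount(long containerId, long keyCount) {
    lock.readLock().lock();
    try {
      db.put(containerId, keyCount);
    } finally {
      lock.readLock().unlock();
    }
  }

  Long keyCount(long containerId) {
    lock.readLock().lock();
    try {
      return db.get(containerId);
    } finally {
      lock.readLock().unlock();
    }
  }
}

public class ReconDbSwapSketch {
  // Runs report handling and repeated reinitialization concurrently and
  // returns how many stale-handle errors the report thread saw.
  static int runConcurrentDemo() {
    GuardedContainerDb guarded = new GuardedContainerDb();
    AtomicInteger failures = new AtomicInteger();
    Thread reports = new Thread(() -> {
      for (long id = 0; id < 10_000; id++) {
        try {
          guarded.recordKeyCount(id, id);
        } catch (IllegalStateException e) {
          failures.incrementAndGet();
        }
      }
    });
    Thread snapshots = new Thread(() -> {
      for (int i = 0; i < 100; i++) guarded.reinitialize();
    });
    reports.start();
    snapshots.start();
    try {
      reports.join();
      snapshots.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    return failures.get();
  }

  public static void main(String[] args) {
    System.out.println("stale-handle errors: " + runConcurrentDemo());
  }
}
```

Because the close-and-replace step runs entirely under the write lock, `runConcurrentDemo()` always reports zero stale-handle errors. The trade-off in this sketch is that container reports block while a snapshot rebuild holds the write lock.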
--
This message was sent by Atlassian Jira
(v8.20.7#820007)