[
https://issues.apache.org/jira/browse/HDDS-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866295#comment-17866295
]
Arafat Khan edited comment on HDDS-11187 at 7/18/24 6:25 PM:
-------------------------------------------------------------
h3. How to Replicate the ClassCastException
To reproduce this problem, it helps to understand how the
{{ClassCastException}} occurs and which bug in Recon led to it. When an event
from OM reaches Recon, {{OMDBUpdatesHandler}} packages it into an event object
of class {{OMDBUpdateEvent}}, which has two essential fields:
{{updatedValue}} and {{oldValue}}.
{*}{{oldValue}}{*}: This is the previous state of a database entry before an
update (UPDATE) or delete (DELETE) operation. It represents what was stored in
the database before the current operation.
{*}{{updatedValue}}{*} (also called {{newValue}} in this comment): This is the
new state of a database entry after an update (UPDATE) or create (PUT)
operation. It represents the current state being written to the database.
The {{updatedValue}} is fetched from the OM side, while the {{oldValue}} is
fetched from a map inside {{OMDBUpdatesHandler}} called
{{omdbLatestUpdateEvents}}. Before this fix, {{omdbLatestUpdateEvents}} was a
simple {{Map<Object, OMDBUpdateEvent>}} that stored the latest database update
events without distinguishing between tables. When two tables share the same
key structure, an event from one table can be picked up as the previous state
for the other, which corrupts the event and causes issues like
{{ClassCastException}} during event processing in Recon.
For example, consider the {{FileTable}} and the {{DirectoryTable}}. Both
tables have the same key structure in RocksDB:
* {{directoryTable: /volumeId/bucketId/parentId/dirName -> DirInfo}}
* {{fileTable: /volumeId/bucketId/parentId/fileName -> KeyInfo}}
A file and a directory with the same name under the same parent therefore map
to the same key. Executing the following commands triggers a
{{ClassCastException}} because the file name and the directory name are both
{{dir6}}:
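{code:bash}
# 1. Create a file (key) named dir6 in the bucket fso-bucket
ozone sh key put s3v/fso-bucket/dir6 NOTICE.txt
# 2. Delete that file
ozone sh key delete s3v/fso-bucket/dir6
# 3. Create a directory with the same name, dir6
ozone fs -mkdir -p ofs://om/s3v/fso-bucket/dir6{code}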
h3. Breakdown of the Error:
# {*}First Command{*}: The first command is recorded as a PUT operation on the
{{FileTable}} in an {{OMDBUpdateEvent}} and stored in the
{{omdbLatestUpdateEvents}} map as:
** Key: {{/volumeId/bucketId/parentId/dir6}}
** Value: {{OMDBUpdateEvent}} with {{updatedValue}} as the {{OmKeyInfo}} for
the file named {{dir6}} and {{oldValue}} as {{null}}.
# {*}Second Command{*}: The second command is recorded as a DELETE operation
on the {{FileTable}} in an {{OMDBUpdateEvent}}. The handler first checks
{{omdbLatestUpdateEvents}} for a previous entry for this key ({{dir6}}). It
finds the previous PUT operation, so the record in the map changes to a DELETE
operation with {{updatedValue}} as the {{OmKeyInfo}} (the value being deleted)
and {{oldValue}} as {{null}}.
# {*}Third Command{*}: Creating a directory named {{dir6}} results in a new
{{OMDBUpdateEvent}} for the {{DirectoryTable}}, whose {{updatedValue}} is an
{{OmDirectoryInfo}} object. However, when the handler checks
{{omdbLatestUpdateEvents}}, it finds the entry left by the previous DELETE
operation and takes its {{updatedValue}}, an {{OmKeyInfo}}, as the
{{oldValue}}. This mismatch ({{updatedValue}} as {{OmDirectoryInfo}} and
{{oldValue}} as {{OmKeyInfo}}) leads to a {{ClassCastException}}.
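The three steps above can be replayed in a small Java sketch. This is a simplified illustration under stated assumptions, not the Recon code: {{KeyInfoStub}}, {{DirInfoStub}}, {{EventStub}} and {{FlatMapCollisionSketch}} are hypothetical stand-ins for {{OmKeyInfo}}, {{OmDirectoryInfo}}, {{OMDBUpdateEvent}} and the handler, and the DELETE handling only mirrors what the breakdown describes.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for OmKeyInfo and OmDirectoryInfo.
class KeyInfoStub { }
class DirInfoStub { }

// Simplified, hypothetical stand-in for OMDBUpdateEvent.
class EventStub {
  final String table;
  final String action;
  final Object updatedValue;
  final Object oldValue;

  EventStub(String table, String action, Object updatedValue, Object oldValue) {
    this.table = table;
    this.action = action;
    this.updatedValue = updatedValue;
    this.oldValue = oldValue;
  }
}

class FlatMapCollisionSketch {
  // One map for all tables, keyed only by the DB key (the layout described above).
  private final Map<Object, EventStub> latestEvents = new HashMap<>();

  EventStub record(String table, Object key, String action, Object value) {
    EventStub previous = latestEvents.get(key);
    Object previousValue = previous == null ? null : previous.updatedValue;
    EventStub event = "DELETE".equals(action)
        // As in step 2: a DELETE carries the value being deleted and no oldValue.
        ? new EventStub(table, action, previousValue, null)
        // Otherwise the previous entry's value becomes oldValue, whatever its table.
        : new EventStub(table, action, value, previousValue);
    latestEvents.put(key, event);
    return event;
  }

  // Replays the three commands; returns true if the directory-side cast fails.
  static boolean reproduces() {
    FlatMapCollisionSketch handler = new FlatMapCollisionSketch();
    String key = "/volumeId/bucketId/parentId/dir6";
    handler.record("fileTable", key, "PUT", new KeyInfoStub());
    handler.record("fileTable", key, "DELETE", null);
    EventStub mkdir =
        handler.record("directoryTable", key, "PUT", new DirInfoStub());
    try {
      // A directory-table consumer expects oldValue to hold directory info,
      // but the flat map handed back the file's KeyInfoStub.
      DirInfoStub old = (DirInfoStub) mkdir.oldValue;
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }
}
{code}
Calling {{FlatMapCollisionSketch.reproduces()}} returns {{true}}: the third event's {{oldValue}} is the file's value left behind by the DELETE entry, so the cast on the directory side fails.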
*To prevent such failures, we implemented a safeguard (HDDS-8310) that checks
for value mismatches and ignores the offending events. However, ignoring these
events is not ideal: in the example above, Recon would never learn about the
directory {{dir6}}, leaving its data inconsistent with OM. To fix this, we need
to implement a map that distinguishes between different tables.*
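The kind of safeguard mentioned above can be sketched as a runtime type comparison. The actual HDDS-8310 check is not shown in this comment, so the {{ValueTypeGuard}} class and its {{isMismatch}} method below are hypothetical and only assume that a mismatch means the old and updated values have different classes.
{code:java}
// Hypothetical helper; not the HDDS-8310 implementation.
final class ValueTypeGuard {
  private ValueTypeGuard() { }

  // Returns true when both values are present but have different runtime
  // classes, i.e. the event mixes entries from two different tables.
  static boolean isMismatch(Object oldValue, Object updatedValue) {
    return oldValue != null
        && updatedValue != null
        && !oldValue.getClass().equals(updatedValue.getClass());
  }
}
{code}
A handler that skips every event for which {{isMismatch}} returns {{true}} avoids the exception, but it also drops the directory creation from the example, which is why skipping alone is not a sufficient fix.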
was (Author: JIRAUSER284839):
h3. Problem Description
The existing implementation of *{{OMDBUpdatesHandler}}* in Recon saves events
from all tables into a single map, {*}{{omdbLatestUpdateEvents}}{*}, using just
the key structure. This leads to corruption when different tables, such as
*{{keyTable}}* and {*}{{deletedTable}}{*}, use the same key structure
({*}{{/volumeName/bucketName/keyName}}{*}). Consequently, events from different
tables with identical keys can overwrite each other, resulting in data
inconsistencies and causing issues like *{{ClassCastException}}* when the wrong
event type is retrieved and cast downstream.
h3. Solution Description
To resolve this issue, we propose modifying the *{{omdbLatestUpdateEvents}}*
map to include an additional layer that incorporates the table name. The new
structure will be a nested map: {*}{{Map<String, Map<Object,
OMDBUpdateEvent>>}}{*}, where the outer map's key is the table name, and the
inner map's key is the actual key structure. This ensures that events from
different tables with the same key structure are stored separately, avoiding
conflicts.
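A minimal sketch of the nested map, assuming table names are plain strings: the class {{PerTableLatestEvents}} and its methods are hypothetical names used for illustration, and the event type is left generic so the example stays self-contained.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper around the proposed nested map.
class PerTableLatestEvents<E> {
  // Outer key: table name. Inner key: the key within that table.
  private final Map<String, Map<Object, E>> latestByTable = new HashMap<>();

  // Returns the latest event for this key in this table only, or null.
  E latest(String table, Object key) {
    Map<Object, E> perTable = latestByTable.get(table);
    return perTable == null ? null : perTable.get(key);
  }

  // Stores the event under its own table, creating the inner map on demand.
  void record(String table, Object key, E event) {
    latestByTable.computeIfAbsent(table, t -> new HashMap<>()).put(key, event);
  }
}
{code}
With this layout, an event recorded under one table name is never returned by a lookup under another table name, even when both use an identical key.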
h3. Fix for ClassCastException and Future Improvements
This solution will fix the *{{ClassCastException}}* problem from the Recon end
by ensuring that events from different tables are isolated within the map,
preventing them from overwriting each other. However, we still need to address
the root cause from the OM end to prevent the creation of incorrect events.
Ensuring that each event is correctly classified and stored in the appropriate
table at the source will further reinforce data integrity and prevent similar
issues in the future. Additionally, logging was added earlier, in
[HDDS-8310|https://github.com/apache/ozone/pull/5043], to capture events that
can lead to {*}{{ClassCastException}}{*}. We generally ignore these events
because they are already corrupted when generated on the OM side and need to be
fixed there. This patch only fixes the possible corruption in event creation on
the Recon side; the problem on the OM side persists and still needs to be
fixed. The HDDS-8310 logs will help identify and report such events once they
show up.
> Fix Event Handling Corruption in OMDBUpdatesHandler to Prevent
> ClassCastException in Recon Server
> -------------------------------------------------------------------------------------------------
>
> Key: HDDS-11187
> URL: https://issues.apache.org/jira/browse/HDDS-11187
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Recon
> Reporter: Arafat Khan
> Assignee: Arafat Khan
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.5.0
>
>
> A *ClassCastException* occurs in the Recon server during the
> FileSizeCountTask, where RepeatedOmKeyInfo is incorrectly cast to OmKeyInfo,
> causing task processing to fail.
>
> {code:java}
> 2024-06-11 10:40:03,700 INFO org.apache.hadoop.ozone.recon.tasks.FileSizeCountTask: Completed a 'process' run of FileSizeCountTask.
> 2024-06-11 10:40:03,700 ERROR org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl: Unexpected error :
> java.util.concurrent.ExecutionException: java.lang.ClassCastException: class org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo cannot be cast to class org.apache.hadoop.ozone.om.helpers.OmKeyInfo (org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo and org.apache.hadoop.ozone.om.helpers.OmKeyInfo are in unnamed module of loader 'app')
>     at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
>     at org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.processTaskResults(ReconTaskControllerImpl.java:247)
>     at org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.consumeOMEvents(ReconTaskControllerImpl.java:118)
>     at org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.syncDataFromOM(OzoneManagerServiceProviderImpl.java:511)
>     at org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.lambda$startSyncDataFromOM$0(OzoneManagerServiceProviderImpl.java:258)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.ClassCastException: class org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo cannot be cast to class org.apache.hadoop.ozone.om.helpers.OmKeyInfo (org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo and org.apache.hadoop.ozone.om.helpers.OmKeyInfo are in unnamed module of loader 'app')
>     at org.apache.hadoop.ozone.recon.tasks.NSSummaryTaskWithFSO.processWithFSO(NSSummaryTaskWithFSO.java:90)
>     at org.apache.hadoop.ozone.recon.tasks.NSSummaryTask.process(NSSummaryTask.java:97)
>     at org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.lambda$consumeOMEvents$0(ReconTaskControllerImpl.java:113)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     ... 3 more {code}