[jira] [Comment Edited] (HDDS-11187) Fix Event Handling Corruption in OMDBUpdatesHandler to Prevent ClassCastException in Recon Server

Arafat Khan (Jira) Thu, 18 Jul 2024 11:26:18 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866295#comment-17866295
 ]


Arafat Khan edited comment on HDDS-11187 at 7/18/24 6:25 PM:
-------------------------------------------------------------

h3. How to Replicate the ClassCastException ➖

To reproduce this problem, it helps to first understand how the 
{{ClassCastException}} occurs and which bug in Recon causes it. When an event 
from OM reaches Recon, {{OMDBUpdatesHandler}} packages it into an event object 
of class {{OMDBUpdateEvent}}, which has two essential fields: 
{{updatedValue}} and {{oldValue}}.

{*}{{oldValue}}{*}: This is the previous state of a database entry before an 
update (UPDATE) or delete (DELETE) operation. It represents what was stored in 
the database before the current operation.

{*}{{updatedValue}}{*} (called {{newValue}} below): This is the new state of a 
database entry after an update (UPDATE) or create (PUT) operation. It 
represents the current state being written to the database.

The {{updatedValue}} is fetched from the OM side, while the {{oldValue}} is 
fetched from an existing map inside {{OMDBUpdatesHandler}} called 
{{omdbLatestUpdateEvents}}. Before this fix, {{omdbLatestUpdateEvents}} was a 
simple {{Map<Object, OMDBUpdateEvent>}} that stored the latest database update 
event per key without distinguishing between tables. When different tables 
share the same key structure, their events collide in this map and corrupt each 
other, causing issues like {{ClassCastException}} during event processing in 
Recon.
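
The collision itself can be sketched in a few lines of Java. This is only an illustration, not Recon's code: the {{Event}} class is a hypothetical stand-in for {{OMDBUpdateEvent}}, and the string values stand in for the real value objects.

```java
import java.util.HashMap;
import java.util.Map;

public class FlatMapCollision {
  /** Hypothetical stand-in for OMDBUpdateEvent: source table plus value. */
  static final class Event {
    final String table;
    final Object updatedValue;

    Event(String table, Object updatedValue) {
      this.table = table;
      this.updatedValue = updatedValue;
    }
  }

  /** Records one event per table under the same key in a single flat map. */
  static Map<Object, Event> recordBoth(String key) {
    Map<Object, Event> latest = new HashMap<>();
    latest.put(key, new Event("fileTable", "file-info"));
    // Same key, different table: this silently replaces the fileTable entry.
    latest.put(key, new Event("directoryTable", "dir-info"));
    return latest;
  }

  public static void main(String[] args) {
    String key = "/volumeId/bucketId/parentId/dir6";
    Map<Object, Event> latest = recordBoth(key);
    System.out.println(latest.size() + " entry, owned by " + latest.get(key).table);
  }
}
```

Only one entry survives, and nothing in the map key records which table it came from.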

For example, consider the {{FileTable}} and the {{{}DirectoryTable{}}}. Both 
tables have the same key structure in RocksDB:
 * {{directoryTable: /volumeId/bucketId/parentId/dirName -> DirInfo}}
 * {{fileTable: /volumeId/bucketId/parentId/fileName -> KeyInfo}}

Because the keys have the same form, a file and a directory with the same name 
under the same parent map to the same key. Running the following commands in 
order triggers the {{ClassCastException}}, because the file and the directory 
are both named {{dir6}}:
 
{code:bash}
ozone sh key put s3v/fso-bucket/dir6 NOTICE.txt
ozone sh key delete s3v/fso-bucket/dir6
ozone fs -mkdir -p ofs://om/s3v/fso-bucket/dir6
{code}

h3. Breakdown of the Error:
 # {*}First Command{*}: The first command is recorded as a PUT operation 
on the {{FileTable}} in {{OMDBUpdateEvent}} and stored in the 
{{omdbLatestUpdateEvents}} map as:
 ** Key: {{/volumeId/bucketId/parentId/dir6}}
 ** Value: {{OMDBUpdateEvent}} with {{updatedValue}} as {{OmKeyInfo}} for the 
file named {{dir6}} and {{oldValue}} as {{null}}.
 # {*}Second Command{*}: The second command will be recorded as a DELETE 
operation on the {{FileTable}} in {{{}OMDBUpdateEvent{}}}. It will first check 
the {{omdbLatestUpdateEvents}} for a previous mention of this key (dir6). It 
finds the previous PUT operation, so the record in the map changes to a DELETE 
operation with {{updatedValue}} as {{OmKeyInfo}} (the value to be deleted) and 
{{oldValue}} as {{{}null{}}}.
 # {*}Third Command{*}: When creating a directory named {{{}dir6{}}}, it 
results in a new {{OMDBUpdateEvent}} for the {{{}DirectoryTable{}}}. The 
{{newValue}} will be an {{OmDirectoryInfo}} object. However, when it checks 
{{{}omdbLatestUpdateEvents{}}}, it finds the old value associated with the 
previous DELETE operation, which is {{OmKeyInfo}} (the newValue of the delete 
event). This mismatch (newValue as {{OmDirectoryInfo}} and oldValue as 
{{{}OmKeyInfo{}}}) leads to a ClassCastException.
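
The three steps can be replayed in a small, self-contained Java model. It is a sketch built only from the description above, not taken from Recon: {{KeyInfo}}, {{DirInfo}}, {{Event}} and {{record}} are hypothetical stand-ins for {{OmKeyInfo}}, {{OmDirectoryInfo}}, {{OMDBUpdateEvent}} and the handler logic, and the final cast stands in for a downstream consumer that assumes the value types of its own table.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplayDir6 {
  // Hypothetical stand-ins for OmKeyInfo and OmDirectoryInfo.
  static final class KeyInfo { }
  static final class DirInfo { }

  /** Hypothetical, simplified version of OMDBUpdateEvent. */
  static final class Event {
    final String table;
    final String action;
    final Object updatedValue;
    final Object oldValue;

    Event(String table, String action, Object updatedValue, Object oldValue) {
      this.table = table;
      this.action = action;
      this.updatedValue = updatedValue;
      this.oldValue = oldValue;
    }
  }

  // Flat map keyed only by the RocksDB key, with no table name involved.
  private final Map<Object, Event> latest = new HashMap<>();

  /** Records an event the way the three steps above describe it. */
  Event record(String table, String action, String key, Object value) {
    Event previous = latest.get(key);
    Object previousValue = previous == null ? null : previous.updatedValue;
    Event event;
    if ("DELETE".equals(action)) {
      // The delete event carries the value being deleted; oldValue stays null.
      event = new Event(table, action, previousValue, null);
    } else {
      // oldValue comes from the last event for this key, whatever its table.
      event = new Event(table, action, value, previousValue);
    }
    latest.put(key, event);
    return event;
  }

  /** What a directoryTable consumer does: assume oldValue is a DirInfo. */
  static DirInfo oldDir(Event event) {
    return (DirInfo) event.oldValue;
  }

  /** Replays the three commands and reports whether the cast fails. */
  static boolean replayThrows() {
    ReplayDir6 handler = new ReplayDir6();
    String key = "/volumeId/bucketId/parentId/dir6";
    handler.record("fileTable", "PUT", key, new KeyInfo());
    handler.record("fileTable", "DELETE", key, null);
    Event mkdir = handler.record("directoryTable", "PUT", key, new DirInfo());
    try {
      oldDir(mkdir);
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("ClassCastException thrown: " + replayThrows());
  }
}
```

In this model the directory event inherits a {{KeyInfo}} as its {{oldValue}} from the earlier delete on the {{fileTable}}, which is the mismatch described in the third step.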

*To prevent such issues, we implemented a safeguard (HDDS-8310) that checks for 
value mismatches and ignores such events. However, ignoring these events is not 
ideal, as it can lead to data inconsistency: in the example above, Recon would 
never learn about the directory {{dir6}}. To fix this, we need to implement a 
map that distinguishes between different tables.*


was (Author: JIRAUSER284839):
h3. Problem Description

The existing implementation of *{{OMDBUpdatesHandler}}* in Recon saves events 
from all tables into a single map, {*}{{omdbLatestUpdateEvents}}{*}, using just 
the key structure. This leads to corruption when different tables, such as 
*{{keyTable}}* and {*}{{deletedTable}}{*}, use the same key structure 
({*}{{/volumeName/bucketName/keyName}}{*}). Consequently, events from different 
tables with identical keys can overwrite each other, resulting in data 
inconsistencies and causing issues like *{{ClassCastException}}* when the wrong 
event type is retrieved and cast downstream.
h3. Solution Description

To resolve this issue, we propose modifying the *{{omdbLatestUpdateEvents}}* 
map to include an additional layer that incorporates the table name. The new 
structure will be a nested map: 
{*}{{Map<String, Map<Object, OMDBUpdateEvent>>}}{*}, where the outer map's key 
is the table name, and the inner map's key is the actual key structure. This 
ensures that events from different tables with the same key structure are 
stored separately, avoiding conflicts.
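
A minimal sketch of such a nested map is shown below. The class and method names are hypothetical, and the values are plain {{Object}} for brevity where the real map would hold {{OMDBUpdateEvent}}; this is an illustration of the structure, not the patch itself.

```java
import java.util.HashMap;
import java.util.Map;

public class PerTableLatestEvents {
  // Outer key: table name. Inner key: the RocksDB key within that table.
  private final Map<String, Map<Object, Object>> latestByTable = new HashMap<>();

  /** Stores a value and returns the previous one for the same table and key. */
  Object put(String table, Object key, Object value) {
    return latestByTable
        .computeIfAbsent(table, t -> new HashMap<>())
        .put(key, value);
  }

  /** Returns the latest value for this table and key, or null if none. */
  Object get(String table, Object key) {
    Map<Object, Object> perTable = latestByTable.get(table);
    return perTable == null ? null : perTable.get(key);
  }

  public static void main(String[] args) {
    PerTableLatestEvents events = new PerTableLatestEvents();
    String key = "/volumeId/bucketId/parentId/dir6";
    events.put("fileTable", key, "key-info");
    events.put("directoryTable", key, "dir-info");
    // Each table keeps its own entry even though the key is identical.
    System.out.println(events.get("fileTable", key));
    System.out.println(events.get("directoryTable", key));
  }
}
```

With this layout, a lookup for a {{directoryTable}} event can no longer return a value that was written by a {{fileTable}} event.
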
h3. Fix for ClassCastException and Future Improvements

This solution will fix the *{{ClassCastException}}* problem from the Recon end 
by ensuring that events from different tables are isolated within the map, 
preventing them from overwriting each other. However, we still need to address 
the root cause from the OM end to prevent the creation of incorrect events. 
Ensuring that each event is correctly classified and stored in the appropriate 
table at the source will further reinforce data integrity and prevent similar 
issues in the future. Our changes in this patch only fix the problem of 
possible corruption in event creation on the Recon side; the problem on the OM 
side still persists and needs to be fixed. Until then, Recon generally ignores 
corrupted events that are generated on the OM side. The logs added in the 
previous patch, [HDDS-8310|https://github.com/apache/ozone/pull/5043], capture 
events that can lead to {*}{{ClassCastException}}{*} and will help in 
identifying and reporting these issues. We just have to wait for these logs to 
be reported.

> Fix Event Handling Corruption in OMDBUpdatesHandler to Prevent 
> ClassCastException in Recon Server
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-11187
>                 URL: https://issues.apache.org/jira/browse/HDDS-11187
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Recon
>            Reporter: Arafat Khan
>            Assignee: Arafat Khan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.5.0
>
>
> A *ClassCastException* occurs in the Recon server during the 
> FileSizeCountTask, where RepeatedOmKeyInfo is incorrectly cast to OmKeyInfo, 
> causing task processing to fail.
>  
> {code:java}
> 2024-06-11 10:40:03,700 INFO 
> org.apache.hadoop.ozone.recon.tasks.FileSizeCountTask: Completed a 'process' 
> run of FileSizeCountTask.
> 2024-06-11 10:40:03,700 ERROR 
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl: Unexpected error 
> : 
> java.util.concurrent.ExecutionException: java.lang.ClassCastException: class 
> org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo cannot be cast to class 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo 
> (org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo and 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo are in unnamed module of loader 
> 'app')
>       at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
>       at 
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.processTaskResults(ReconTaskControllerImpl.java:247)
>       at 
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.consumeOMEvents(ReconTaskControllerImpl.java:118)
>       at 
> org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.syncDataFromOM(OzoneManagerServiceProviderImpl.java:511)
>       at 
> org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.lambda$startSyncDataFromOM$0(OzoneManagerServiceProviderImpl.java:258)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at 
> java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
>       at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.ClassCastException: class 
> org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo cannot be cast to class 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo 
> (org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo and 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo are in unnamed module of loader 
> 'app')
>       at 
> org.apache.hadoop.ozone.recon.tasks.NSSummaryTaskWithFSO.processWithFSO(NSSummaryTaskWithFSO.java:90)
>       at 
> org.apache.hadoop.ozone.recon.tasks.NSSummaryTask.process(NSSummaryTask.java:97)
>       at 
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.lambda$consumeOMEvents$0(ReconTaskControllerImpl.java:113)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       ... 3 more {code}


