[
https://issues.apache.org/jira/browse/HDDS-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866295#comment-17866295
]
Arafat Khan commented on HDDS-11187:
------------------------------------
h3. Problem Description
The existing implementation of *{{OMDBUpdatesHandler}}* in Recon stores events
from all tables in a single map, {*}{{omdbLatestUpdateEvents}}{*}, keyed only by
the row key. This leads to corruption when different tables, such as
*{{keyTable}}* and {*}{{deletedTable}}{*}, use the same key structure
({*}{{/volumeName/bucketName/keyName}}{*}). An event from one table can then
overwrite an event from another table that has an identical key. The result is
data inconsistencies and failures such as *{{ClassCastException}}* when the
wrong event type is retrieved and cast downstream.
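The collision described above can be reproduced in a few lines. This is a minimal sketch, not the Recon code: {{KeyInfoStub}}, {{RepeatedKeyInfoStub}} and {{EventStub}} are hypothetical stand-ins for {{OmKeyInfo}}, {{RepeatedOmKeyInfo}} and {{OMDBUpdateEvent}}, and the key path is made up.

```java
import java.util.HashMap;
import java.util.Map;

final class SingleMapCollisionDemo {
  // Hypothetical stand-in for OmKeyInfo (the value type of keyTable).
  static final class KeyInfoStub {
    final String name;
    KeyInfoStub(String name) { this.name = name; }
  }

  // Hypothetical stand-in for RepeatedOmKeyInfo (the value type of deletedTable).
  static final class RepeatedKeyInfoStub {
    final String name;
    RepeatedKeyInfoStub(String name) { this.name = name; }
  }

  // Simplified event: remembers its source table and carries an untyped value.
  static final class EventStub {
    final String table;
    final Object value;
    EventStub(String table, Object value) {
      this.table = table;
      this.value = value;
    }
  }

  // Mimics a latest-event map that is keyed only by the key path.
  static Map<Object, EventStub> collectLatest() {
    Map<Object, EventStub> latest = new HashMap<>();
    String key = "/vol1/bucket1/key1";
    latest.put(key, new EventStub("keyTable", new KeyInfoStub("key1")));
    // Same key path, different table: silently replaces the keyTable entry.
    latest.put(key, new EventStub("deletedTable", new RepeatedKeyInfoStub("key1")));
    return latest;
  }

  // A consumer that assumes every stored value came from keyTable.
  static boolean castFails(Map<Object, EventStub> latest, Object key) {
    try {
      KeyInfoStub info = (KeyInfoStub) latest.get(key).value;
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Map<Object, EventStub> latest = collectLatest();
    System.out.println("entries: " + latest.size());
    System.out.println("cast fails: " + castFails(latest, "/vol1/bucket1/key1"));
  }
}
```

Only one entry survives for the shared key, and it holds the {{deletedTable}} value, so a consumer that expects the {{keyTable}} type fails on the cast.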
h3. Solution Description
To resolve this issue, we propose modifying the *{{omdbLatestUpdateEvents}}*
map to include an additional layer that incorporates the table name. The new
structure will be a nested map, {*}{{Map<String, Map<Object, OMDBUpdateEvent>>}}{*},
where the outer map's key is the table name and the
inner map's key is the actual key structure. This ensures that events from
different tables with the same key structure are stored separately, avoiding
conflicts.
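A rough sketch of the nested structure, under the assumption that events are recorded and looked up by table name plus key. The class {{PerTableLatestEvents}}, its methods and {{EventStub}} are hypothetical names used for illustration; they are not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

final class PerTableLatestEvents {
  // Simplified, hypothetical event type; the real OMDBUpdateEvent carries more fields.
  static final class EventStub {
    final String table;
    final Object value;
    EventStub(String table, Object value) {
      this.table = table;
      this.value = value;
    }
  }

  // Outer key: table name. Inner key: the key within that table.
  private final Map<String, Map<Object, EventStub>> latestByTable = new HashMap<>();

  // Stores the latest event for a key, scoped to the table it belongs to.
  void record(String table, Object key, Object value) {
    latestByTable
        .computeIfAbsent(table, t -> new HashMap<>())
        .put(key, new EventStub(table, value));
  }

  // Returns the latest event for the key in the given table, or null if none.
  EventStub latest(String table, Object key) {
    Map<Object, EventStub> perTable = latestByTable.get(table);
    return perTable == null ? null : perTable.get(key);
  }

  public static void main(String[] args) {
    PerTableLatestEvents events = new PerTableLatestEvents();
    String key = "/vol1/bucket1/key1";
    events.record("keyTable", key, "live key info");
    events.record("deletedTable", key, "deleted key info");
    // Both events are still retrievable because each table has its own inner map.
    System.out.println(events.latest("keyTable", key).value);
    System.out.println(events.latest("deletedTable", key).value);
  }
}
```

Because the lookup always names the table, an event read for {{keyTable}} can only have been written for {{keyTable}}, which is what removes the cross-table overwrite.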
h3. Fix for ClassCastException and Future Improvements
This solution fixes the *{{ClassCastException}}* problem on the Recon side by
isolating events from different tables within the map, so that they can no
longer overwrite each other. The changes in this patch address only the
possible corruption of events on the Recon side. The root cause on the OM side,
where incorrect events are created, still persists and needs to be fixed
separately. Ensuring that each event is correctly classified and stored in the
appropriate table at the source will further reinforce data integrity and
prevent similar issues in the future. Additionally, logs were added earlier, in
[HDDS-8310|https://github.com/apache/ozone/pull/5043], to capture events that
can lead to {*}{{ClassCastException}}{*}. Generally, we ignore these events
because they are corrupted on the OM side and have to be fixed there. Those
logs will help in identifying and reporting the OM-side issues. For now, we
have to wait for them to be reported.
> ClassCastException in Recon Server during FileSizeCountTask Execution
> ---------------------------------------------------------------------
>
> Key: HDDS-11187
> URL: https://issues.apache.org/jira/browse/HDDS-11187
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Recon
> Reporter: Arafat Khan
> Assignee: Arafat Khan
> Priority: Major
> Fix For: 1.5.0
>
>
> A *ClassCastException* occurs in the Recon server during the
> FileSizeCountTask, where RepeatedOmKeyInfo is incorrectly cast to OmKeyInfo,
> causing task processing to fail.
>
> {code:java}
> 2024-06-11 10:40:03,700 INFO
> org.apache.hadoop.ozone.recon.tasks.FileSizeCountTask: Completed a 'process'
> run of FileSizeCountTask.
> 2024-06-11 10:40:03,700 ERROR
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl: Unexpected error
> :
> java.util.concurrent.ExecutionException: java.lang.ClassCastException: class
> org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo cannot be cast to class
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo
> (org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo and
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo are in unnamed module of loader
> 'app')
> at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
> at
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.processTaskResults(ReconTaskControllerImpl.java:247)
> at
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.consumeOMEvents(ReconTaskControllerImpl.java:118)
> at
> org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.syncDataFromOM(OzoneManagerServiceProviderImpl.java:511)
> at
> org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.lambda$startSyncDataFromOM$0(OzoneManagerServiceProviderImpl.java:258)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at
> java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
> at
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.ClassCastException: class
> org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo cannot be cast to class
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo
> (org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo and
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo are in unnamed module of loader
> 'app')
> at
> org.apache.hadoop.ozone.recon.tasks.NSSummaryTaskWithFSO.processWithFSO(NSSummaryTaskWithFSO.java:90)
> at
> org.apache.hadoop.ozone.recon.tasks.NSSummaryTask.process(NSSummaryTask.java:97)
> at
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.lambda$consumeOMEvents$0(ReconTaskControllerImpl.java:113)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> ... 3 more {code}