[
https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730510#action_12730510
]
Jerome Boulon commented on CHUKWA-338:
--------------------------------------
Ari,
Yes, a secondary sort (grouping comparator) will solve the issue, but I'm not
sure that all current adaptors are in line with the concept of virtual offset,
so that would be the first thing to validate.
Also, if you have more than one value for the same key, you may want to double
check that they actually have the same size/content to make sure it's a real
duplicate and not an issue with the virtual offset, especially after rotation.
Since, in my mind, the archiver is a background process, it should not be too
costly to always check for real duplicates vs. false duplicates (same
SequenceId but not the same content).
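
A minimal sketch of the real-vs-false duplicate check described above, in plain
Java with no Hadoop dependency. The class and method names are hypothetical, and
this is not the code in archiveDupSuppress.patch. It assumes that all values for
one key have already been grouped together, as a reducer would receive them, and
that each value is available as a byte array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: names are illustrative, not taken from Chukwa.
final class DuplicateSuppressor {

    /**
     * Given every value seen for a single key, keep one copy of each
     * distinct payload. Values that share the key but differ in size or
     * content are "false duplicates" and are all kept.
     */
    static List<byte[]> suppressRealDuplicates(List<byte[]> valuesForKey) {
        List<byte[]> kept = new ArrayList<>();
        for (byte[] candidate : valuesForKey) {
            boolean alreadyKept = false;
            for (byte[] existing : kept) {
                // Arrays.equals compares length first, then content.
                if (Arrays.equals(existing, candidate)) {
                    alreadyKept = true;
                    break;
                }
            }
            if (!alreadyKept) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<byte[]> values = new ArrayList<>();
        values.add(new byte[] {1, 2, 3}); // original chunk
        values.add(new byte[] {1, 2, 3}); // real duplicate: same content
        values.add(new byte[] {1, 2, 4}); // false duplicate: same key, other content
        System.out.println(suppressRealDuplicates(values).size()); // prints 2
    }
}
```

The pairwise comparison is quadratic in the number of values per key, which
should be acceptable if only a handful of chunks ever share a key. Comparing
hashes of the content first would be an alternative for larger groups.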
> duplicate suppression in archiver
> ---------------------------------
>
> Key: CHUKWA-338
> URL: https://issues.apache.org/jira/browse/CHUKWA-338
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: Data Processors
> Reporter: Ari Rabkin
> Assignee: Ari Rabkin
> Fix For: 0.3.0
>
> Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate
> detection and suppression if we get multiple chunks with the same key.