[
https://issues.apache.org/jira/browse/ACCUMULO-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018341#comment-15018341
]
Keith Turner commented on ACCUMULO-4062:
----------------------------------------
Personally, I think the issue Josh brought up, deduping the same col+values
in a different order, is a good reason not to do it, because I think
handling this case properly would be expensive. Maybe it's cheap if we do not
worry about that case, but since that is only a half solution, users concerned
about this may still have to dedupe outside the BatchWriter.
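
To make the ordering concern concrete, here is a rough sketch. It does not use Accumulo's Mutation class; a mutation is modeled, purely as an assumption for illustration, as a list whose first element is the row and whose remaining elements are "column=value" strings. The class and method names are hypothetical. It also assumes that update order carries no meaning, which may not hold when one column is updated twice in the same mutation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrderSensitiveDedupe {
    // Hypothetical stand-in for a mutation: the row followed by its
    // "column=value" updates, in the order the client added them.
    public static List<String> mutation(String row, String... updates) {
        List<String> m = new ArrayList<>();
        m.add(row);
        m.addAll(Arrays.asList(updates));
        return m;
    }

    // Cheap dedupe: List.equals compares element by element, in order,
    // so the same updates in a different order count as distinct.
    public static int countDistinctOrdered(List<List<String>> mutations) {
        return new HashSet<>(mutations).size();
    }

    // Order-insensitive dedupe: sort a copy of each mutation's updates
    // before comparing. This is the extra per-mutation work.
    public static int countDistinctCanonical(List<List<String>> mutations) {
        Set<List<String>> seen = new HashSet<>();
        for (List<String> m : mutations) {
            List<String> updates = new ArrayList<>(m.subList(1, m.size()));
            Collections.sort(updates);
            List<String> canonical = new ArrayList<>();
            canonical.add(m.get(0));
            canonical.addAll(updates);
            seen.add(canonical);
        }
        return seen.size();
    }
}
```

Given two mutations on the same row with the same two updates in swapped order, the ordered count treats them as two distinct mutations, while the canonical count collapses them into one at the cost of a copy and a sort per mutation.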
Could do something like the way Java streams work, e.g. {{new
ZipOutputStream(new FileOutputStream(file))}}. One possibility is that we create
a DedupingBatchWriter that wraps a BatchWriter, like {{new
DedupingBatchWriter(batchWriter, options)}}.
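
A minimal sketch of the wrapping idea follows. To keep it self-contained it does not implement Accumulo's real BatchWriter interface: Writer, DedupingWriter and RecordingWriter are hypothetical names, the options argument is left out, and the choice to forget seen mutations on flush is an assumption, not something decided in this issue.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupingWriterSketch {
    /** Hypothetical, minimal stand-in for a batch writer; not Accumulo's interface. */
    public interface Writer<M> {
        void addMutation(M m);
        void flush();
    }

    /** Decorator that drops any mutation equal to one already added since the last flush. */
    public static class DedupingWriter<M> implements Writer<M> {
        private final Writer<M> delegate;
        private final Set<M> pending = new HashSet<>();

        public DedupingWriter(Writer<M> delegate) {
            this.delegate = delegate;
        }

        @Override
        public void addMutation(M m) {
            // Set.add returns false when an equal element is already present,
            // so duplicates never reach the wrapped writer.
            if (pending.add(m)) {
                delegate.addMutation(m);
            }
        }

        @Override
        public void flush() {
            delegate.flush();
            // Assumption: the dedupe window ends at flush, which bounds memory use.
            pending.clear();
        }
    }

    /** In-memory writer used only to observe what the wrapper forwards. */
    public static class RecordingWriter<M> implements Writer<M> {
        public final List<M> written = new ArrayList<>();

        @Override
        public void addMutation(M m) {
            written.add(m);
        }

        @Override
        public void flush() {
        }
    }
}
```

The wrapper relies entirely on the equals and hashCode of whatever type M is, so it inherits the same-updates-in-different-order limitation unless M compares in an order-insensitive way.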
Also, Mutations are not usually used as keys in Accumulo code. Most code just
keys on the mutation's row. Mutation's hashCode and equals methods would
need a good set of unit tests added if something in Accumulo were going to rely
on them.
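
As a rough idea of what such tests would check, here is a small generic helper. It is only a sketch: EqualsHashCodeCheck and its methods are hypothetical names, and it works on any objects rather than on Mutation specifically.

```java
public class EqualsHashCodeCheck {
    /** True if a and b, which are expected to be equal, satisfy the basic equals/hashCode contract. */
    public static boolean equalPairHoldsContract(Object a, Object b) {
        return a.equals(a)                      // reflexive
            && a.equals(b) && b.equals(a)       // symmetric
            && a.hashCode() == b.hashCode()     // equal objects share a hash code
            && !a.equals(null);                 // never equal to null
    }

    /** True if a and b, which are expected to differ, are unequal in both directions. */
    public static boolean unequalPairHoldsContract(Object a, Object b) {
        return !a.equals(b) && !b.equals(a);
    }
}
```

Real tests would feed these checks pairs of mutations built the same way, built with updates in a different order, and built with differing rows, columns, values or timestamps.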
> Change MutationSet.mutations to use HashSet
> -------------------------------------------
>
> Key: ACCUMULO-4062
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4062
> Project: Accumulo
> Issue Type: Improvement
> Components: client
> Reporter: Dave Marion
>
> Change TabletServerBatchWriter.MutationSet.mutations from a
> {code}
> HashMap<String,List<Mutation>>
> {code}
> to
> {code}
> HashMap<String,HashSet<Mutation>>
> {code}
> so that duplicate mutations added by a client are not sent to the server.