[ https://issues.apache.org/jira/browse/CRUNCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053180#comment-14053180 ]
Gabriel Reid commented on CRUNCH-437:
-------------------------------------

I'm a bit late on this, but for future reference, another way of doing this would have been replacing the HashMultimap with an ArrayListMultimap (which has the semantics of Map<K, List<V>>).

> Fix Crunch Spark duplicate value aggregation
> --------------------------------------------
>
>                 Key: CRUNCH-437
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-437
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>             Fix For: 0.8.4, 0.11.0
>
>         Attachments: CRUNCH-437.patch
>
>
> The current Crunch-on-Spark mapside combiner uses a Multimap of key-value
> pairs to cache values for local aggregation. This is awesome, except it means
> that identical key-value outputs before a shuffle will only have one copy in
> the Multimap, which means that the aggregation counts may not be correct. We
> should fix it to use a proper Map<K, List<V>> to ensure that duplicate values
> are aggregated correctly.
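
To illustrate the bug described in the issue, here is a minimal sketch that uses only JDK collections rather than Guava or the actual Crunch combiner code. It assumes that a HashMultimap behaves like a Map<K, Set<V>> and an ArrayListMultimap like a Map<K, List<V>>. The class and method names (DuplicateAggregationDemo, cacheAsSets, cacheAsLists, sum) are hypothetical and do not appear in the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class; not part of Crunch or of CRUNCH-437.patch.
class DuplicateAggregationDemo {

    // Set semantics (assumed equivalent to HashMultimap):
    // a repeated (key, value) pair is stored only once.
    static Map<String, Set<Integer>> cacheAsSets(String[] keys, int[] values) {
        Map<String, Set<Integer>> cache = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            cache.computeIfAbsent(keys[i], k -> new HashSet<>()).add(values[i]);
        }
        return cache;
    }

    // List semantics (assumed equivalent to ArrayListMultimap):
    // every occurrence of a (key, value) pair is kept.
    static Map<String, List<Integer>> cacheAsLists(String[] keys, int[] values) {
        Map<String, List<Integer>> cache = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            cache.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return cache;
    }

    // Stand-in for a local aggregation step: sums the cached values of one key.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three identical ("a", 1) outputs before the shuffle, as in a count.
        String[] keys = {"a", "a", "a"};
        int[] values = {1, 1, 1};
        System.out.println(sum(cacheAsSets(keys, values).get("a")));  // prints 1
        System.out.println(sum(cacheAsLists(keys, values).get("a"))); // prints 3
    }
}
```

With set semantics the three identical pairs collapse into one cached entry, so the count comes out as 1. With list semantics all three survive and the count is 3, which is the behaviour the issue asks for.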