[
https://issues.apache.org/jira/browse/CRUNCH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828323#comment-13828323
]
Josh Wills commented on CRUNCH-301:
-----------------------------------
Sounds like a deep copy issue-- will dig into this now.
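
For readers unfamiliar with this class of bug, here is a minimal sketch of how a missing deep copy can produce repeated values. It uses plain Scala only, not the Crunch or Scrunch API, and the names Holder and ReuseDemo are hypothetical. It assumes the runtime hands back the same mutable object for every record, as Hadoop reducers do with value objects. Whether this is the actual cause of CRUNCH-301 is not confirmed here.

```scala
// Hypothetical stand-in for a value object that a framework reuses between records.
final class Holder(var left: String, var right: Int) {
  def copyOf: Holder = new Holder(left, right)
  override def toString: String = s"($left,$right)"
}

object ReuseDemo {
  // Iterator that overwrites and returns the SAME Holder instance on every next(),
  // similar in spirit to how Hadoop reducers reuse value objects.
  def reusingIterator(data: Seq[(String, Int)]): Iterator[Holder] = {
    val shared = new Holder("", 0)
    data.iterator.map { case (l, r) =>
      shared.left = l
      shared.right = r
      shared
    }
  }

  // Buffers references only: every buffered element points at the same object,
  // so all of them show whatever was written last.
  def withoutCopy(data: Seq[(String, Int)]): List[String] =
    reusingIterator(data).toList.map(_.toString)

  // Copies each element before buffering, so the distinct values survive.
  def withCopy(data: Seq[(String, Int)]): List[String] =
    reusingIterator(data).map(_.copyOf).toList.map(_.toString)

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", 4), ("a", 5), ("b", 6)) // illustrative values only
    println(withoutCopy(data).mkString(" ")) // (b,6) (b,6) (b,6)
    println(withCopy(data).mkString(" "))    // (a,4) (a,5) (b,6)
  }
}
```

In this sketch the last value written is the one that gets repeated. Which value is repeated in a real pipeline depends on when the buffering happens relative to the reuse.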
> Cogrouping tables where RHS has a Scala tuple value type causes duplicated
> and missing RHS values
> -------------------------------------------------------------------------------------------------
>
> Key: CRUNCH-301
> URL: https://issues.apache.org/jira/browse/CRUNCH-301
> Project: Crunch
> Issue Type: Bug
> Components: Scrunch
> Affects Versions: 0.8.0
> Environment: Hadoop 2
> Reporter: David Whiting
> Attachments: IsolatedBug.scala
>
>
> Suppose you have three record types, Rec1, Rec2 and Rec3.
> Rec1 references Rec2 via key1, and Rec2 references Rec3 (one-to-many) by
> key2. If you innerJoin Rec2 and Rec3 to make a PCollection[(Rec2,Rec3)] and
> then cogroup it against Rec1, the result does not surface the n different
> (Rec2,Rec3) tuples applicable to each Rec1; it surfaces just one of the
> (Rec2,Rec3) tuples multiple times.
> This only happens when running with MRPipeline, and not with MemPipeline.
> Attached is the simplest complete program I could come up with that
> reproduces this unexpected result.
> The result that is produced is:
> Rec1(1,tjena) Rec1(1,hello) (Rec2(1,a,0.5),Rec3(a,4))
> (Rec2(1,a,0.5),Rec3(a,4)) (Rec2(1,a,0.5),Rec3(a,4))
> (Rec2(1,a,0.5),Rec3(a,4))
> Rec1(2,goodbye) (Rec2(2,c,9.9),Rec3(c,6))
> As you can see, there's a single (Rec2, Rec3) tuple repeated many times,
> instead of showing all the distinct ones. This does not happen if you join
> against Rec2 on its own.