[ 
https://issues.apache.org/jira/browse/CRUNCH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828646#comment-13828646
 ] 

David Whiting commented on CRUNCH-301:
--------------------------------------

That looks like exactly what I was expecting, good catch.

> Cogrouping tables where RHS has a Scala tuple value type causes duplicated 
> and missing RHS values
> -------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-301
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-301
>             Project: Crunch
>          Issue Type: Bug
>          Components: Scrunch
>    Affects Versions: 0.8.0
>         Environment: Hadoop 2
>            Reporter: David Whiting
>         Attachments: CRUNCH-301.patch, IsolatedBug.scala
>
>
> Suppose you have three record types, Rec1, Rec2 and Rec3.
> Rec1 references Rec2 via key1, and Rec2 references Rec3 (one-to-many) by 
> key2. If you innerJoin Rec2 and Rec3 to make a PCollection[(Rec2,Rec3)] and 
> then cogroup it against Rec1, then instead of surfacing the n different 
> (Rec2,Rec3) tuples applicable to that Rec1, it surfaces just one of the 
> (Rec2,Rec3) tuples n times.
> This only happens when running with MRPipeline, and not with MemPipeline.
> Attached is the simplest complete program I could come up with which will 
> produce this unexpected result:
> The result that is produced is:
> Rec1(1,tjena) Rec1(1,hello)   (Rec2(1,a,0.5),Rec3(a,4))       (Rec2(1,a,0.5),Rec3(a,4))       (Rec2(1,a,0.5),Rec3(a,4))       (Rec2(1,a,0.5),Rec3(a,4))
> Rec1(2,goodbye)       (Rec2(2,c,9.9),Rec3(c,6))
> As you can see, there's a single (Rec2, Rec3) tuple repeated many times, 
> instead of showing all the distinct ones. This does not happen if you join 
> against Rec2 on its own.
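
For reference, the behaviour the description expects can be sketched with plain Scala collections. This is not the attached IsolatedBug.scala and it uses no Crunch or Scrunch API. The field names, the object name ExpectedCogroup and most of the sample values are assumptions made for illustration; only a few of the records appear in the output quoted above.

```scala
// Plain Scala collections only; no Crunch or Scrunch APIs are used here.
// Field names are hypothetical, inferred from the shape of the quoted output.
case class Rec1(key1: Int, text: String)
case class Rec2(key1: Int, key2: String, weight: Double)
case class Rec3(key2: String, count: Int)

object ExpectedCogroup {
  // Hypothetical sample data; only some of these values appear in the report.
  val rec1s = Seq(Rec1(1, "hello"), Rec1(1, "tjena"), Rec1(2, "goodbye"))
  val rec2s = Seq(Rec2(1, "a", 0.5), Rec2(1, "b", 0.6), Rec2(2, "c", 9.9))
  val rec3s = Seq(Rec3("a", 4), Rec3("a", 5), Rec3("b", 7), Rec3("c", 6))

  // Inner join of Rec2 and Rec3 on key2, re-keyed by Rec2.key1.
  val joined: Seq[(Int, (Rec2, Rec3))] =
    for {
      r2 <- rec2s
      r3 <- rec3s if r3.key2 == r2.key2
    } yield (r2.key1, (r2, r3))

  // Cogroup: for every key, all Rec1 values alongside all (Rec2, Rec3) pairs.
  val cogrouped: Map[Int, (Seq[Rec1], Seq[(Rec2, Rec3)])] = {
    val left = rec1s.groupBy(_.key1)
    val right = joined.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
    (left.keySet ++ right.keySet).map { k =>
      (k, (left.getOrElse(k, Seq.empty), right.getOrElse(k, Seq.empty)))
    }.toMap
  }

  def main(args: Array[String]): Unit =
    cogrouped.toSeq.sortBy(_._1).foreach { case (k, (l, r)) =>
      println(s"$k\t${l.mkString(" ")}\t${r.mkString("\t")}")
    }
}
```

With this sample data, key 1 carries three distinct (Rec2, Rec3) pairs on the right-hand side. The bug described above would instead show one of those pairs repeated.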



--
This message was sent by Atlassian JIRA
(v6.1#6144)
