[ 
https://issues.apache.org/jira/browse/CRUNCH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Wills updated CRUNCH-301:
------------------------------

    Attachment: CRUNCH-301.patch

[~davw] I used the attached patch and got this as output:

thugnet:isolation josh$ cat /tmp/isolation/output/part-r-00000 
Rec1(1,hello)   Rec1(1,tjena)   (Rec2(1,a,0.4),Rec3(a,4))       
(Rec2(1,a,0.5),Rec3(a,4))       (Rec2(1,b,0.6),Rec3(b,5))       
(Rec2(1,b,0.7),Rec3(b,5))
Rec1(2,goodbye) (Rec2(2,c,9.9),Rec3(c,6))

Is that roughly what you were expecting to see?

[~gabriel.reid] the patch makes a minor change to DoFns to allow a 
Configuration to be passed to the DoFns we use in derived PTypes, which is 
what I had to do to get this working. Let me know if you're okay with it or 
if you think there is a cleaner approach.

> Cogrouping tables where RHS has a Scala tuple value type causes duplicated 
> and missing RHS values
> -------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-301
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-301
>             Project: Crunch
>          Issue Type: Bug
>          Components: Scrunch
>    Affects Versions: 0.8.0
>         Environment: Hadoop 2
>            Reporter: David Whiting
>         Attachments: CRUNCH-301.patch, IsolatedBug.scala
>
>
> Suppose you have three record types, Rec1, Rec2 and Rec3.
> Rec1 references Rec2 via key1, and Rec2 references Rec3 (one-to-many) by 
> key2. If you innerJoin Rec2 and Rec3 to make a PCollection[(Rec2,Rec3)] and 
> then cogroup it against Rec1, then instead of surfacing the n different 
> (Rec2,Rec3) tuples applicable to the Rec1, it surfaces just one of the 
> (Rec2,Rec3) tuples multiple times.
> This only happens when running with MRPipeline, and not with MemPipeline.
> Attached (IsolatedBug.scala) is the simplest complete program I could come 
> up with that produces this unexpected result.
> The result that is produced is:
> Rec1(1,tjena) Rec1(1,hello)   (Rec2(1,a,0.5),Rec3(a,4))       
> (Rec2(1,a,0.5),Rec3(a,4))       (Rec2(1,a,0.5),Rec3(a,4))       
> (Rec2(1,a,0.5),Rec3(a,4))
> Rec1(2,goodbye)       (Rec2(2,c,9.9),Rec3(c,6))
> As you can see, there's a single (Rec2, Rec3) tuple repeated many times, 
> instead of showing all the distinct ones. This does not happen if you join 
> against Rec2 on its own.
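
For reference, the join-then-cogroup shape described above can be sketched with plain Scala collections. This is not the attached IsolatedBug.scala and does not use the Crunch or Scrunch API; the field names (key1, key2, label, score, value) are assumptions, and the sample records are the ones visible in the output pasted in this thread. It only illustrates the semantics the reporter expects: every distinct (Rec2,Rec3) pair for a key shows up once in that key's group.

```scala
// Plain-collections sketch of "innerJoin Rec2 with Rec3, then cogroup with Rec1".
// Assumption: field names are hypothetical; values come from the thread's output.
object CogroupSketch {
  case class Rec1(key1: Int, label: String)
  case class Rec2(key1: Int, key2: String, score: Double)
  case class Rec3(key2: String, value: Int)

  val rec1s: Seq[Rec1] =
    Seq(Rec1(1, "hello"), Rec1(1, "tjena"), Rec1(2, "goodbye"))
  val rec2s: Seq[Rec2] =
    Seq(Rec2(1, "a", 0.4), Rec2(1, "a", 0.5), Rec2(1, "b", 0.6),
        Rec2(1, "b", 0.7), Rec2(2, "c", 9.9))
  val rec3s: Seq[Rec3] =
    Seq(Rec3("a", 4), Rec3("b", 5), Rec3("c", 6))

  // Inner join of Rec2 and Rec3 on key2.
  val joined: Seq[(Rec2, Rec3)] =
    for (r2 <- rec2s; r3 <- rec3s if r2.key2 == r3.key2) yield (r2, r3)

  // Cogroup on key1: for each key, all Rec1 values and all joined pairs.
  val cogrouped: Map[Int, (Seq[Rec1], Seq[(Rec2, Rec3)])] = {
    val left = rec1s.groupBy(_.key1)
    val right = joined.groupBy(_._1.key1)
    (left.keySet ++ right.keySet).map { k =>
      k -> ((left.getOrElse(k, Seq.empty), right.getOrElse(k, Seq.empty)))
    }.toMap
  }

  def main(args: Array[String]): Unit =
    cogrouped.toSeq.sortBy(_._1).foreach { case (_, (l, r)) =>
      // One tab-separated line per key, Rec1 values first.
      println((l.map(_.toString) ++ r.map(_.toString)).mkString("\t"))
    }
}
```

With this data, the group for key 1 holds four distinct (Rec2,Rec3) pairs, which is the shape of the output obtained with the patch rather than the repeated single pair in the original report.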



--
This message was sent by Atlassian JIRA
(v6.1#6144)
