[ 
https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791866#comment-13791866
 ] 

Gabriel Reid commented on CRUNCH-278:
-------------------------------------

I like the idea behind it, and it looks good to me. Definitely good to get rid 
of what was basically two difference implementations of the 
MapsideJoinStrategy, and I have the feeling that this could be useful for a 
number of things outside of the context of mapside joins as well, even if I 
don't have a totally clear idea of what that would be exactly.

One thing I would mention is to add some docs on the ReadableSourceBundle, 
particularly on the configure method to specify where it gets called in the 
lifecycle. I understand how it works now, but I'm pretty sure I'll forget that 
before the next time I need to know it :-)

The idea of generalizing ReadableSourceBundle to allow including DoFns in it 
sounds interesting. If I think about it, it's a kind of optimization so that 
you could do something like filter/transform values from the in-memory side of 
the PCollection within the context of the MapsideJoin instead of doing it 
within the context of a MR job leading up to the use of the MapsideJoin. Am I 
seeing that correctly? Or do you see new functionality that would be added that 
isn't possible at all right now? In case it's just an optimization, maybe 
better hold off on that for this ticket, as it seems like there could be a fair 
bit of extra complexity that could creep in there.

> Improvements to MapsideJoin code
> --------------------------------
>
>                 Key: CRUNCH-278
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-278
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the 
> in-memory and MR-based Pipeline instances has always bugged me, so I set out 
> to eliminate the distinction between the two impls by creating a new 
> interface, ReadableSourceBundle<T>, that encapsulates the MR and in-memory 
> specific logic for doing mapside joins in order to remove the special-case 
> code in MapsideJoinStrategy and hopefully make other implementations that use 
> our mapside-join patterns much easier to test.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to