[
https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791866#comment-13791866
]
Gabriel Reid commented on CRUNCH-278:
-------------------------------------
I like the idea behind it, and it looks good to me. Definitely good to get rid
of what was basically two difference implementations of the
MapsideJoinStrategy, and I have the feeling that this could be useful for a
number of things outside of the context of mapside joins as well, even if I
don't have a totally clear idea of what that would be exactly.
One thing I would mention is to add some docs on the ReadableSourceBundle,
particularly on the configure method to specify where it gets called in the
lifecycle. I understand how it works now, but I'm pretty sure I'll forget that
before the next time I need to know it :-)
The idea of generalizing ReadableSourceBundle to allow including DoFns in it
sounds interesting. If I think about it, it's a kind of optimization so that
you could do something like filter/transform values from the in-memory side of
the PCollection within the context of the MapsideJoin instead of doing it
within the context of a MR job leading up to the use of the MapsideJoin. Am I
seeing that correctly? Or do you see new functionality that would be added that
isn't possible at all right now? In case it's just an optimization, maybe
better hold off on that for this ticket, as it seems like there could be a fair
bit of extra complexity that could creep in there.
> Improvements to MapsideJoin code
> --------------------------------
>
> Key: CRUNCH-278
> URL: https://issues.apache.org/jira/browse/CRUNCH-278
> Project: Crunch
> Issue Type: Bug
> Components: Core, MapReduce Patterns
> Reporter: Josh Wills
> Assignee: Josh Wills
> Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the
> in-memory and MR-based Pipeline instances has always bugged me, so I set out
> to eliminate the distinction between the two impls by creating a new
> interface, ReadableSourceBundle<T>, that encapsulates the MR and in-memory
> specific logic for doing mapside joins in order to remove the special-case
> code in MapsideJoinStrategy and hopefully make other implementations that use
> our mapside-join patterns much easier to test.
--
This message was sent by Atlassian JIRA
(v6.1#6144)