[
https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792410#comment-13792410
]
Gabriel Reid commented on CRUNCH-278:
-------------------------------------
So just to make sure I'm on the same page here: I'm thinking that in the
MapsideJoin case, the way it would work today is like this:
{code}
PTable<ImmutableBytesWritable,Result> htableContents =
    pipeline.read(FromHBase.table());
PTable<A,B> convertedHTable = htableContents.parallelDo(new DoSomethingFn());
PTable<A,Pair<C,B>> joined =
    new MapsideJoinStrategy().join(anotherPTable, convertedHTable);
{code}
and this would have the drawback that creating convertedHTable would require
a whole MR job to be kicked off, whereas what we want is for the conversion to
convertedHTable to happen in the initialize method of the MapsideJoin, so that
no MR job is needed.
Wouldn't this be possible with something like a "materialized" PCollection,
which could then operate in the same way as the in-memory PCollections? So then
we would end up with something like this:
{code}
PTable<ImmutableBytesWritable,Result> htableContents =
    pipeline.read(FromHBase.table());
PTable<A,B> convertedHTable =
    new MaterializedPCollection(htableContents).parallelDo(new DoSomethingFn());
PTable<A,Pair<C,B>> joined =
    new MapsideJoinStrategy().join(anotherPTable, convertedHTable);
{code}
Then when materialize() was called on a MaterializedPCollection, we would just
materialize the root PCollection, load everything into memory, and pass it
through the rest of its pipeline in memory, so that the processing of the
DoSomethingFn would occur in memory in the mapper. I guess that this would also
imply that calling Pipeline#write on a MaterializedPCollection would throw an
exception, unless there was some way of getting around that.
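To make the idea a bit more concrete, here is a rough sketch in plain Java with no Crunch dependency. Every name in it (MaterializedSketch, of, map, materialize, write) is hypothetical and is not Crunch API. It also assumes a one-to-one function per element, which is simpler than a real DoFn. It only illustrates the intended behaviour: transformations are recorded rather than run, materialize() reads the root once and applies them in memory, and write() is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the proposed "materialized" collection; not Crunch API.
final class MaterializedSketch<S, T> {
    // Stands in for the root PCollection (e.g. the contents of the HBase table).
    private final Iterable<S> root;
    // Composition of every function registered so far; nothing runs until materialize().
    private final Function<S, T> chain;

    private MaterializedSketch(Iterable<S> root, Function<S, T> chain) {
        this.root = root;
        this.chain = chain;
    }

    // Wraps a root collection with an empty (identity) processing chain.
    static <S> MaterializedSketch<S, S> of(Iterable<S> root) {
        return new MaterializedSketch<>(root, Function.identity());
    }

    // Rough analogue of parallelDo: records the function instead of planning a job.
    <U> MaterializedSketch<S, U> map(Function<? super T, ? extends U> fn) {
        return new MaterializedSketch<>(root, chain.andThen(fn));
    }

    // Rough analogue of materialize(): reads the root once and pushes each
    // element through the recorded chain, entirely in memory.
    List<T> materialize() {
        List<T> out = new ArrayList<>();
        for (S element : root) {
            out.add(chain.apply(element));
        }
        return out;
    }

    // Rough analogue of Pipeline#write: unsupported, because the data only
    // ever exists in the memory of whoever calls materialize().
    void write(String target) {
        throw new UnsupportedOperationException(
            "cannot write a materialized collection to " + target);
    }
}
```

With this sketch, MaterializedSketch.of(rows).map(convert).materialize() would run the conversion in the caller's memory (the mapper's initialize step, in the join case) rather than in a separate job.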
Is that kind of what you had in mind? Or am I talking about something totally
different?
> Improvements to MapsideJoin code
> --------------------------------
>
> Key: CRUNCH-278
> URL: https://issues.apache.org/jira/browse/CRUNCH-278
> Project: Crunch
> Issue Type: Bug
> Components: Core, MapReduce Patterns
> Reporter: Josh Wills
> Assignee: Josh Wills
> Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the
> in-memory and MR-based Pipeline instances has always bugged me, so I set out
> to eliminate the distinction between the two impls by creating a new
> interface, ReadableSourceBundle<T>, that encapsulates the MR and in-memory
> specific logic for doing mapside joins in order to remove the special-case
> code in MapsideJoinStrategy and hopefully make other implementations that use
> our mapside-join patterns much easier to test.