[ 
https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792410#comment-13792410
 ] 

Gabriel Reid commented on CRUNCH-278:
-------------------------------------

So just to make sure I'm on the same page here: I'm thinking that in the 
MapsideJoin case, the way it would work today is like this:

{code}
PTable<ImmutableBytesWritable,Result> htableContents = pipeline.read(FromHBase.table());
PTable<A,B> convertedHTable = htableContents.parallelDo(new DoSomethingFn());
PTable<A,Pair<C,B>> joined = new MapsideJoinStrategy().join(anotherPTable, convertedHTable);
{code}

and this would have the drawback that creating convertedHTable would require 
a whole MR job to be kicked off, whereas what we want is for the conversion to 
happen in the initialize method of the MapsideJoin, which avoids kicking off 
that MR job.

Wouldn't this be possible with something like a "materialized" PCollection, 
which could then operate in the same way as the in-memory PCollections? So then 
we would end up with something like this:

{code}
PTable<ImmutableBytesWritable,Result> htableContents = pipeline.read(FromHBase.table());
PTable<A,B> convertedHTable = new MaterializedPCollection(htableContents).parallelDo(new DoSomethingFn());
PTable<A,Pair<C,B>> joined = new MapsideJoinStrategy().join(anotherPTable, convertedHTable);
{code}
Then when materialize() is called on a MaterializedPCollection, we would just 
materialize the root PCollection, load everything into memory, and pass it 
through the rest of its pipeline in memory, so that the DoSomethingFn 
processing would occur in memory in the mapper. I guess that this would also 
imply that calling Pipeline#write on a MaterializedPCollection would throw an 
exception, unless there was some way of getting around that.
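
To make the idea a bit more concrete, here is a rough sketch in plain Java with 
no Crunch types at all. Every name in it (MaterializedSketch, of, the write 
signature) is made up for illustration and is not an existing or proposed Crunch 
API; it only shows the deferred, in-memory behaviour I have in mind:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for the "materialized" collection idea; not a Crunch class.
final class MaterializedSketch<T> {
  // Produces the fully processed contents in memory when invoked.
  private final Supplier<List<T>> loader;

  private MaterializedSketch(Supplier<List<T>> loader) {
    this.loader = loader;
  }

  // Wraps a root source (e.g. the HTable contents); nothing is read yet.
  static <T> MaterializedSketch<T> of(Supplier<? extends Iterable<T>> rootSource) {
    return new MaterializedSketch<>(() -> {
      List<T> out = new ArrayList<>();
      for (T t : rootSource.get()) {
        out.add(t);
      }
      return out;
    });
  }

  // Analogue of parallelDo: records the function and defers running it.
  <U> MaterializedSketch<U> parallelDo(Function<? super T, ? extends U> fn) {
    return new MaterializedSketch<>(() -> {
      List<U> out = new ArrayList<>();
      for (T t : loader.get()) {
        out.add(fn.apply(t));
      }
      return out;
    });
  }

  // Reads the root and pushes it through the recorded functions, all in memory.
  List<T> materialize() {
    return loader.get();
  }

  // Writing would need a real job, so this sketch simply rejects it.
  void write(String target) {
    throw new UnsupportedOperationException(
        "cannot write a materialized collection to " + target);
  }
}
```

In this sketch the root is only read when materialize() is called, which is the 
point at which the mapside join's initialize method would load its side of the 
join.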

Is that kind of what you had in mind? Or am I talking about something totally 
different?


> Improvements to MapsideJoin code
> --------------------------------
>
>                 Key: CRUNCH-278
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-278
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the 
> in-memory and MR-based Pipeline instances has always bugged me, so I set out 
> to eliminate the distinction between the two impls by creating a new 
> interface, ReadableSourceBundle<T>, that encapsulates the MR and in-memory 
> specific logic for doing mapside joins in order to remove the special-case 
> code in MapsideJoinStrategy and hopefully make other implementations that use 
> our mapside-join patterns much easier to test.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
