[
https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791959#comment-13791959
]
Josh Wills commented on CRUNCH-278:
-----------------------------------
So I had two contexts in mind: in-memory for unit testing, and also running
these DoFns inside of an MR context where they're not strictly part of the
CrunchMapper/CrunchReducer flow, but are instead embedded in the initialize()
step, which reads records in from the distributed cache and then performs
filters/transforms on them.
For example, think of being able to do mapside joins against (say) an HBase
table: you could construct the in-memory PTable of key-value pairs by reading
the table into the client and then doing some processing on those values
inside of the map initialization, vs. having to run a separate MR job that
processes that data into a file as a pre-processing step before running the
job. I'm not sure if that's the sort of thing folks would be interested in
doing, but it seemed cool to me.
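To make the idea concrete, here is a rough sketch in plain Java of the pattern
described above: load the small side during initialization, filter/transform it
while loading, and then join each streamed record against the in-memory table.
It deliberately uses no Crunch, Hadoop, or HBase classes. InMemoryJoiner,
initialize(), process(), and run() are hypothetical names made up for this
illustration, and a LinkedHashMap stands in for whatever would actually be
read from the distributed cache or an HBase table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

public class MapsideJoinSketch {

    /** Hypothetical stand-in for a DoFn-like join task; NOT a Crunch class. */
    static final class InMemoryJoiner<K, V, R> {
        private final Map<K, R> lookup = new HashMap<>();

        /**
         * Plays the role of initialize(): read the small side once, apply a
         * filter and a transform, and keep the surviving rows in memory.
         */
        void initialize(Iterable<Map.Entry<K, R>> smallSide,
                        Predicate<R> keep,
                        Function<R, R> transform) {
            for (Map.Entry<K, R> e : smallSide) {
                if (keep.test(e.getValue())) {
                    lookup.put(e.getKey(), transform.apply(e.getValue()));
                }
            }
        }

        /** Plays the role of process(): inner-join one streamed record. */
        List<String> process(K key, V value) {
            List<String> out = new ArrayList<>();
            R right = lookup.get(key);
            if (right != null) {
                out.add(key + "\t" + value + "\t" + right);
            }
            return out;
        }
    }

    /** Runs the sketch on made-up sample data and returns the joined rows. */
    static List<String> run() {
        // Assumed small side; in the scenario above this would come from the
        // distributed cache or from an HBase table read into the client.
        Map<String, String> users = new LinkedHashMap<>();
        users.put("u1", "alice");
        users.put("u2", "");      // dropped by the filter below
        users.put("u3", "carol");

        InMemoryJoiner<String, String, String> joiner = new InMemoryJoiner<>();
        joiner.initialize(users.entrySet(), s -> !s.isEmpty(), String::toUpperCase);

        // Assumed large side: the records streamed through the map phase.
        String[][] events = {
            {"u1", "click"}, {"u2", "view"}, {"u4", "click"}, {"u3", "buy"}
        };
        List<String> joined = new ArrayList<>();
        for (String[] ev : events) {
            joined.addAll(joiner.process(ev[0], ev[1]));
        }
        return joined;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

With this sample data only u1 and u3 survive the join: u2 is filtered out
while the table is loaded, and u4 has no match in the table. The point of the
sketch is that the filter/transform runs inside the initialization step, so no
separate pre-processing job is needed for the small side.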
> Improvements to MapsideJoin code
> --------------------------------
>
> Key: CRUNCH-278
> URL: https://issues.apache.org/jira/browse/CRUNCH-278
> Project: Crunch
> Issue Type: Bug
> Components: Core, MapReduce Patterns
> Reporter: Josh Wills
> Assignee: Josh Wills
> Attachments: CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the
> in-memory and MR-based Pipeline instances has always bugged me. I set out
> to eliminate the distinction between the two impls by creating a new
> interface, ReadableSourceBundle<T>, that encapsulates the MR-specific and
> in-memory-specific logic for doing mapside joins. This removes the
> special-case code from MapsideJoinStrategy and should make other
> implementations that use our mapside-join patterns much easier to test.