Re: map side ("replicated") joins in Crunch

Christian Tzolov Tue, 03 Jul 2012 01:42:56 -0700

Hi Gabriel,

It seems the attachment is missing.


Cheers,
Chris

On Tue, Jul 3, 2012 at 9:23 AM, Gabriel Reid <[email protected]> wrote:

> Hi guys,
>
> Attached (hopefully) is a patch for an initial implementation of map
> side joins. It's currently implemented as a static method in a class
> called MapsideJoin, with the same interface as the existing Join class
> (with only inner joins being implemented for now). The way it works is
> that the right-side PTable of the join is put in the distributed cache
> and then read by the join function at runtime.
>
> There's one spot that I can see for a potentially interesting
> optimization -- MRPipeline#run is called once for each map side join
> that is set up, but if the setup of the joins was done within
> MRPipeline, then we could set up multiple map side joins in parallel
> with a single call to MRPipeline#run. OTOH, a whole bunch of map side
> joins in parallel probably isn't that common an operation.
>
> If anyone feels like taking a look at the patch, any feedback is
> appreciated. If nobody sees something that needs serious changes in
> the patch, I'll commit it.
>
> - Gabriel
>
>
> On Thu, Jun 21, 2012 at 9:09 AM, Gabriel Reid <[email protected]>
> wrote:
> > Replying to all...
> >
> > On Thu, Jun 21, 2012 at 8:40 AM, Josh Wills <[email protected]> wrote:
> >>
> >> So there's a philosophical issue here: should Crunch ever make
> >> decisions about how to do something itself based on its estimates of
> >> the size of the data sets, or should it always do exactly what the
> >> developer indicates?
> >>
> >> I can make a case either way, but I think that no matter what, we
> >> would want to have explicit functions for performing a join that reads
> >> one data set into memory, so I think we can proceed w/the
> >> implementation while folks weigh in on what their preferences are for
> >> the default join() behavior (e.g., just do a reduce-side join, or try
> >> to figure out the best join given information about the input data and
> >> some configuration parameters.)
> >>
> >
> > I definitely agree on needing to have an explicit way to invoke one or
> > the other -- and in general I don't like having magic behind the
> > scenes to decide on behaviour (especially considering Crunch is
> > generally intended to be closer to the metal than Pig and Hive). I'm
> > not sure if the runtime decision is something specific to some of my
> > use cases or if it could be useful to a wider audience.
> >
> > The ability to dynamically decide at runtime whether a map side join
> > should be used can also easily be tacked on outside of Crunch, and
> > won't impact the underlying implementation (as you pointed out), so I
> > definitely also agree on focusing on the underlying implementation
> > first, and we can worry about the options used for exposing it later
> > on.
> >
> > - Gabriel
>
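
For readers following the thread, the mechanism Gabriel describes (the right-side table is made available to every map task and read into memory, then each left-side record is joined against it with no shuffle) can be sketched as below. This is only an illustration under stated assumptions, not the code from the patch: it uses plain Python rather than the Crunch API, an in-memory list stands in for the file shipped through the distributed cache, and the names build_lookup and mapside_inner_join are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple


def build_lookup(right: Iterable[Tuple[Hashable, object]]) -> Dict[Hashable, List[object]]:
    """Load the (small) right-side table into memory, grouped by key.

    Stand-in for reading the right-side data from the distributed cache
    once per map task (hypothetical helper, not part of Crunch).
    """
    lookup: Dict[Hashable, List[object]] = defaultdict(list)
    for key, value in right:
        lookup[key].append(value)
    return lookup


def mapside_inner_join(
    left: Iterable[Tuple[Hashable, object]],
    lookup: Dict[Hashable, List[object]],
) -> Iterator[Tuple[Hashable, Tuple[object, object]]]:
    """Stream the left side and emit one pair per matching right-side value.

    Inner-join semantics: left records with no match are dropped.
    """
    for key, left_value in left:
        # .get avoids inserting missing keys into the defaultdict
        for right_value in lookup.get(key, ()):
            yield key, (left_value, right_value)


left = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
right = [("a", "x"), ("b", "y"), ("a", "z")]
result = list(mapside_inner_join(left, build_lookup(right)))
```

The sketch only works when the right side fits in memory, which is the trade-off discussed in the quoted messages: the explicit call leaves that judgement to the developer rather than to size estimates made by the framework.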
