Hi Gabriel,

Seems like the attachment is missing.

Cheers,
Chris

On Tue, Jul 3, 2012 at 9:23 AM, Gabriel Reid <[email protected]> wrote:
> Hi guys,
>
> Attached (hopefully) is a patch for an initial implementation of map
> side joins. It's currently implemented as a static method in a class
> called MapsideJoin, with the same interface as the existing Join class
> (with only inner joins being implemented for now). The way it works is
> that the right-side PTable of the join is put in the distributed cache
> and then read by the join function at runtime.
>
> There's one spot that I can see for a potentially interesting
> optimization -- MRPipeline#run is called once for each map side join
> that is set up, but if the setup of the joins was done within
> MRPipeline, then we could set up multiple map side joins in parallel
> with a single call to MRPipeline#run. OTOH, a whole bunch of map side
> joins in parallel probably isn't that common of an operation.
>
> If anyone feels like taking a look at the patch, any feedback is
> appreciated. If nobody sees something that needs serious changes in
> the patch, I'll commit it.
>
> - Gabriel
>
>
> On Thu, Jun 21, 2012 at 9:09 AM, Gabriel Reid <[email protected]>
> wrote:
> > Replying to all...
> >
> > On Thu, Jun 21, 2012 at 8:40 AM, Josh Wills <[email protected]> wrote:
> >>
> >> So there's a philosophical issue here: should Crunch ever make
> >> decisions about how to do something itself based on its estimates of
> >> the size of the data sets, or should it always do exactly what the
> >> developer indicates?
> >>
> >> I can make a case either way, but I think that no matter what, we
> >> would want to have explicit functions for performing a join that reads
> >> one data set into memory, so I think we can proceed w/the
> >> implementation while folks weigh in on what their preferences are for
> >> the default join() behavior (e.g., just do a reduce-side join, or try
> >> to figure out the best join given information about the input data and
> >> some configuration parameters.)
> >>
> >
> > I definitely agree on needing to have an explicit way to invoke one or
> > the other -- and in general I don't like having magic behind the
> > scenes to decide on behaviour (especially considering Crunch is
> > generally intended to be closer to the metal than Pig and Hive). I'm
> > not sure if the runtime decision is something specific to some of my
> > use cases or if it could be useful to a wider audience.
> >
> > The ability to dynamically decide at runtime whether a map side join
> > should be used can also easily be tacked on outside of Crunch, and
> > won't impact the underlying implementation (as you pointed out), so I
> > definitely also agree on focusing on the underlying implementation
> > first, and we can worry about the options used for exposing it later
> > on.
> >
> > - Gabriel
>
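
For anyone following the thread without the patch: below is a minimal
sketch of the map side join idea described above, in plain Java with no
Crunch or Hadoop dependency. It is not the code from the patch. The class
name MapsideJoinSketch and the methods loadRightSide and innerJoin are
hypothetical, and ordinary in-memory lists stand in for the PTables and
for the data that would be read from the distributed cache. The small
right side is indexed by key once, then the left side is streamed past it
and one output row is emitted per match (inner join only).

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and types are assumptions,
// not the API of the MapsideJoin class in the patch.
public class MapsideJoinSketch {

    // Build an in-memory index of the (small) right side, grouping values by key.
    // In a real job this data would be read from the distributed cache.
    public static <K, V> Map<K, List<V>> loadRightSide(List<Map.Entry<K, V>> right) {
        Map<K, List<V>> index = new HashMap<>();
        for (Map.Entry<K, V> row : right) {
            index.computeIfAbsent(row.getKey(), k -> new ArrayList<>()).add(row.getValue());
        }
        return index;
    }

    // Inner join: stream the left side and emit one row per matching right value.
    // Left rows whose key is absent from the index are dropped.
    public static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> innerJoin(
            List<Map.Entry<K, U>> left, Map<K, List<V>> rightIndex) {
        List<Map.Entry<K, Map.Entry<U, V>>> out = new ArrayList<>();
        for (Map.Entry<K, U> row : left) {
            List<V> matches = rightIndex.get(row.getKey());
            if (matches == null) {
                continue; // no match on the right side
            }
            for (V value : matches) {
                Map.Entry<U, V> pair = new SimpleImmutableEntry<>(row.getValue(), value);
                out.add(new SimpleImmutableEntry<>(row.getKey(), pair));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> orders = new ArrayList<>();
        orders.add(new SimpleImmutableEntry<>("alice", 10));
        orders.add(new SimpleImmutableEntry<>("bob", 20));
        orders.add(new SimpleImmutableEntry<>("alice", 30));

        List<Map.Entry<String, String>> cities = new ArrayList<>();
        cities.add(new SimpleImmutableEntry<>("alice", "Utrecht"));
        cities.add(new SimpleImmutableEntry<>("carol", "Oslo"));

        // Each entry is key -> (left value, right value).
        for (Map.Entry<String, Map.Entry<Integer, String>> row
                : innerJoin(orders, loadRightSide(cities))) {
            System.out.println(row.getKey() + "\t" + row.getValue().getKey()
                    + "\t" + row.getValue().getValue());
        }
    }
}
```

The sketch only works when the right side fits in memory, which is the
same constraint that makes an explicit entry point (rather than an
automatic choice) attractive in the discussion above.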
