Yeah, my experience in these kinds of situations is that you need to come up with a "dummy" or singleton version of U for the case where there is only a single T, and do that conversion on the map side of the job, before the combiner runs. I think Chao had an issue like this a while ago, where he had a PTable<String, Double> and wanted to write a combiner that would return a PTable<String, Collection<Double>>. The solution was to convert the map-side object to a PTable<String, Collection<Double>>, where each map-side value was a singleton list containing just that double value. Does that sort of trick work here?
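A minimal sketch of that singleton trick, written in plain Java collections rather than against the Crunch API, so it runs without Hadoop. The class name SingletonCombineSketch and the helpers toSingleton, groupByKey, combine, and run are hypothetical stand-ins for the map, GBK, and combine steps; none of them are Crunch calls. The point it illustrates is that once every Double is wrapped in a one-element collection, the combiner takes and returns the same type, so it can safely be applied to partial groups and then again to their results.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Crunch-free illustration of the map -> group -> combine shape.
final class SingletonCombineSketch {

    // Map side: turn a single T (Double) into a singleton U (Collection<Double>).
    static Collection<Double> toSingleton(Double value) {
        return Collections.singletonList(value);
    }

    // Combiner: many U in, one U out. Input and output types match, which is
    // what lets a combiner run on partial groups and again on their results.
    static Collection<Double> combine(Iterable<Collection<Double>> values) {
        List<Double> merged = new ArrayList<>();
        for (Collection<Double> v : values) {
            merged.addAll(v);
        }
        return merged;
    }

    // Stand-in for a group-by-key step: collect all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> records) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> r : records) {
            grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return grouped;
    }

    static Map<String, Collection<Double>> run(List<Map.Entry<String, Double>> input) {
        // Step 1: map each (key, Double) to (key, singleton collection).
        List<Map.Entry<String, Collection<Double>>> mapped = new ArrayList<>();
        for (Map.Entry<String, Double> e : input) {
            mapped.add(new AbstractMap.SimpleEntry<>(e.getKey(), toSingleton(e.getValue())));
        }
        // Step 2: group by key.
        Map<String, List<Collection<Double>>> grouped = groupByKey(mapped);
        // Step 3: combine each group into a single collection.
        Map<String, Collection<Double>> out = new TreeMap<>();
        for (Map.Entry<String, List<Collection<Double>>> g : grouped.entrySet()) {
            out.put(g.getKey(), combine(g.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> input = new ArrayList<>();
        input.add(new AbstractMap.SimpleEntry<>("a", 1.0));
        input.add(new AbstractMap.SimpleEntry<>("b", 2.0));
        input.add(new AbstractMap.SimpleEntry<>("a", 3.0));
        System.out.println(run(input)); // prints {a=[1.0, 3.0], b=[2.0]}
    }
}
```

In a real Crunch job the first step would be an ordinary map over the table's values and the last step a CombineFn over the grouped table; the exact calls and PTypes depend on the Crunch version in use and are deliberately left out here.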
On Thu, Oct 17, 2013 at 2:57 PM, Micah Whitacre <[email protected]> wrote:

> Ok so the feature you are trying to achieve is the proactive combination of
> data before performing the GBK like the javadoc describes. Essentially in
> that situation the CombineFn is being used as a Combiner[1] to combine the
> data local to that mapper before doing the GBK and then further combining
> the data in the reduce operation. It will not necessarily eliminate the
> need for all processing in the reduce.
>
> If you want to use this functionality you will need to do the following:
>
> PTable<S, T> map to PTable<S, U>
> PTable<S, U> gbk to PGT<S, U>
> PGT<S, U> combine to PTable<S, U>
>
> This will take advantage of any optimization provided by the CombineFn.
>
> [1] - http://wiki.apache.org/hadoop/HadoopMapReduce
>
> On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas <[email protected]> wrote:
>
> > Hello Micah,
> > Yes, we are using a MapFn now. That aggregation and computation is being
> > done in the reduce phase. Since a CombineFn after GBK runs on the map
> > side, most of those computations, which now run in the reduce phase,
> > could be done on the map side. Some smaller aggregations and
> > computations can be done in the reduce phase.
> > My point was to do some aggregation (and create a new object) in the map
> > phase instead of in the reduce phase.
> >
> > Thanks,
> > Chandan
> >
> > On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre <[email protected]> wrote:
> >
> > > Chandan,
> > > I think what you are wanting will just be a simple MapFn instead of a
> > > CombineFn. The doc of the CombineFn[1] sounds like what you want with
> > > the statement "A special DoFn
> > > <http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/DoFn.html>
> > > implementation that converts an Iterable
> > > <http://download.oracle.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true>
> > > of values into a single value" but it is expecting the value to be of
> > > the same type. Since you are wanting to combine the values into a
> > > different form it should be fairly trivial to write a MapFn that
> > > converts the Iterable<T> -> U.
> > >
> > > [1] - http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html
> > >
> > > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas <[email protected]> wrote:
> > >
> > > > I was trying to refactor some things and tried to use CombineFn.
> > > > But when I dug deeper, I found that I can't do it, as Crunch doesn't
> > > > allow the functionality I needed. For example, I have a
> > > > PGroupedTable<S,T>. I wanted to apply CombineFn<S,T> on it and wanted
> > > > to get PCollection<S,U> instead of T. Right now, CombineFn allows only
> > > > the same type as the return value. The use case is that it would save
> > > > some time in sorting. It's natural that aggregating objects on the map
> > > > side can produce an object of a new, different type.
> > > >
> > > > Any thoughts on it? Am I missing anything? If this can be written in a
> > > > different way using what already exists, please let me know.
> > > >
> > > > Thanks
> > > > Chandan

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
