Re: Process of CombineFn returns ?

Josh Wills Thu, 17 Oct 2013 16:48:07 -0700

On Thu, Oct 17, 2013 at 4:41 PM, Chandan Biswas <[email protected]> wrote:


> Yeah, I agree with Micah that it will not eliminate the reduce phase
> entirely. But the dummy object of U suggested by Josh (or converting to the U
> type in the map for every record) will not improve performance, because the
> same number of records will be sorted and aggregated in the reduce phase.


I don't think that's true: the records of type U will be combined on the
map-side, which would reduce the amount of data that is pushed over the
network and improve performance.
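
To make the mechanics concrete, here is a minimal plain-Java sketch of the
singleton trick, assuming T is Double and U is List<Double> (the thread never
says what T and U really are). It does not use the Crunch API: the class and
method names (SingletonCombineSketch, record, toSingleton, combine, mapSide,
reduceSide) are made up for illustration, and the "mappers" are just in-memory
lists. What it shows is that once every T has been lifted to a U, the combine
step is U-to-U and can run both before and after the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the singleton trick. This is NOT the Crunch API;
// all names here are hypothetical and exist only for illustration.
final class SingletonCombineSketch {

    // Helper: build one (key, T) input record.
    static Map.Entry<String, Double> record(String key, double value) {
        return new AbstractMap.SimpleEntry<String, Double>(key, value);
    }

    // Map side, before the combiner: lift a single T into a "singleton" U.
    static List<Double> toSingleton(double value) {
        List<Double> u = new ArrayList<>();
        u.add(value);
        return u;
    }

    // Combine step (U, U) -> U. Input and output types match, so the same
    // function can run on the map side and again in the reduce.
    static List<Double> combine(List<Double> left, List<Double> right) {
        List<Double> merged = new ArrayList<>(left);
        merged.addAll(right);
        return merged;
    }

    // Models one mapper: lift every record, then combine locally per key.
    static Map<String, List<Double>> mapSide(List<Map.Entry<String, Double>> records) {
        Map<String, List<Double>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> r : records) {
            out.merge(r.getKey(), toSingleton(r.getValue()), SingletonCombineSketch::combine);
        }
        return out;
    }

    // Models the reduce: combine the partial U values from all mappers.
    static Map<String, List<Double>> reduceSide(List<Map<String, List<Double>>> mapperOutputs) {
        Map<String, List<Double>> out = new LinkedHashMap<>();
        for (Map<String, List<Double>> partial : mapperOutputs) {
            partial.forEach((k, v) -> out.merge(k, v, SingletonCombineSketch::combine));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> mapper1 =
            Arrays.asList(record("a", 1.0), record("a", 2.0), record("b", 3.0));
        List<Map.Entry<String, Double>> mapper2 = Arrays.asList(record("a", 4.0));

        Map<String, List<Double>> result =
            reduceSide(Arrays.asList(mapSide(mapper1), mapSide(mapper2)));
        System.out.println(result); // prints {a=[1.0, 2.0, 4.0], b=[3.0]}
    }
}
```

In this model the first mapper hands 2 records to the shuffle instead of 3.
How much that saves in bytes depends on U: a concatenated list is no smaller
than its parts, whereas something like a running sum or count would be.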

Can you give any additional details about what T and U are in this
scenario? :)



> But my point is: can we improve it by applying a combiner whose CombineFn
> outputs a different type? With the same type, we can use the combiner to
> do some aggregation on the map side, which improves performance. Can we
> have some mechanism that gives the same advantage when the CombineFn emits
> a different type? I think requiring CombineFn to emit the same type
> restricts its use. Could we have a new CombineFn that can output a type
> different from its input?
>
>
> On Thu, Oct 17, 2013 at 5:05 PM, Josh Wills <[email protected]> wrote:
>
> > Yeah, my experience in these kinds of situations is that you need to come
> > up with a "dummy" or singleton version of U for the case where there is
> > only a single T and do that conversion on the map side of the job, before
> > the combiner runs. I think Chao had an issue like this a while ago, where
> > he had a PTable<String, Double> and wanted to write a combiner that would
> > return a PTable<String, Collection<Double>>. The solution was to convert
> > the map-side object to a PTable<String, Collection<Double>>, where the
> > value on the map side was a singleton list containing just that double
> > value. Does that sort of trick work here?
> >
> >
> > On Thu, Oct 17, 2013 at 2:57 PM, Micah Whitacre <[email protected]> wrote:
> >
> > > OK, so the feature you are trying to achieve is the proactive combination
> > > of data before performing the GBK, as the javadoc describes. Essentially,
> > > in that situation the CombineFn is being used as a Combiner[1] to combine
> > > the data local to that mapper before doing the GBK and then further
> > > combining the data in the reduce operation. It will not necessarily
> > > eliminate the need for all processing in the reduce.
> > >
> > > If you want to use this functionality you will need to do the following:
> > >
> > > PTable<S, T> map to PTable<S, U>
> > > PTable<S, U> gbk to PGroupedTable<S, U>
> > > PGroupedTable<S, U> combine to PTable<S, U>
> > >
> > > This will take advantage of any optimization provided by the CombineFn.
> > >
> > > [1] - http://wiki.apache.org/hadoop/HadoopMapReduce
> > >
> > >
> > >
> > > On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas <[email protected]> wrote:
> > >
> > > > Hello Micah,
> > > > Yes, we are using a MapFn now. That aggregation and computation is being
> > > > done in the reduce phase. Since a CombineFn after a GBK runs on the map
> > > > side, most of the computation that now runs in the reduce phase could be
> > > > done on the map side, leaving only some smaller aggregations and
> > > > computations for the reduce phase. My point was to do some aggregation
> > > > (and create a new object) in the map phase instead of in the reduce
> > > > phase.
> > > >
> > > > Thanks,
> > > > Chandan
> > > >
> > > >
> > > > On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre <[email protected]> wrote:
> > > >
> > > > > Chandan,
> > > > >    I think what you want is just a simple MapFn instead of a
> > > > > CombineFn. The doc for CombineFn[1] sounds like what you want, with the
> > > > > statement "A special DoFn
> > > > > <http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/DoFn.html>
> > > > > implementation that converts an Iterable
> > > > > <http://download.oracle.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true>
> > > > > of values into a single value", but it expects the value to be of the
> > > > > same type. Since you want to combine the values into a different form,
> > > > > it should be fairly trivial to write a MapFn that converts the
> > > > > Iterable<T> -> U.
> > > > >
> > > > > [1] - http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html
> > > > >
> > > > >
> > > > > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas <[email protected]> wrote:
> > > > >
> > > > > > I was trying to refactor some code to use a CombineFn, but when I
> > > > > > dug deeper I found that I can't, because Crunch doesn't provide the
> > > > > > functionality I need. For example, I have a PGroupedTable<S,T>. I
> > > > > > wanted to apply a CombineFn<S,T> to it and get a PCollection<S,U>
> > > > > > instead of T. Right now, CombineFn allows only the same type as its
> > > > > > return value. The use case for this is that there would be some time
> > > > > > saved in sorting. It's natural that aggregating some objects on the
> > > > > > map side can create an object of a new, different type.
> > > > > >
> > > > > > Any thoughts on it? Am I missing anything? If this can be written in
> > > > > > a different way using the existing API, please let me know.
> > > > > >
> > > > > > Thanks
> > > > > > Chandan
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
