Re: MapReduce Stats calculations

Grant Ingersoll Fri, 06 May 2011 10:16:10 -0700

Meant to send this to dev@

On May 6, 2011, at 9:58 AM, Sean Owen wrote:


> Hadoop has something like this:
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html

Cool, and more importantly, it seems to provide a framework for such pieces.
I'll try that one out too.

> 
> I find there's a very strong and unfortunate tension between
> reusability and performance in some cases. Having a discrete stage to
> compute something like this is good; if it can be computed inline in a
> prior stage and output on the side, that's a big performance savings.
> 
> I also find myself tempted to construct a bunch of M/R primitives. For
> now I am trying to restrict my thinking to refactoring pieces that can
> come out easily, and that are used already in at least one place.

I think that's in line w/ what I did on M-686: I put in variance and std. dev.
b/c it needs them. I just put them in a place that allows others to add more as
we need them (along the lines of what Ted is suggesting).

> 
> I suppose I mean: if you want to write primitive X and can't find one
> good use for it yet in Mahout, I'd hold off, but otherwise would
> surely add it and use it.
> 
> 
> On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <[email protected]> wrote:
>> MAHOUT-688 has an M/R job to calculate std. deviation for document 
>> frequencies so that it can prune noisy words.  I'm thinking of making it a 
>> bit more generic and adding a stats package to org.apache.mahout.math.hadoop 
>> that contains this and other basic stats calculations (mean, variance, sum 
>> of squares, etc.) that operate in M/R.
>> 
>> Is that useful or am I re-inventing the wheel here or wasting time?  Seems 
>> like such a beast should already exist, but a quick search didn't turn up 
>> much.
>> 
>> -Grant
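
For illustration only, here is a minimal sketch of why the basic stats named in
the quoted message (mean, variance, sum of squares) fit M/R well: each record
maps to a small partial (count, sum, sum of squares), and partials add
component-wise, so the same function can serve as combiner and reducer. This is
not the MAHOUT-688 code and it does not use Hadoop; the names map_partial,
combine, finalize, and stats are hypothetical, and it assumes the population
variance (divide by n) is what is wanted.

```python
import math
from functools import reduce

# Identity element for combine(): zero records seen so far.
EMPTY = (0, 0.0, 0.0)


def map_partial(value):
    """Map step: turn one observation into (count, sum, sum of squares)."""
    return (1, float(value), float(value) * float(value))


def combine(a, b):
    """Combine/reduce step: partials add component-wise, so the order in
    which splits are merged does not affect the totals."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(partial):
    """Derive mean, population variance, and std. dev. from the totals."""
    n, total, total_sq = partial
    if n == 0:
        raise ValueError("no observations")
    mean = total / n
    # E[x^2] - mean^2; clamp at zero in case rounding pushes it negative.
    variance = max(total_sq / n - mean * mean, 0.0)
    return {"mean": mean, "variance": variance, "std_dev": math.sqrt(variance)}


def stats(splits):
    """Simulate M/R: aggregate each split independently, then merge."""
    per_split = [reduce(combine, map(map_partial, s), EMPTY) for s in splits]
    return finalize(reduce(combine, per_split, EMPTY))


if __name__ == "__main__":
    # Two "input splits", e.g. document frequencies seen by two mappers.
    print(stats([[2, 4, 4, 4], [5, 5, 7, 9]]))
```

One caveat on the design: the sum-of-squares formula can lose precision when
the variance is small relative to the mean, so a real job may prefer to merge
(count, mean, M2) partials instead.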
