I have done some testing and have been unable to demonstrate a big
difference in allocating versus re-using.  Re-using is, however, *really*
error-prone.

I think that most of the supposed cost of new allocations is actually the
cost of copying large data rather than the cost of allocating the
container.  Here, the largest copy is the new DenseVector.
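
To be concrete, the reuse pattern people usually have in mind looks
roughly like this.  This is a minimal sketch against the Hadoop new-API
Mapper, with key/value types matching the write calls Frank points at
below; the class and field names are mine, not DirichletMapper's:

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.mahout.math.VectorWritable;

  // Hypothetical mapper showing the reuse pattern; not the real DirichletMapper.
  public class ReuseSketchMapper
      extends Mapper<WritableComparable<?>, VectorWritable, Text, VectorWritable> {

    // one Text allocated per task instead of one per record
    private final Text outputKey = new Text();

    @Override
    protected void map(WritableComparable<?> key, VectorWritable value, Context context)
        throws IOException, InterruptedException {
      int k = 0;  // stand-in for the real model assignment
      outputKey.set(String.valueOf(k));  // mutate in place instead of new Text(...)
      context.write(outputKey, value);   // safe only because write() serializes immediately
    }
  }

All this saves is a short-lived Text; the VectorWritable still has to be
serialized in full either way, which is where the time actually goes.  And
the moment anything downstream holds a reference to the reused object you
get silent data corruption, which is the error-proneness I mean.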

All of these pale beside bad arithmetic and the lack of a combiner.
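
On the combiner point, a combiner only pays off if the map output folds
associatively into partial results.  Something with this shape is what I
have in mind; whether Dirichlet's per-model statistics actually sum this
way is exactly the open question, and the class name here is mine:

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  // Hypothetical combiner folding per-model vectors into partial sums
  // before they cross the network.
  public class PartialSumCombiner
      extends Reducer<Text, VectorWritable, Text, VectorWritable> {

    @Override
    protected void reduce(Text modelId, Iterable<VectorWritable> values, Context context)
        throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : values) {
        // copy the first vector so we never mutate a value owned by the framework
        sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
      }
      if (sum != null) {
        context.write(modelId, new VectorWritable(sum));
      }
    }
  }

Wired in with job.setCombinerClass(PartialSumCombiner.class), that would
shrink the shuffle by roughly the number of records per model per mapper.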

On Wed, Nov 2, 2011 at 2:37 PM, Frank Scholten <[email protected]> wrote:

> Maybe not a major thing, but in the DirichletMapper I see that
> Writables are not reused but new-ed
>
> Line 44: context.write(new Text(String.valueOf(k)), v);
>
> and in the for loop in the setup method
>
> Line 58: context.write(new Text(Integer.toString(i)), new VectorWritable(new DenseVector(0)));
>
> See
> http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
>
> Frank
>
> On Wed, Nov 2, 2011 at 10:13 PM, Grant Ingersoll <[email protected]>
> wrote:
> > Tim Potter and I have tried running Dirichlet in the past on the ASF
> > email set on EC2 and it didn't seem to scale all that well, so I was
> > wondering if people had ideas on improving its speed.  One question I had
> > is whether we could inject a Combiner into the process?  Ted also mentioned
> > that there might be faster ways to check the models, but I will ask him to
> > elaborate.
> >
> > Thanks,
> > Grant
>
