Re: Distance calculation performance issue

Grant Ingersoll Wed, 29 Jul 2009 18:49:58 -0700


On Jul 29, 2009, at 9:07 AM, nfantone wrote:

Grant, I took a look at your patch. It seems as though you did
something similar to what I did. However, I believe that there's still
room for improvement as there are things being calculated
unnecessarily for no apparent reason. Could you please read my
previous post? At least the "excursus" bit. I may be totally wrong,
though: some particular parts were a bit obscure to me. Perhaps you
(or Shashikant) can throw some light in there? We might be able to
release a bigger/better patch.

Agreed, can you put your changes up as a patch on MAHOUT-121? That way we can do file diffs, etc.

I think your data set ran, for 10 iterations, in just over 2 minutes,
and that was with the profiler hooked up, too.


Um... I also did that and, while it was considerably faster than
before, it took about 2 hours to complete (it used to take days, mind
you), using a 4-node Hadoop cluster. The actual vector clustering
alone, that is, the final step, took just over an hour:

Started at: Tue Jul 28 17:44:20 ART 2009
Finished at: Tue Jul 28 18:46:24 ART 2009
Finished in: 1hrs, 2mins, 4sec

How exactly did you launch the job? What convergence delta did you
choose? How many clusters did you set up initially?

--input ../nfantone/user.data --clusters ../nfantone/output/clusters --k 10 --output ../content/nfantone/output/ --convergence 0.01 --overwrite

So, it wasn't exactly what you were running. I will try to run yours at some point.


-Grant
