On Jul 29, 2009, at 9:07 AM, nfantone wrote:
Grant, I took a look at your patch. It seems as though you did
something similar to what I did. However, I believe that there's still
room for improvement, as some things are still being calculated
unnecessarily. Could you please read my
previous post? At least the "excursus" bit. I may be totally wrong,
though: some particular parts were a bit obscure to me. Perhaps you
(or Shashikant) can shed some light on them? We might be able to
release a bigger/better patch.
Agreed, can you put your changes up as a patch on MAHOUT-121? That
way we can do file diffs, etc.
I think your data set ran, for 10 iterations, in just over 2 minutes,
and that was with the profiler hooked up, too.
Um... I also did that and, while it was considerably faster than
before, it took about 2 hours to complete (it used to take days, mind
you), using a 4-node Hadoop cluster. The actual vector clustering
alone (the final step) took just over an hour:
Started at: Tue Jul 28 17:44:20 ART 2009
Finished at: Tue Jul 28 18:46:24 ART 2009
Finished in: 1hrs, 2mins, 4sec
How exactly did you launch the job? What convergence delta did you
choose? How many clusters did you set up initially?
--input ../nfantone/user.data --clusters ../nfantone/output/clusters
--k 10 --output ../content/nfantone/output/ --convergence 0.01 --overwrite
So, it wasn't exactly what you were running. I will try to run yours
at some point.
-Grant