Grant, I took a look at your patch. It seems you did something similar to what I did. However, I believe there's still room for improvement, as some things are still being calculated unnecessarily. Could you please read my previous post? At least the "excursus" bit. I may be totally wrong, though: some parts were a bit obscure to me. Perhaps you (or Shashikant) can shed some light on them? We might be able to release a bigger/better patch.

>> I think your data set ran, for 10 iterations, in just over 2 minutes
>> and that was with the profiler hooked up, too.

Um... I also did that and, while it was considerably faster than before, it took about 2 hours to complete (it used to take days, mind you) on a 4-node Hadoop cluster. The final step alone, the actual vector clustering, took just over an hour:

Started at: Tue Jul 28 17:44:20 ART 2009
Finished at: Tue Jul 28 18:46:24 ART 2009
Finished in: 1hrs, 2mins, 4sec

How exactly did you launch the job? What convergence delta did you choose? How many clusters did you set up initially?
