Re: Clustering from DB

Grant Ingersoll Thu, 23 Jul 2009 05:55:08 -0700


On Jul 22, 2009, at 10:22 AM, nfantone wrote:

After setting the cluster up with 6 computers (two of them being
QuadCore and the others, DualCore, totaling 16 slave cores) and
running a KMeansDriver job with 32 reduce tasks and ~80 map tasks
spawned, it's STILL awfully slow.

./bin/hadoop jar ~/mahout-core-0.2.jar
org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
init -o output -r 32 -d 0.001 -k 200

Using a pretty small dataset of 62MB it took more than a whole day to
complete. The DataNode and JobTracker logs don't show any visible
errors, either. Would you mind sharing any piece of advice that could
help me tune this thing up with my settings?


That does seem like a long time.

Is your data sparse or dense?

Perhaps a larger convergence value might help (-d, I believe).
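
To illustrate why a larger -d value could cut the running time, here is a minimal sketch of a k-means loop that stops once no centroid moves by more than a convergence threshold. This is not Mahout's implementation: the class name, method name, 1-D data, and starting centroids are all made up for illustration. The point is only that a looser threshold can never need more iterations than a tighter one from the same start, and in a Hadoop setup each iteration is typically a full pass over the data.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from Mahout's KMeansDriver.
public class ConvergenceSketch {

    /** Runs 1-D k-means and returns how many iterations were performed. */
    static int iterationsUntilConverged(double[] points, double[] initial,
                                        double delta, int maxIterations) {
        double[] centroids = Arrays.copyOf(initial, initial.length);
        for (int iter = 1; iter <= maxIterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];

            // Assignment step: attach each point to its nearest centroid.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                        best = c;
                    }
                }
                sums[best] += p;
                counts[best]++;
            }

            // Update step: move each centroid to the mean of its points
            // and record the largest movement seen in this iteration.
            double maxShift = 0.0;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double updated = sums[c] / counts[c];
                maxShift = Math.max(maxShift, Math.abs(updated - centroids[c]));
                centroids[c] = updated;
            }

            // Convergence test: stop when no centroid moved more than delta.
            if (maxShift <= delta) {
                return iter;
            }
        }
        return maxIterations;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 1.9, 5.0, 5.1, 6.3, 9.0, 9.4, 10.2};
        double[] initial = {0.0, 1.0, 2.0};
        System.out.println("delta=0.001 -> "
                + iterationsUntilConverged(points, initial, 0.001, 100) + " iterations");
        System.out.println("delta=0.5   -> "
                + iterationsUntilConverged(points, initial, 0.5, 100) + " iterations");
    }
}
```

How much a looser threshold saves depends entirely on the data, so it is worth comparing a couple of values on a small sample before rerunning the full job.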

Is there any chance your data is publicly shareable? Come to think of it, with the vector representations, as long as you don't publish the key (which terms map to which index), I would think most all data is publicly shareable.

Are you on trunk of Mahout? I think we still need more profiling to get a better idea of where improvements can be made.


-Grant
