On Jul 22, 2009, at 10:22 AM, nfantone wrote:
After setting the cluster up with 6 computers (two of them QuadCore
and the other four DualCore, totaling 16 slave cores) and running a
KMeansDriver job with 32 reduce tasks and ~80 map tasks spawned, it's
STILL awfully slow.
./bin/hadoop jar ~/mahout-core-0.2.jar
org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
init -o output -r 32 -d 0.001 -k 200
Using a pretty small dataset of 62 MB, it took more than a whole day
to complete. The DataNode and JobTracker logs don't show any visible
errors, either. Would you mind sharing any advice that could help me
tune this thing up with my settings?
That does seem like a long time.
Is your data sparse or dense?
Perhaps a larger convergence value might help (-d, I believe).
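
To illustrate why a larger convergence value can shorten a run, here
is a minimal sketch of plain Lloyd-style k-means in Python. It is not
Mahout's implementation; it only assumes that -d acts as a threshold
on how far centroids move between iterations. The function name
kmeans_iterations and the sample points are made up for the example.

```python
import math

def kmeans_iterations(points, centroids, delta, max_iter=100):
    """Run plain k-means and return (iterations_run, final_centroids).

    Stops as soon as the largest centroid movement in one iteration
    falls below `delta` (assumed here to play the role of -d).
    """
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster in place

        # Convergence test: largest movement of any centroid.
        shift = max(math.dist(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < delta:
            return iteration, centroids
    return max_iter, centroids

# Hypothetical toy data: two loose groups of 2-D points.
points = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (1.2, 0.9),
          (8.0, 8.0), (8.5, 7.6), (9.0, 8.3), (7.4, 9.1)]
start = [(0.0, 0.0), (1.0, 0.1)]

loose_iters, _ = kmeans_iterations(points, start, delta=0.5)
tight_iters, _ = kmeans_iterations(points, start, delta=0.001)
print(loose_iters, tight_iters)
```

Because both runs follow the same sequence of centroid updates, the
looser threshold can only stop at the same iteration or earlier. On a
cluster, each iteration saved is one fewer MapReduce pass over the
data.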
Is there any chance your data is publicly shareable? Come to think of
it, with the vector representations, as long as you don't publish the
key (which terms map to which index), I would think almost all data is
publicly shareable.
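
As a rough sketch of that idea, the Python below turns tokenized
documents into sparse index-to-count vectors while the term-to-index
key stays private. The helper names build_private_key and
to_sparse_vector are hypothetical, and nothing here reflects Mahout's
own vector format.

```python
import random

def build_private_key(documents, seed=None):
    """Map each distinct term to an index in shuffled order.

    The returned dict is the "key" that would be kept private.
    """
    terms = sorted({term for doc in documents for term in doc})
    random.Random(seed).shuffle(terms)
    return {term: index for index, term in enumerate(terms)}

def to_sparse_vector(doc, key):
    """Return {index: count} for one document; no terms appear in it."""
    counts = {}
    for term in doc:
        index = key[term]
        counts[index] = counts.get(index, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical toy corpus, already tokenized.
documents = [["alpha", "beta", "alpha"], ["beta", "gamma"]]
key = build_private_key(documents, seed=1)            # kept private
vectors = [to_sparse_vector(d, key) for d in documents]  # shareable
print(vectors)
```

Only the vectors would be shared; without the key, the indices are
just opaque integers.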
Are you on trunk of Mahout? I think we still need more profiling to
get a better idea of where improvements can be made.
-Grant