45 mb is pretty small. When you run your job, can you look at the Job Tracker web page? It should tell you how many mappers and reducers you are getting on each run. The hadoop hdfs block size is 128 mb by default so my guess is you are only getting one split, one mapper and thus no improvements on larger clusters. Try clustering a dozen copies of that file and you should see more improvements with more instances. Also, you may need to up the number of reducers --maxRed as the default is 2.

On 9/14/10 6:01 AM, Grant Ingersoll wrote:
I have some questions on benchmarking that I wanted to get others opinions on.

This week I have been trying out EMR and KMeans with the goal of doing some 
benchmarking both for the community and for Taming Text.  For starters, I put 
up a file of ~45 MB containing roughly 110K sparse vectors.  I know, pretty 
small, but it is a start.  I tried this out on 2, 4 and 8 instances.  The time 
to complete the clustering for all variations in preliminary runs (I haven't 
done repeats yet to get an average) was about the same.  I'm guessing, this is 
due to either the overhead of Hadoop or possibly the fact that the file is so 
small that it isn't split, but, since I'm a newbie to EMR, I thought I would 
ask what others opinions are.  I have done no Hadoop tuning at this point.  
What do people think?  Should I be seeing more speedup at this point?

FWIW, I am in the process right now of copying over all ASF mail archives to S3 
(~80-100GB uncompressed, 8.5 GB compressed --thankfully, Amazon has free 
inbound now) and plan on testing on a larger set once I can get them into 
Mahout format.  If anyone has anything bigger and can share it, let me know.

FTR, I ran: elastic-mapreduce -j j-3QNGDH7H7EXG8  --jar 
s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job  --main-class 
org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg 
s3n://PATH/part-out.vec --arg --clusters --arg 
s3n://news-vecs/kmeans/clusters-9-11/ --arg -k --arg 10 --arg --output --arg 
s3n://PATH/out-9-11/ --arg --distanceMeasure --arg  
org.apache.mahout.common.distance.CosineDistanceMeasure --arg 
--convergenceDelta --arg 0.001 --arg --overwrite --arg --maxIter --arg 50 --arg 
--clustering

-Grant



Reply via email to