Re: Early Stage Benchmarking

Jeff Eastman Tue, 14 Sep 2010 06:24:09 -0700

45 MB is pretty small. When you run your job, can you look at the JobTracker web page? It should tell you how many mappers and reducers you are getting on each run. The Hadoop HDFS block size is 128 MB by default, so my guess is you are only getting one split, one mapper, and thus no improvements on larger clusters. Try clustering a dozen copies of that file and you should see more improvements with more instances. Also, you may need to up the number of reducers with --maxRed, as the default is 2.
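
To make the split arithmetic concrete, here is a rough sketch in Python. It is only an approximation of how a file-based input format carves a file into splits, not the actual Hadoop code. It assumes the split size equals the 128 MB block size quoted above, that each file is split independently, and that a trailing remainder of up to 10% of a split is folded into the last split. The names estimate_splits and SPLIT_SLOP are chosen here for illustration.

```python
MB = 1024 * 1024

# Assumption: a remainder of up to 10% over the split size is absorbed
# into the final split instead of becoming a tiny split of its own.
SPLIT_SLOP = 1.1


def estimate_splits(file_bytes, split_bytes):
    """Estimate how many input splits (and so map tasks) one file yields."""
    if file_bytes <= 0 or split_bytes <= 0:
        raise ValueError("sizes must be positive")
    splits = 0
    remaining = file_bytes
    # Peel off full-size splits while clearly more than one split remains.
    while remaining / split_bytes > SPLIT_SLOP:
        splits += 1
        remaining -= split_bytes
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


def estimate_mappers(file_sizes, split_bytes):
    """Sum per-file estimates, since each file is split on its own."""
    return sum(estimate_splits(size, split_bytes) for size in file_sizes)


if __name__ == "__main__":
    block = 128 * MB  # block size as quoted in this message
    one_file = [45 * MB]
    dozen_copies = [45 * MB] * 12
    print("one 45 MB file:", estimate_mappers(one_file, block))
    print("twelve copies: ", estimate_mappers(dozen_copies, block))
```

Under those assumptions a single 45 MB file gives one split, so only one mapper has work no matter how many instances are running, while a dozen copies stored as separate files give a dozen splits that can be spread across the cluster.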


On 9/14/10 6:01 AM, Grant Ingersoll wrote:

I have some questions on benchmarking that I wanted to get others' opinions on.


This week I have been trying out EMR and KMeans with the goal of doing some 
benchmarking both for the community and for Taming Text.  For starters, I put 
up a file of ~45 MB containing roughly 110K sparse vectors.  I know, pretty 
small, but it is a start.  I tried this out on 2, 4 and 8 instances.  The time 
to complete the clustering for all variations in preliminary runs (I haven't 
done repeats yet to get an average) was about the same.  I'm guessing this is 
due to either the overhead of Hadoop or possibly the fact that the file is so 
small that it isn't split, but since I'm a newbie to EMR, I thought I would 
ask what others' opinions are.  I have done no Hadoop tuning at this point.  
What do people think?  Should I be seeing more speedup at this point?

FWIW, I am in the process right now of copying over all ASF mail archives to S3 
(~80-100GB uncompressed, 8.5 GB compressed --thankfully, Amazon has free 
inbound now) and plan on testing on a larger set once I can get them into 
Mahout format.  If anyone has anything bigger and can share it, let me know.

FTR, I ran:

elastic-mapreduce -j j-3QNGDH7H7EXG8 \
  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job \
  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver \
  --arg --input --arg s3n://PATH/part-out.vec \
  --arg --clusters --arg s3n://news-vecs/kmeans/clusters-9-11/ \
  --arg -k --arg 10 \
  --arg --output --arg s3n://PATH/out-9-11/ \
  --arg --distanceMeasure \
  --arg org.apache.mahout.common.distance.CosineDistanceMeasure \
  --arg --convergenceDelta --arg 0.001 \
  --arg --overwrite \
  --arg --maxIter --arg 50 \
  --arg --clustering

-Grant
