45 mb is pretty small. When you run your job, can you look at the Job
Tracker web page? It should tell you how many mappers and reducers you
are getting on each run. The hadoop hdfs block size is 128 mb by default
so my guess is you are only getting one split, one mapper and thus no
improvements on larger clusters. Try clustering a dozen copies of that
file and you should see more improvements with more instances. Also, you
may need to up the number of reducers --maxRed as the default is 2.
On 9/14/10 6:01 AM, Grant Ingersoll wrote:
I have some questions on benchmarking that I wanted to get others opinions on.
This week I have been trying out EMR and KMeans with the goal of doing some
benchmarking both for the community and for Taming Text. For starters, I put
up a file of ~45 MB containing roughly 110K sparse vectors. I know, pretty
small, but it is a start. I tried this out on 2, 4 and 8 instances. The time
to complete the clustering for all variations in preliminary runs (I haven't
done repeats yet to get an average) was about the same. I'm guessing, this is
due to either the overhead of Hadoop or possibly the fact that the file is so
small that it isn't split, but, since I'm a newbie to EMR, I thought I would
ask what others opinions are. I have done no Hadoop tuning at this point.
What do people think? Should I be seeing more speedup at this point?
FWIW, I am in the process right now of copying over all ASF mail archives to S3
(~80-100GB uncompressed, 8.5 GB compressed --thankfully, Amazon has free
inbound now) and plan on testing on a larger set once I can get them into
Mahout format. If anyone has anything bigger and can share it, let me know.
FTR, I ran: elastic-mapreduce -j j-3QNGDH7H7EXG8 --jar
s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job --main-class
org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg
s3n://PATH/part-out.vec --arg --clusters --arg
s3n://news-vecs/kmeans/clusters-9-11/ --arg -k --arg 10 --arg --output --arg
s3n://PATH/out-9-11/ --arg --distanceMeasure --arg
org.apache.mahout.common.distance.CosineDistanceMeasure --arg
--convergenceDelta --arg 0.001 --arg --overwrite --arg --maxIter --arg 50 --arg
--clustering
-Grant