On Tue, Sep 14, 2010 at 6:01 AM, Grant Ingersoll <[email protected]> wrote:
> I have some questions on benchmarking that I wanted to get others'
> opinions on.
>
> This week I have been trying out EMR and KMeans with the goal of doing
> some benchmarking, both for the community and for Taming Text. For
> starters, I put up a file of ~45 MB containing roughly 110K sparse
> vectors. I know, pretty small, but it is a start. I tried this out on 2,
> 4 and 8 instances. The time to complete the clustering for all variations
> in preliminary runs (I haven't done repeats yet to get an average) was
> about the same.
>
> I'm guessing this is due to either the overhead of Hadoop or possibly the
> fact that the file is so small that it isn't split.

I'm betting this is what you are seeing. To do proper multi-node
benchmarking, I think you'll have to up your input size by a couple of
orders of magnitude.

> FWIW, I am in the process right now of copying over all ASF mail archives
> to S3 (~80-100 GB uncompressed, 8.5 GB compressed -- thankfully, Amazon
> has free inbound now) and plan on testing on a larger set once I can get
> them into Mahout format. If anyone has anything bigger and can share it,
> let me know.

That's certainly a nicer set!

-jake
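To make the "file is so small that it isn't split" point concrete, here is a rough back-of-the-envelope sketch (not Mahout or Hadoop code; the default 64 MB block size is an assumption) of how many map tasks Hadoop would launch, assuming one input split per HDFS block:

```python
# Toy estimate of Hadoop map-task parallelism for a single input file,
# assuming one split per (full or partial) HDFS block. The 64 MB default
# block size is an assumption; real deployments may configure it differently.
def estimate_splits(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Return the number of input splits Hadoop would create for one file."""
    return max(1, -(-file_size_bytes // block_size_bytes))  # ceiling division

# A ~45 MB vector file fits inside a single 64 MB block: one split, one
# mapper, so adding instances past the first buys no map-side parallelism.
print(estimate_splits(45 * 1024 * 1024))

# An 8.5 GB compressed archive spans many blocks and would fan out widely
# (if stored in a splittable format).
print(estimate_splits(int(8.5 * 1024 ** 3)))
```

This would explain why 2, 4, and 8 instances all finish in roughly the same time: with a single split, the extra nodes sit idle during the map phase, and the fixed Hadoop job-startup overhead dominates.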
