On Tue, Sep 14, 2010 at 6:01 AM, Grant Ingersoll <[email protected]> wrote:

> I have some questions on benchmarking that I wanted to get others opinions
> on.
>
> This week I have been trying out EMR and KMeans with the goal of doing some
> benchmarking both for the community and for Taming Text.  For starters, I
> put up a file of ~45 MB containing roughly 110K sparse vectors.  I know,
> pretty small, but it is a start.  I tried this out on 2, 4 and 8 instances.
>  The time to complete the clustering for all variations in preliminary runs
> (I haven't done repeats yet to get an average) was about the same.

> I'm guessing this is due either to the overhead of Hadoop or to the
> fact that the file is so small that it isn't split,


I'm betting this is what you are seeing.

To do proper multi-node benchmarking, I think you'll have to increase your
input size by a couple of orders of magnitude.
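To see why the 45 MB file doesn't parallelize, here is a back-of-the-envelope sketch (not Mahout or Hadoop source, just the arithmetic): Hadoop's FileInputFormat roughly creates one map task per input split, and the default split size is the HDFS block size (64 MB by default in the Hadoop of that era). A 45 MB file yields a single split, so only one mapper runs regardless of cluster size.

```java
// Hypothetical helper illustrating split-count arithmetic; the real
// FileInputFormat.getSplits() logic also applies a slop factor and
// min/max split settings, which we ignore here.
public class SplitEstimate {
    /** Approximate split count: ceil(fileSize / splitSize). */
    static long splits(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb; // assumed default dfs.block.size (64 MB)

        // 45 MB input -> 1 split -> 1 mapper, so extra nodes sit idle.
        System.out.println(splits(45 * mb, blockSize));

        // ~8.5 GB (the compressed mail archive) -> ~133 splits,
        // enough work to keep 2, 4, or 8 instances busy.
        System.out.println(splits(8500 * mb, blockSize));
    }
}
```

That's why 2, 4, and 8 instances all finish in about the same time on the small file: the map phase has only one task to hand out.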


> FWIW, I am in the process right now of copying over all ASF mail archives
> to S3 (~80-100GB uncompressed, 8.5 GB compressed --thankfully, Amazon has
> free inbound now) and plan on testing on a larger set once I can get them
> into Mahout format.  If anyone has anything bigger and can share it, let me
> know.
>

That's certainly a nicer set!

  -jake
