nfantone wrote:
Well, I grew tired of watching the whole thing run and stopped it. I then
started another test, this time using a smaller dataset of 3 GB, and it is
still taking way too long.
See inline comments.
You are only specifying a single reducer. Try increasing that as below.
I did. I set it to my K value (200).
Way too big given your single-node operation. See below.
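On one box you want something in the low single digits, not one reducer per
cluster. A rough sketch of what I mean, assuming the old
org.apache.hadoop.mapred API (the driver class name and paths are
placeholders, not your actual job):

    // Sketch only: keep the reducer count small on a single node.
    // With 200 reducers on one machine you mostly pay task setup cost.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class KMeansIterationJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(KMeansIterationJob.class);
        conf.setJobName("kmeans-iteration");

        // 1-2 reducers per core is plenty; the reducer count should
        // reflect your hardware, not your k value.
        conf.setNumReduceTasks(2);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Your mapper/reducer classes would be set here as usual.
        JobClient.runJob(conf);
      }
    }

If your driver goes through ToolRunner, passing -D mapred.reduce.tasks=2 on
the command line has the same effect.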
No, the number of nodes is the number of machines (computers) in your cluster.
You did not say how many nodes you are running on.
I'm running and compiling the application on a single desktop computer at
work, and that isn't likely to change after the development process is
finished.
This is the root of your problem: you only have a single node in your
cluster. Running Hadoop in this configuration is possible, but it will
be much slower than if you had more machines. Perhaps you can interest
some of your colleagues in donating storage and cycles on their machines
to your effort. When I was at CollabNet, I got a dozen developers'
machines running in a cluster so I could test out the early clustering
stuff. These machines typically had gigs of free storage and plenty of
idle CPU, so nobody ever noticed I was running jobs on them at all.
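If you do scrounge up a few machines, adding them is mostly configuration.
Roughly, assuming a classic pre-YARN Hadoop setup with Hadoop installed and
passwordless ssh to each box (hostnames below are made up):

    # conf/slaves on the master node -- one worker hostname per line
    devbox-01
    devbox-02
    devbox-03

    # restart the daemons from the master so the new workers join
    bin/stop-all.sh
    bin/start-all.sh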
Alternatively, you can rent a cluster of your own on AWS. For a job this
size I would expect the cost to be literally a couple of dollars.
You will find KMeans scales almost linearly with the number of boxes you
throw at it.