nfantone wrote:
Well, I grew tired of watching the whole thing run and stopped it. I then
started another test, this time using a smaller dataset of 3 GB, and it is
still taking way too long.
See inline comments.
You are only specifying a single reducer. Try increasing that as below.
I did. I set it to my K value (200).
Way too big given your single-node operation. See below.
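On one box you want something in the low single digits, not one reducer per
cluster. A rough sketch of what I mean, assuming the old
org.apache.hadoop.mapred API (the driver class name and paths are
placeholders, not your actual job):

    // Sketch only: keep the reducer count small on a single node.
    // With 200 reducers on one machine you mostly pay task setup cost.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class KMeansIterationJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(KMeansIterationJob.class);
        conf.setJobName("kmeans-iteration");

        // 1-2 reducers per core is plenty; the reducer count should
        // reflect your hardware, not your k value.
        conf.setNumReduceTasks(2);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Your mapper/reducer classes would be set here as usual.
        JobClient.runJob(conf);
      }
    }

If your driver goes through ToolRunner, passing -D mapred.reduce.tasks=2 on
the command line has the same effect.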
No, the number of nodes is the number of machines (computers) in your cluster.
You did not say how many nodes you are running on.
I'm running and compiling the application on a single desktop computer at
work, and that isn't likely to change after the development process is
finished.
This is the root of your problem: you only have a single node in your
cluster. Running Hadoop in this configuration is possible, but it will
be much slower than if you had more machines. Perhaps you can interest
some of your colleagues in donating storage and cycles on their machines
to your effort. When I was at CollabNet, I got a dozen developers'
machines running in a cluster so I could test out the early clustering
stuff. These machines typically had gigs of free storage and plenty of
idle CPU, so nobody ever noticed I was running jobs on them at all.
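If you do scrounge up a few machines, adding them is mostly configuration.
Roughly, assuming a classic pre-YARN Hadoop setup with Hadoop installed and
passwordless ssh to each box (hostnames below are made up):

    # conf/slaves on the master node -- one worker hostname per line
    devbox-01
    devbox-02
    devbox-03

    # restart the daemons from the master so the new workers join
    bin/stop-all.sh
    bin/start-all.sh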
Alternatively, you can rent a cluster of your own on AWS. For a job this
size I would expect the cost to be literally a couple of dollars.
You will find KMeans scales almost linearly with the number of boxes you
throw at it.