Look at the streaming k-means implementation. This heinous seeding algorithm goes away entirely.
Sent from my iPhone

On Aug 14, 2013, at 13:35, B Kersbergen <[email protected]> wrote:

> Hi,
>
> When (f)kmeans clustering 'large' or 'big' data-sets with 'k' specified,
> depending on the characteristics of my dataset it takes about 0.5 to 12
> hours before my Mahout job is being submitted to my Hadoop cluster.
> The Mahout source code shows that the big dataset is downloaded to my local
> machine (over wifi, running in vagrant) and centroids are sampled in a
> single thread and pushed to hdfs.
> To benefit from MapReduce and data locality, I've created a
> RandomSeedGeneratorDriver and integrated this in the map reduce version of
> (f)kmeans clustering.
> This version does the sampling in a few minutes on a small Hadoop cluster.
>
> If you like, I would be happy to share my code.
>
> There are several ways to implement this and perhaps you don't favor its
> current implementation. I'd be happy to discuss this and of course make
> changes.
>
> Kind regards,
> Barrie Kersbergen
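
For readers following the thread, here is a minimal sketch of one way the quoted idea (sampling k seed centroids in parallel rather than in a single local thread) could work. It is not Barrie's RandomSeedGeneratorDriver and not Mahout code. The function names `sample_partition` and `merge_samples` are hypothetical, and plain Python lists stand in for HDFS splits. Each "mapper" tags its points with a random key and keeps the k smallest keys. A single "reducer" merges those candidates and keeps the k smallest overall, which gives a uniform sample without replacement.

```python
import heapq
import random


def sample_partition(points, k, rng):
    """Mapper side (hypothetical): tag every point with a random key
    and keep only the k candidates with the smallest keys."""
    tagged = ((rng.random(), p) for p in points)
    return heapq.nsmallest(k, tagged, key=lambda t: t[0])


def merge_samples(partials, k):
    """Reducer side (hypothetical): merge the per-partition candidates
    and keep the k smallest keys overall; these are the seed centroids."""
    candidates = (t for part in partials for t in part)
    merged = heapq.nsmallest(k, candidates, key=lambda t: t[0])
    return [p for _, p in merged]


# Toy data: 4 partitions of 100 two-dimensional points each.
rng = random.Random(42)
partitions = [
    [(float(i), float(i)) for i in range(j * 100, (j + 1) * 100)]
    for j in range(4)
]
partials = [sample_partition(part, 3, rng) for part in partitions]
seeds = merge_samples(partials, 3)
```

Only k candidates per partition reach the reducer, so the full dataset never has to be pulled to one machine.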
