Thanks for your response. I will have a look at its implementation.

Regards,
Barrie
2013/8/15 Ted Dunning <[email protected]>

> Look at the streaming k-means implementation. This heinous seeding
> algorithm goes away entirely.
>
> Sent from my iPhone
>
> On Aug 14, 2013, at 13:35, B Kersbergen <[email protected]> wrote:
>
> > Hi,
> >
> > When (f)kmeans clustering 'large' or 'big' data-sets with 'k' specified,
> > it takes about 0.5 to 12 hours, depending on the characteristics of my
> > dataset, before my Mahout job is submitted to my Hadoop cluster.
> > The Mahout source code shows that the big dataset is downloaded to my
> > local machine (over wifi, running in vagrant) and that the centroids are
> > sampled in a single thread and pushed to HDFS.
> > To benefit from MapReduce and data locality, I've created a
> > RandomSeedGeneratorDriver and integrated it into the MapReduce version
> > of (f)kmeans clustering.
> > This version does the sampling in a few minutes on a small Hadoop
> > cluster.
> >
> > If you like, I would be happy to share my code.
> >
> > There are several ways to implement this, and perhaps you don't favor its
> > current implementation. I'd be happy to discuss this and, of course, make
> > changes.
> >
> > Kind regards,
> > Barrie Kersbergen
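
For readers of this thread, here is a minimal sketch of how sampling k seed centroids could be spread over input splits instead of running in a single thread on one machine. It is not the RandomSeedGeneratorDriver mentioned in the quoted message (that code was not posted) and it does not use Mahout's or Hadoop's API. The function names are hypothetical, and the "map" and "reduce" steps are plain Python functions standing in for the real tasks. The technique assumed here is random-key sampling: each split tags its points with a uniform random key and keeps its k smallest keys, then a single merge step keeps the k smallest keys overall, which gives a uniform sample without replacement.

```python
import heapq
import random


def sample_split(points, k, rng):
    """Map side (hypothetical): tag each point with a random key and
    keep only the k points with the smallest keys from this split."""
    tagged = ((rng.random(), point) for point in points)
    # Compare on the random key only, so points never need to be orderable.
    return heapq.nsmallest(k, tagged, key=lambda pair: pair[0])


def merge_samples(partials, k):
    """Reduce side (hypothetical): merge the per-split candidates and
    keep the k smallest keys overall as the seed centroids."""
    candidates = (pair for partial in partials for pair in partial)
    best = heapq.nsmallest(k, candidates, key=lambda pair: pair[0])
    return [point for _, point in best]


def seed_centroids(splits, k, seed=0):
    """Simulate the job locally: one sample_split call per input split,
    followed by a single merge_samples call."""
    rng = random.Random(seed)
    partials = [sample_split(split, k, rng) for split in splits]
    return merge_samples(partials, k)


if __name__ == "__main__":
    # Three uneven "splits" of 2-d points, as a stand-in for HDFS blocks.
    splits = [
        [(float(i), float(i)) for i in range(0, 50)],
        [(float(i), float(i)) for i in range(50, 120)],
        [(float(i), float(i)) for i in range(120, 125)],
    ]
    for centroid in seed_centroids(splits, k=10, seed=42):
        print(centroid)
```

Each split sends at most k candidates to the merge step, so the data that leaves a node is bounded by k per split rather than by the size of the dataset.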
