Hi Ted,

The streaming k-means in Mahout is very sweet, but I need fuzzy k-means. Converting Mahout's seeding step into a distributed algorithm allowed me to start fuzzy clustering gigabytes of data in a few seconds instead of hours. Maybe other Mahout users will find this interesting as well.
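
To illustrate the general idea of distributed seeding, here is a minimal sketch in Python. It is not the Java code from the repository, and the function names (`map_sample`, `reduce_sample`, `sample_seeds`) are made up for this example. It assumes the simplest possible scheme: every mapper tags the vectors of its own split with a random key and keeps only the k smallest, and a single reducer merges those small partial results and again keeps the k smallest. That gives a uniform random sample of k vectors without ever pulling the full dataset onto one machine.

```python
import heapq
import random


def map_sample(split, k, rng):
    """Mapper side (hypothetical): tag each vector in this split with a
    random key and keep only the k vectors with the smallest keys."""
    tagged = ((rng.random(), vec) for vec in split)
    return heapq.nsmallest(k, tagged, key=lambda t: t[0])


def reduce_sample(partials, k):
    """Reducer side (hypothetical): merge the per-split candidates and
    keep the k with the smallest keys overall."""
    merged = (t for part in partials for t in part)
    best = heapq.nsmallest(k, merged, key=lambda t: t[0])
    return [vec for _, vec in best]


def sample_seeds(splits, k, seed=None):
    """Simulate the map and reduce phases locally over a list of splits."""
    rng = random.Random(seed)
    partials = [map_sample(split, k, rng) for split in splits]
    return reduce_sample(partials, k)


if __name__ == "__main__":
    # Three "input splits" of 2-dimensional vectors.
    splits = [
        [(0.0, 0.1), (0.2, 0.3), (0.4, 0.5)],
        [(1.0, 1.1), (1.2, 1.3)],
        [(2.0, 2.1), (2.2, 2.3), (2.4, 2.5), (2.6, 2.7)],
    ]
    print(sample_seeds(splits, k=3, seed=42))
```

Each mapper emits at most k candidates, so the reducer only ever sees (number of splits) x k vectors, regardless of how large the input is.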

You can find my changes here: https://github.com/bkersbergen/mahout

Kind regards,
Barrie Kersbergen

2013/8/15 Ted Dunning <[email protected]>

> Look at the streaming k means implementation. This heinous seeding
> algorithm goes away entirely.
>
> Sent from my iPhone
>
> On Aug 14, 2013, at 13:35, B Kersbergen <[email protected]> wrote:
>
> > Hi,
> >
> > When (f)kmeans clustering 'large' or 'big' data-sets with 'k' specified,
> > depending on the characteristics of my dataset it takes about 0.5 to 12
> > hours before my Mahout job is submitted to my Hadoop cluster.
> > The Mahout source code shows that the big dataset is downloaded to my
> > local machine (over wifi, running in vagrant) and centroids are sampled
> > in a single thread and pushed to hdfs.
> > To benefit from MapReduce and data locality, I've created a
> > RandomSeedGeneratorDriver and integrated this in the map reduce version
> > of (f)kmeans clustering.
> > This version does the sampling in a few minutes on a small Hadoop
> > cluster.
> >
> > If you like, I would be happy to share my code.
> >
> > There are several ways to implement this and perhaps you don't favor its
> > current implementation. I'd be happy to discuss this and of course make
> > changes.
> >
> > Kind regards,
> > Barrie Kersbergen
