Re: distributed RandomSeedGenerator

B Kersbergen Thu, 22 Aug 2013 07:58:23 -0700

Hi Ted,

The streaming k-means in Mahout is very sweet, but I need fuzzy k-means.
Converting Mahout's seed generation into a distributed algorithm allowed me
to start fuzzy clustering gigabytes of data in a few seconds instead of hours.
Maybe other Mahout users will find this interesting as well.


You can find my changes here:
https://github.com/bkersbergen/mahout

Kind regards,
Barrie Kersbergen



2013/8/15 Ted Dunning <[email protected]>

> Look at the streaming k-means implementation. This heinous seeding
> algorithm goes away entirely.
>
> Sent from my iPhone
>
> On Aug 14, 2013, at 13:35, B Kersbergen <[email protected]> wrote:
>
> > Hi,
> >
> > When (f)kmeans clustering 'large' or 'big' data sets with 'k' specified,
> > it takes about 0.5 to 12 hours, depending on the characteristics of my
> > data set, before my Mahout job is submitted to my Hadoop cluster.
> > The Mahout source code shows that the big data set is downloaded to my
> > local machine (over wifi, running in Vagrant) and that the centroids are
> > sampled in a single thread and pushed to HDFS.
> > To benefit from MapReduce and data locality, I've created a
> > RandomSeedGeneratorDriver and integrated it into the MapReduce version
> > of (f)kmeans clustering.
> > This version does the sampling in a few minutes on a small Hadoop
> > cluster.
> >
> > If you like, I would be happy to share my code.
> >
> > There are several ways to implement this and perhaps you don't favor its
> > current implementation. I'd be happy to discuss this and of course make
> > changes.
> >
> > Kind regards,
> > Barrie Kersbergen
>
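
For readers following the thread, here is a minimal sketch of one common way
to sample k seed vectors where the data lives instead of pulling it to a
single machine. It is not the code from the linked repository and it does not
use Mahout's or Hadoop's APIs. The class and method names
(DistributedSeedSketch, sampleSplit, mergeCandidates) are hypothetical, and
the technique shown (tag every vector with a random priority, keep the k
smallest per split, then keep the k smallest overall) is an assumption about
how such a sampler could work, not a description of RandomSeedGeneratorDriver.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Sketch only: pick k seeds uniformly at random from data spread over several splits. */
final class DistributedSeedSketch {

    /** An input vector tagged with the random priority drawn for it. */
    static final class Tagged {
        final double priority;
        final double[] vector;

        Tagged(double priority, double[] vector) {
            this.priority = priority;
            this.vector = vector;
        }
    }

    /** Keeps the k entries with the smallest priorities offered so far. */
    static final class BottomK {
        private final int k;
        // Max-heap on priority, so the worst current candidate sits at the head.
        private final PriorityQueue<Tagged> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Tagged t) -> t.priority).reversed());

        BottomK(int k) {
            this.k = k;
        }

        void offer(Tagged t) {
            if (k <= 0) {
                return;
            }
            if (heap.size() < k) {
                heap.add(t);
            } else if (t.priority < heap.peek().priority) {
                heap.poll(); // evict the candidate with the largest priority
                heap.add(t);
            }
        }

        List<Tagged> contents() {
            return new ArrayList<>(heap);
        }
    }

    /** "Map" side: scans one split locally and emits at most k candidates. */
    static List<Tagged> sampleSplit(Iterable<double[]> split, int k, Random rng) {
        BottomK local = new BottomK(k);
        for (double[] vector : split) {
            local.offer(new Tagged(rng.nextDouble(), vector));
        }
        return local.contents();
    }

    /** "Reduce" side: merges the per-split candidates into the final k seeds. */
    static List<double[]> mergeCandidates(List<List<Tagged>> perSplit, int k) {
        BottomK global = new BottomK(k);
        for (List<Tagged> candidates : perSplit) {
            for (Tagged t : candidates) {
                global.offer(t);
            }
        }
        List<double[]> seeds = new ArrayList<>();
        for (Tagged t : global.contents()) {
            seeds.add(t.vector);
        }
        return seeds;
    }
}
```

In a real job, sampleSplit would correspond to the work of one mapper and
mergeCandidates to a single reducer, so only k candidates per split cross the
network. Each split should get its own independently seeded Random; with
independent priorities, the k smallest overall form a uniform sample without
replacement, and any vector among them is necessarily among the k smallest of
its own split.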
