Look at the streaming k-means implementation. This heinous seeding algorithm
goes away entirely.

On Aug 14, 2013, at 13:35, B Kersbergen <[email protected]> wrote:

> Hi,
> 
> When (f)kmeans clustering 'large' or 'big' datasets with 'k' specified, it
> takes about 0.5 to 12 hours, depending on the characteristics of my dataset,
> before my Mahout job is submitted to my Hadoop cluster.
> The Mahout source code shows that the big dataset is downloaded to my local
> machine (over wifi, running in vagrant) and centroids are sampled in a
> single thread and pushed to hdfs.
> To benefit from MapReduce and data locality, I've created a
> RandomSeedGeneratorDriver and integrated this in the map reduce version of
> (f)kmeans clustering.
> This version does the sampling in a few minutes on a small Hadoop cluster.
> 
> If you like, I would be happy to share my code.
> 
> There are several ways to implement this and perhaps you don't favor its
> current implementation. I'd be happy to discuss this and of course make
> changes.
> 
> Kind regards,
> Barrie Kersbergen
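
For readers following the thread, here is a minimal sketch of one way the seed sampling described above could be spread across input splits instead of running in a single thread. It is not the RandomSeedGeneratorDriver mentioned in the message (that code was not posted) and it is not Mahout's API; the class SeedSampler and its methods are hypothetical names, and plain Java collections stand in for Hadoop mappers and a reducer. The idea: tag every point with a random priority, keep the k smallest priorities per split, then keep the k smallest across all splits, which yields a uniform sample of k points without replacement.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical illustration only; not Mahout code.
public class SeedSampler {

    /** A candidate seed tagged with a random priority. */
    static final class Tagged {
        final double priority;
        final double[] point;

        Tagged(double priority, double[] point) {
            this.priority = priority;
            this.point = point;
        }
    }

    /** "Map" side: keep the k points with the smallest random priorities in one split. */
    static List<Tagged> sampleSplit(Iterable<double[]> split, int k, Random rng) {
        // Max-heap on priority, so the worst kept candidate is evicted first.
        PriorityQueue<Tagged> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Tagged t) -> t.priority).reversed());
        for (double[] p : split) {
            heap.add(new Tagged(rng.nextDouble(), p));
            if (heap.size() > k) {
                heap.poll();
            }
        }
        return new ArrayList<>(heap);
    }

    /** "Reduce" side: merge per-split candidates and keep the k smallest priorities overall. */
    static List<double[]> merge(List<List<Tagged>> partials, int k) {
        List<Tagged> all = new ArrayList<>();
        for (List<Tagged> part : partials) {
            all.addAll(part);
        }
        all.sort(Comparator.comparingDouble((Tagged t) -> t.priority));
        List<double[]> seeds = new ArrayList<>();
        for (Tagged t : all.subList(0, Math.min(k, all.size()))) {
            seeds.add(t.point);
        }
        return seeds;
    }
}
```

Each split sends at most k candidates to the merge step, so only a small amount of data moves between nodes and the full dataset never has to leave the cluster.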
