How are you passing the feature vectors to KMeans? Is each one in 2-D space, or is it a 1-D array?
Did you try using Streaming KMeans? Would you be able to paste your code here?

On 29 April 2015 at 17:23, Sam Stoelinga <sammiest...@gmail.com> wrote:

> Hi Sparkers,
>
> I am trying to run MLlib KMeans on a large dataset (50+ GB of data) and a
> large K, but I've encountered the following issues:
>
>    - The Spark driver runs out of memory and dies because collect gets
>    called as part of KMeans, which loads all data back into the driver's
>    memory.
>    - At the end there is a LocalKMeans class which runs KMeansPlusPlus on
>    the Spark driver. Why isn't this distributed? It spends a long time
>    here and has the same problem as point 1: it requires loading the data
>    onto the driver.
>    Also, while LocalKMeans is running on the driver I am seeing lots of:
>    15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus initialization ran out of distinct points for centers. Using duplicate point for center k = 222
>    - Has the above behaviour been like this in previous releases? I
>    remember running KMeans before without too many problems.
>
> Looking forward to hearing you point out my stupidity or provide
> work-arounds that could make Spark KMeans work well on large datasets.
>
> Regards,
> Sam Stoelinga
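
Regarding the "ran out of distinct points" warning in the quoted message: one thing that may be worth checking is whether the input contains at least K distinct points. I am only guessing at the cause, so treat this as a diagnostic idea rather than an explanation. Below is a minimal sketch in plain Python (not Spark code) that parses each record into a flat 1-D feature tuple and counts the distinct points. The helper names `parse_point` and `count_distinct_points`, the whitespace-separated input format, and the sample values are all hypothetical.

```python
from typing import Iterable, Tuple


def parse_point(line: str) -> Tuple[float, ...]:
    """Parse one whitespace-separated record into a flat (1-D) feature tuple.

    The input format is an assumption made for this sketch.
    """
    return tuple(float(x) for x in line.split())


def count_distinct_points(lines: Iterable[str]) -> int:
    """Count distinct feature vectors, skipping blank lines."""
    return len({parse_point(line) for line in lines if line.strip()})


# Made-up sample: five records, but only three distinct points.
sample = ["0.0 0.0", "0.0 0.0", "1.0 1.0", "9.0 8.0", "1.0 1.0"]
k = 4  # hypothetical number of clusters

distinct = count_distinct_points(sample)
if distinct < k:
    # With fewer distinct points than clusters, some centers must coincide.
    print(f"only {distinct} distinct points for k = {k}")
```

On the full dataset the same check would have to run on the cluster rather than on the driver, for example on a sample of the data or with the RDD `distinct()` and `count()` operations.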