Hi Sparkers,

I am trying to run MLlib KMeans on a large dataset (50+ GB of data) with a large K, but I've encountered the following issues:
- The Spark driver runs out of memory and dies because collect gets called as part of KMeans, which loads all the data back into the driver's memory.
- At the end there is a LocalKMeans class which runs KMeansPlusPlus on the Spark driver. Why isn't this distributed? It spends a long time in this step, and it has the same problem as the first point: it requires loading the data onto the driver. While LocalKMeans is running on the driver, I am also seeing lots of warnings like:
    15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus initialization ran out of distinct points for centers. Using duplicate point for center k = 222
- Has the behaviour above been like this in previous releases? I remember running KMeans before without too many problems.

I look forward to hearing you point out my stupidity or provide workarounds that could make Spark KMeans work well on large datasets.

Regards,
Sam Stoelinga