Hi Sparkers,

I am trying to run MLlib KMeans on a large dataset (50+ GB of data) with a
large K, but I've encountered the following issues:


   - The Spark driver runs out of memory and dies because collect() gets
   called as part of KMeans, which loads all the data back into the driver's
   memory.
   - At the end there is a LocalKMeans class which runs KMeansPlusPlus on
   the Spark driver. Why isn't this distributed? It spends a long time here,
   and it has the same problem as point 1: it requires loading the data onto
   the driver.
   Also, while LocalKMeans is running on the driver I'm seeing lots of:
   15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
   initialization ran out of distinct points for centers. Using duplicate
   point for center k = 222
   - Has the above behaviour been like this in previous releases? I
   remember running KMeans before without too many problems.
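For what it's worth, the warning in point 2 seems to happen whenever the
candidate set has fewer distinct points than K, so the seeding step has no
choice but to reuse a point. Here is a minimal plain-Python sketch of that
situation (hypothetical toy data, not the actual MLlib implementation):

```python
# Toy illustration of the LocalKMeans warning: k-means++ style seeding
# needs k distinct points, and when the candidates contain fewer distinct
# values than k it falls back to duplicating an existing center.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]  # only 3 distinct
k = 4

centers = []
for i in range(k):
    # pick any candidate not already chosen as a center
    remaining = [p for p in points if p not in centers]
    if remaining:
        centers.append(remaining[0])
    else:
        # analogous to "Using duplicate point for center k = i"
        print(f"ran out of distinct points; duplicating center for k = {i}")
        centers.append(centers[-1])
```

With only 3 distinct points and k = 4, the last center is a duplicate,
which matches the kind of log line above.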

Looking forward to hearing you point out my stupidity or suggest
work-arounds that could make Spark KMeans work well on large datasets.

Regards,
Sam Stoelinga
