Re: MLib KMeans on large dataset issues

Jeetendra Gangele Wed, 29 Apr 2015 05:03:08 -0700

How are you passing the feature vectors to KMeans?
Is it a 2-D array or a 1-D array?


Did you try using StreamingKMeans?

Will you be able to paste the code here?

On 29 April 2015 at 17:23, Sam Stoelinga <sammiest...@gmail.com> wrote:

> Hi Sparkers,
>
> I am trying to run MLlib KMeans on a large dataset (50+ GB of data) and a
> large K, but I've encountered the following issues:
>
>
>    - The Spark driver runs out of memory and dies because collect gets called
>    as part of KMeans, which loads all the data back into the driver's memory.
>    - At the end there is a LocalKMeans class which runs KMeansPlusPlus on
>    the Spark driver. Why isn't this distributed? It spends a long time
>    here, and it has the same problem as point 1: it requires loading the
>    data onto the driver.
>    Also, while LocalKMeans is running on the driver, I see lots of:
>    15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
>    initialization ran out of distinct points for centers. Using duplicate
>    point for center k = 222
>    - Has the above behaviour been like this in previous releases? I
>    remember running KMeans before without too many problems.
>
> Looking forward to hearing you point out my stupidity or provide workarounds
> that could make Spark KMeans work well on large datasets.
>
> Regards,
> Sam Stoelinga
>
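
On the "ran out of distinct points" warning quoted above: a minimal, self-contained sketch of k-means++ style seeding may help show when that situation can arise. This is not MLlib's LocalKMeans code. The function names (sq_dist, kmeans_pp_seed) and the toy data are hypothetical and chosen only for illustration. The sketch assumes the usual D^2 sampling rule, where a point is picked with probability proportional to its squared distance from the nearest center chosen so far. Once every candidate point coincides with some chosen center, all weights are zero and the only option left is to reuse a point.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, seed=0):
    """Pick k initial centers using D^2 (k-means++ style) sampling.

    Returns (centers, duplicates). `duplicates` counts how many centers
    had to reuse an already-chosen point because no distinct point was left.
    Illustrative only; not the MLlib implementation.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    duplicates = 0
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point already coincides with a center, so fall back
            # to a duplicate point.
            centers.append(rng.choice(points))
            duplicates += 1
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers, duplicates


if __name__ == "__main__":
    # Four candidate points, but only three distinct values.
    candidates = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    centers, duplicates = kmeans_pp_seed(candidates, k=5)
    print("centers:", centers)
    print("duplicate centers used:", duplicates)
```

With three distinct candidate points and k = 5, two of the five centers are necessarily duplicates. In this sketch, asking for more centers than there are distinct candidate points is what triggers the fallback.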
