I am not using the "runs" parameter anyway, but I see your point. If you could point out any modifications to the minimal example I posted, I would be more than happy to try them!
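For readers following the thread, here is a rough, single-machine sketch of why k-means|| initialization takes more passes over the data than random initialization. It is not the Spark MLlib code: the function names, the oversampling factor of `2 * k`, and the default of 5 rounds are assumptions made for illustration, and the final step that reduces the candidates to exactly k centers is left out.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_init(points, k, seed=0):
    """Random initialization: pick k distinct points in one step."""
    return random.Random(seed).sample(points, k)


def kmeans_parallel_candidates(points, k, rounds=5, oversample=None, seed=0):
    """Oversampling rounds of k-means|| (illustrative, not Spark's code).

    Returns the candidate centers and the number of full passes made over
    the data. In a cluster, each pass is a distributed job.
    """
    rng = random.Random(seed)
    factor = oversample if oversample is not None else 2 * k  # assumed value
    centers = [rng.choice(points)]
    passes = 0
    for _ in range(rounds):
        # One full pass: distance from every point to its nearest center.
        dists = [min(sq_dist(p, c) for c in centers) for p in points]
        passes += 1
        total = sum(dists)
        if total == 0:
            break  # every point already coincides with a center
        # Keep each point with probability proportional to its distance,
        # so far-away points are more likely to become candidates.
        centers.extend(
            p for p, d in zip(points, dists)
            if rng.random() < min(1.0, factor * d / total)
        )
    return centers, passes
```

Random initialization touches the data once, while the loop above rescans all points in every round, which is consistent with each init iteration being slow on a large dataset.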
On Fri, Sep 2, 2016 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
> Eh... more specifically, since Spark 2.0 the "runs" parameter in the
> KMeans mllib implementation has been ignored and is always 1. This
> means a lot of code that wraps this stuff up in arrays could be
> simplified quite a lot. I'll take a shot at optimizing this code and
> see if I can measure an effect.
>
> On Fri, Sep 2, 2016 at 6:33 PM, Sean Owen <so...@cloudera.com> wrote:
> > Yes it works fine, though each iteration of the parallel init step is
> > slow indeed -- about 5 minutes on my cluster. Given your question I
> > think you are actually 'hanging' because resources are being killed.
> >
> > I think this init may need some love and optimization. For example, I
> > think treeAggregate might work better. An Array[Float] may be just
> > fine and cut down memory usage, etc.
> >
> > On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
> > <georgesamaras...@gmail.com> wrote:
> >> So you were able to execute the minimal example I posted?
> >>
> >> I mean that the application doesn't progress, it hangs (I would be OK
> >> if it was just slower). It doesn't seem to me to be a configuration
> >> issue.
> >>
> >> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> Hm, what do you mean? k-means|| init is certainly slower because it's
> >>> making passes over the data in order to pick better initial centroids.
> >>> The idea is that you might then spend fewer iterations converging
> >>> later, and converge to a better clustering.
> >>>
> >>> Your problem doesn't seem to be related to scale. You aren't even
> >>> running out of memory it seems. Your memory settings are causing YARN
> >>> to kill the executors for using more memory than they advertise. That
> >>> could mean it never proceeds if this happens a lot.
> >>>
> >>> I don't have any problems with it.
> >>>
> >>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
> >>> <georgesamaras...@gmail.com> wrote:
> >>> > Dear all,
> >>> >
> >>> > the random initialization works well, but the default
> >>> > initialization is k-means|| and has made me struggle. Also, I had
> >>> > heard people one year ago struggling with it too, and everybody
> >>> > would just skip it and use random, but I cannot keep it inside me!
> >>> >
> >>> > I have posted a minimal example here..
> >>> >
> >>> > Please advise,
> >>> > George Samaras
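As a side note on the point about YARN killing executors that use more memory than they advertise, the arithmetic can be sketched as below. The constants are assumptions taken from what I understand to be the Spark 2.0 defaults for `spark.yarn.executor.memoryOverhead` (10% of executor memory, with a 384 MB minimum), and the helper name is hypothetical. YARN's rounding to its own allocation increments is not modeled, so treat this as an approximation and check the documentation for your version.

```python
# Assumed Spark 2.0-on-YARN defaults; verify against your version's docs.
OVERHEAD_FACTOR = 0.10
OVERHEAD_MIN_MB = 384


def yarn_container_mb(executor_memory_mb, memory_overhead_mb=None):
    """Approximate memory (MB) requested from YARN for one executor.

    executor_memory_mb corresponds to spark.executor.memory and
    memory_overhead_mb to spark.yarn.executor.memoryOverhead. When the
    overhead is not set, the assumed default is applied.
    """
    if memory_overhead_mb is None:
        memory_overhead_mb = max(
            int(OVERHEAD_FACTOR * executor_memory_mb), OVERHEAD_MIN_MB
        )
    return executor_memory_mb + memory_overhead_mb


# Example: a 4 GB executor with the default overhead, then with 2 GB set
# explicitly to leave more room for off-heap memory.
default_request = yarn_container_mb(4096)
larger_request = yarn_container_mb(4096, memory_overhead_mb=2048)
```

If the executor's real footprint (JVM heap plus off-heap memory, and Python worker processes if PySpark is used) exceeds the requested container size, YARN may kill it, so raising the overhead is one setting that might be worth trying.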