On Tue, May 31, 2016 at 4:04 PM, Artem Barger <ar...@bargr.net> wrote:
> Hi,
>
> The current implementation of k-means within the CM framework inherently
> uses the algorithm published by Arthur, David, and Sergei Vassilvitskii,
> "k-means++: The advantages of careful seeding." *Proceedings of the
> eighteenth annual ACM-SIAM symposium on Discrete algorithms*. Society for
> Industrial and Applied Mathematics, 2007. However, other algorithms for
> initial seeding are available, for instance:
>
> 1. Random initialization (each center picked uniformly at random).
> 2. Canopy: https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
> 3. Bicriteria: Feldman, Dan, et al. "Bi-criteria linear-time
> approximations for generalized k-mean/median/center." *Proceedings of the
> twenty-third annual symposium on Computational geometry*. ACM, 2007.
>
> While I understand that k-means++ is the preferable option, the others
> could also be used for testing, trials, and evaluation.
>
> I'd like to propose separating the logic of seeding and clustering to
> increase the flexibility of the k-means implementation. I would be glad
> to hear your comments, pros/cons, or rejections...

I've found "Scalable K-Means" (kmeans||), as described in
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which provides a
parallelizable seeding procedure. I guess this might serve as an additional
+1 vote for separating seeding from Lloyd's iterations in the current
k-means implementation.

Best,
Artem Barger.
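
The proposed separation could be sketched roughly as follows (hypothetical names, not the actual Commons Math API): a pluggable seeding-strategy interface that the Lloyd's-iteration loop consumes, with random initialization shown as one interchangeable implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Hypothetical interface (illustration only, not the CM API): a seeding
 * strategy that is independent of the Lloyd's-iteration loop, so k-means++,
 * canopy, bicriteria, or kmeans|| seeding could be swapped in freely.
 */
interface SeedingStrategy {
    /** Pick k initial centers from the given points. */
    List<double[]> chooseInitialCenters(List<double[]> points, int k);
}

/** Option 1 from the list above: each center picked uniformly at random. */
class RandomSeeding implements SeedingStrategy {
    private final Random rng;

    RandomSeeding(long seed) {
        this.rng = new Random(seed);
    }

    @Override
    public List<double[]> chooseInitialCenters(List<double[]> points, int k) {
        // Sample k distinct points without replacement.
        List<double[]> pool = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            centers.add(pool.remove(rng.nextInt(pool.size())));
        }
        return centers;
    }
}
```

A clusterer would then accept a SeedingStrategy at construction time, so the seeding procedure can be chosen (or tested in isolation) without touching the iteration code.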