On Tue, May 31, 2016 at 2:20 AM, Gilles <gil...@harfang.homelinux.org> wrote:
> On Tue, 31 May 2016 01:28:48 +0300, Artem Barger wrote:
>
>> Hi,
>>
>> I've used the current KMeansPlusPlusClusterer implementation provided by
>> CM out of the box, but saw that it doesn't scale well on large data
>> volumes. One of the proposals to improve the current implementation,
>> submitted in JIRA-1330, is to support sparse big data, i.e. to make the
>> clustering algorithm work with SparseVectors.
>>
>> While working on 1330, I bumped into this paper: Elkan, Charles. "Using
>> the triangle inequality to accelerate k-means." ICML. Vol. 3. 2003. It
>> describes an additional way to scale up the performance of the kmeans
>> algorithm. The method uses the triangle inequality to avoid unnecessary
>> distance computations when a cluster center moves so little that the
>> assignment of a given point to its cluster is unaffected.
>>
>> Simple tests using PerfTestUtils show that using Elkan's algorithm
>> instead of the standard one currently provided in CM can give a
>> performance boost of an order of magnitude for significantly large
>> inputs.
>>
>
> Impressive. :-)
>
>> I've opened MATH-1371 "Provide accelerated kmeans++ implementation"
>> https://issues.apache.org/jira/browse/MATH-1371 and attached my
>> implementation there. I'll be glad to receive reviews and comments on it.
>>
>> I believe that switching to this algorithm instead of the regular one
>> could improve the quality of the kmeans solution provided by CM.
>>
>
> Are these algorithms parallelizable?
> Wouldn't such an implementation increase performance even more?
>

Yes, you can parallelize it, though that will cancel several of the
optimizations I've added. In fact, you can partition the input according to
the number of threads you'd like to use and have each thread take care of
its own chunk of data. I guess it would increase performance; I'm not sure
why the current implementation wasn't parallelized.

> Regards,
> Gilles
>
>> Best regards,
>> Artem Barger.
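
To illustrate the triangle-inequality argument mentioned above, here is a
minimal sketch of a pruned assignment step. It is not the code attached to
MATH-1371 and does not use the Commons Math API; the class and member names
(TrianglePruningSketch, assign, distanceEvaluations) are hypothetical. It
applies only the simplest bound, assuming Euclidean distance: if
d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so the distance from x to
c' need not be computed. Elkan's full algorithm additionally carries upper
and lower bounds across iterations, which this sketch leaves out.

```java
/** Hypothetical sketch: one k-means assignment pass with triangle-inequality pruning. */
final class TrianglePruningSketch {

    /** Point-to-center distances computed by the last call to assign(); for illustration only. */
    static long distanceEvaluations;

    /** Plain Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns, for each point, the index of its nearest center. */
    static int[] assign(double[][] points, double[][] centers) {
        int k = centers.length;

        // Center-to-center distances, computed once per pass.
        double[][] cc = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                cc[i][j] = cc[j][i] = distance(centers[i], centers[j]);
            }
        }

        distanceEvaluations = 0;
        int[] assignment = new int[points.length];
        for (int p = 0; p < points.length; p++) {
            int best = 0;
            double bestDist = distance(points[p], centers[0]);
            distanceEvaluations++;
            for (int c = 1; c < k; c++) {
                // If d(best, c) >= 2 * d(x, best), then d(x, c) >= d(x, best):
                // center c cannot be strictly closer, so skip the computation.
                if (cc[best][c] >= 2 * bestDist) {
                    continue;
                }
                double d = distance(points[p], centers[c]);
                distanceEvaluations++;
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            assignment[p] = best;
        }
        return assignment;
    }
}
```

The result is the same as a brute-force nearest-center search; only the
number of distance evaluations changes, and the saving grows when centers
are well separated relative to the spread of each cluster.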
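
The partitioning idea from the reply (split the input by the number of
threads and let each thread handle its own chunk) could look roughly like
the sketch below. This is an assumption about one possible shape, not the
CM implementation: the names PartitionedAssignSketch, nearest and
assignInParallel are hypothetical, and only the assignment step is
parallelized. Updating the centers would still need the per-chunk results
to be merged, and the sketch does not attempt the bound bookkeeping of the
accelerated algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: k-means assignment step split into one chunk per thread. */
final class PartitionedAssignSketch {

    /** Index of the center nearest to x (squared Euclidean distance is enough for comparison). */
    static int nearest(double[] x, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double sum = 0;
            for (int i = 0; i < x.length; i++) {
                double d = x[i] - centers[c][i];
                sum += d * d;
            }
            if (sum < bestDist) {
                bestDist = sum;
                best = c;
            }
        }
        return best;
    }

    /** Assigns every point to its nearest center using the given number of threads (>= 1). */
    static int[] assignInParallel(double[][] points, double[][] centers, int threads) {
        int[] assignment = new int[points.length];
        int chunk = (points.length + threads - 1) / threads; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(points.length, t * chunk);
                final int to = Math.min(points.length, from + chunk);
                // Each task writes only to its own index range, so no locking is needed.
                pending.add(pool.submit(() -> {
                    for (int p = from; p < to; p++) {
                        assignment[p] = nearest(points[p], centers);
                    }
                }));
            }
            // Wait for all chunks; get() also makes the workers' writes visible here.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while assigning points", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("assignment task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return assignment;
    }
}
```

A call such as assignInParallel(points, centers,
Runtime.getRuntime().availableProcessors()) would use one chunk per
available core.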