On Tue, 31 May 2016 02:42:03 +0300, Artem Barger wrote:
On Tue, May 31, 2016 at 2:20 AM, Gilles <gil...@harfang.homelinux.org>
wrote:

On Tue, 31 May 2016 01:28:48 +0300, Artem Barger wrote:

Hi,

I've used the current KMeansPlusPlusClusterer implementation provided by
CM out of the box, but saw that it doesn't scale well on large data
volumes. One of the proposals to improve the current implementation,
submitted in JIRA-1330, is to provide support for sparse big data, i.e.
to make the clustering algorithm work with SparseVectors.

While working on 1330, I came across the paper: Elkan, Charles. "Using
the triangle inequality to accelerate k-means." ICML. Vol. 3. 2003. It
describes an additional way to scale up the performance of the k-means
algorithm. The method uses the triangle inequality to avoid unnecessary
distance computations caused by small movements of a cluster center that
do not affect the assignment of a given point to its cluster.
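
To make the idea concrete, here is a minimal sketch of one of the
bounds the paper relies on: if d(c, c') >= 2 * d(x, c), then
d(x, c') >= d(x, c), so the distance from x to c' never needs to be
computed. This is not the code attached to the JIRA issue and it omits
the per-point upper and lower bounds that the full method maintains
across iterations. The class and method names are made up for
illustration.

```java
class TriangleInequalityPruning {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Pairwise distances between all centers, computed once per iteration. */
    static double[][] centerDistances(double[][] centers) {
        int k = centers.length;
        double[][] dist = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                dist[i][j] = distance(centers[i], centers[j]);
                dist[j][i] = dist[i][j];
            }
        }
        return dist;
    }

    /**
     * Returns the index of the center nearest to the point.
     * skipped[0] is incremented for every point-to-center distance
     * that the triangle inequality allowed us to avoid.
     */
    static int nearestCenter(double[] point, double[][] centers,
                             double[][] centerDist, int[] skipped) {
        int best = 0;
        double bestDist = distance(point, centers[0]);
        for (int j = 1; j < centers.length; j++) {
            // If the two centers are at least twice as far apart as the
            // point is from its current best center, center j cannot be closer.
            if (centerDist[best][j] >= 2.0 * bestDist) {
                skipped[0]++;
                continue;
            }
            double d = distance(point, centers[j]);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 0}, {0, 10}};
        double[][] centerDist = centerDistances(centers);
        int[] skipped = {0};
        int idx = nearestCenter(new double[] {1, 1}, centers, centerDist, skipped);
        System.out.println("nearest=" + idx + ", skipped=" + skipped[0]);
    }
}
```

For the point (1, 1) only the distance to the first center is computed;
the other two are skipped because both centers lie 10 away from the
first one, which is more than twice the point's distance to it.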

Simple tests using PerfTestUtils show that using Elkan's algorithm
instead of the standard one currently provided in CM could achieve a
performance boost of an order of magnitude for significantly large inputs.


Impressive. :-)

I've opened MATH-1371 "Provide accelerated kmeans++ implementation"
https://issues.apache.org/jira/browse/MATH-1371 and attached my
implementation there. I will be glad to receive reviews and comments on
it.

I believe that switching to this algorithm instead of the regular one
could improve the quality of the k-means solution that CM provides.


Are these algorithms parallelizable?
Wouldn't such an implementation increase performance even more?


Yes, you can parallelize it, though it will cancel several optimizations
I've added.
In fact, you can partition the input according to the number of threads
you'd like to use and have each thread take care of its own data chunk.

I guess it will increase performance; I'm not sure why the current
implementation wasn't parallelized.
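
As a rough illustration of the partitioning described above, the sketch
below splits the points into one contiguous chunk per thread and lets
each thread assign its chunk to the nearest center. It covers only the
assignment step; recomputing the centers would additionally require
merging per-thread partial sums. It is not taken from CM, and the class
and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedAssignment {

    /** Index of the center with the smallest squared Euclidean distance. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestSq = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centers.length; j++) {
            double sq = 0.0;
            for (int i = 0; i < p.length; i++) {
                double d = p[i] - centers[j][i];
                sq += d * d;
            }
            if (sq < bestSq) {
                bestSq = sq;
                best = j;
            }
        }
        return best;
    }

    /** Assigns every point to its nearest center using the given number of threads. */
    static int[] assign(double[][] points, double[][] centers, int threads) {
        int[] assignment = new int[points.length];
        int chunk = (points.length + threads - 1) / threads; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(points.length, t * chunk);
                final int to = Math.min(points.length, from + chunk);
                // Each task writes only to its own index range, so no locking is needed.
                futures.add(pool.submit(() -> {
                    for (int i = from; i < to; i++) {
                        assignment[i] = nearest(points[i], centers);
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for the chunk and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {9, 9}, {10, 10}, {0, 1}};
        double[][] centers = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(assign(points, centers, 2)));
    }
}
```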


You are most welcome to enhance it. ;-)

Gilles


