GitHub user mouendless opened a pull request:
https://github.com/apache/spark/pull/13047
Update LocalKMeans.scala
## Proposed Change
After picking each new initial center, it's unnecessary to recompute the
distances between all the points and the previously chosen centers.
Instead, this change keeps each point's distance to its closest center,
compares it with the distance to the newly picked center, and updates it only
when the new center is closer.
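To make the idea concrete, below is a minimal self-contained sketch of a K-means++ initialization that keeps a per-point cost array and, after each pick, only compares it against the newly chosen center. This is an illustration using plain arrays and assumed helper names (`squaredDistance`, `pickWeighted`), not the code committed in LocalKMeans.scala.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {

  // Plain Euclidean squared distance between two dense feature arrays.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum
  }

  // Pick an index with probability proportional to the given weights.
  def pickWeighted(weights: Array[Double], rand: Random): Int = {
    val r = rand.nextDouble() * weights.sum
    var cum = 0.0
    var i = 0
    while (i < weights.length - 1 && cum + weights(i) < r) { cum += weights(i); i += 1 }
    i
  }

  def init(points: Array[Array[Double]], k: Int, seed: Long): Array[Array[Double]] = {
    val rand = new Random(seed)
    val centers = new Array[Array[Double]](k)
    centers(0) = points(rand.nextInt(points.length))

    // costArray(p) = squared distance from point p to its closest chosen center so far.
    val costArray = points.map(squaredDistance(_, centers(0)))

    for (i <- 1 until k) {
      // Sample the next center proportionally to the current costs (K-means++).
      centers(i) = points(pickWeighted(costArray, rand))
      // Only compare against the newly picked center; distances to the
      // previously chosen centers never need to be recomputed.
      var p = 0
      while (p < points.length) {
        costArray(p) = math.min(costArray(p), squaredDistance(points(p), centers(i)))
        p += 1
      }
    }
    centers
  }
}
```

With this bookkeeping, each of the k picks costs one distance computation per point instead of i distance computations at step i, which is where the speedup for large K comes from.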
## Test result
I tested the KMeans++ method on a small dataset with 16k points, and the
whole KMeans|| pipeline on a larger one with 240k points.
The data has 4096 features, and K was tuned from 100 to 500.
The tests ran on my 4-machine cluster; I also tested a 3M-point dataset on a
larger cluster with 25 machines and got similar results, for which I do not
include the detailed curves. The results of the first two experiments are shown
below.
### Local KMeans++ test:
| Dataset | Data size | Dimension | Data type |
| --- | --- | --- | --- |
| 4m_ini_center | 16234 | 4096 | Dense |
Lloyd's Iteration = 10
(Figure omitted: running time, y-axis is time in seconds, x-axis is K.)
### On a larger dataset
One improvement shown in the graph but not committed in this file: in this
experiment I also have an improvement for computation on normalized data
(the distance is converted to the cosine distance). When the data is normalized
into (0, 1), one optimization in the original version of
util.MLUtils.fastSquaredDistance has no effect (the precision bound 2.0 *
EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never be less
than the precision in this case). Therefore I designed an early-termination
method for comparing two distances (used for findClosest). Since I do not
include this improvement in this file, please refer only to the curves without
"normalize" when comparing the results.
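For reference, an early termination of this kind could look roughly like the sketch below: stop accumulating the squared distance once the partial sum already exceeds the best distance found so far. This only illustrates the idea described above; the function name and signature are hypothetical, and the change is not part of this PR.

```scala
// Minimal sketch of an early-terminated squared distance over dense
// Array[Double] features (illustrative only, not the PR's code).
def earlyStopSquaredDistance(point: Array[Double],
                             center: Array[Double],
                             bound: Double): Double = {
  var sum = 0.0
  var i = 0
  // Stop as soon as the partial sum exceeds the best distance so far:
  // this candidate center can no longer be the closest one.
  while (i < point.length && sum < bound) {
    val d = point(i) - center(i)
    sum += d * d
    i += 1
  }
  sum
}
```

In a findClosest-style scan, `bound` would be the best distance found so far; any return value at or above `bound` means the candidate center can be skipped.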
| Dataset | Data size | Dimension | Data type |
| --- | --- | --- | --- |
| 4k24 | 243960 | 4096 | Sparse |

| Normalize | Enlarge | Initialize | Lloyd's Iteration |
| --- | --- | --- | --- |
| NO | 1 | 3 | 5 |
| YES | 10000 | 3 | 5 |
Notice: the normalized data is enlarged to preserve precision.
(Figure omitted: cost time, x-axis is K, y-axis is time in seconds.)
(Figure omitted: SE for unnormalized data between the two versions, confirming correctness.)
(Figure omitted: SE for normalized data, shown for reference; it is also correct.)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mouendless/spark patch-1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13047.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13047
----
commit 4e5bc1d8cbbce78d00c54c5b633c30f0799aabb5
Author: DLucky <[email protected]>
Date: 2016-05-11T09:03:50Z
Update LocalKMeans.scala
After picking each new initial center, it's unnecessary to recompute the
distances between all the points and the previously chosen centers. Instead we
can keep each point's distance to its closest center, compare it with the
distance to the new center, and update it when the new center is closer.
Experiment:
I tested the KMeans method on 243960 points with 4096 features, tuning K
from 100 to 500, on my 4-machine cluster.
The results show that when K is small the performance is roughly the same,
but as K increases the original version becomes much slower; for K=500 it
takes more than 30000 sec while the improved version takes less than 5000 sec.
Printing the time-consuming fractions shows that most of the old version's
time is spent in LocalKMeans.
----