GitHub user mouendless opened a pull request:

    https://github.com/apache/spark/pull/13047

    Update LocalKMeans.scala

    ## Proposed Change
    
    After picking each new initial center, it is unnecessary to recompute the distances between all the points and the previously chosen centers.
    Instead, this change keeps, for each point, the distance to its closest center, compares it with the distance to the newly picked center, and updates the cached value only when the new center is closer.
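    
    To make the idea concrete, here is a minimal, self-contained sketch of k-means++ seeding with this incremental cost update (hypothetical helper names and dense Array[Double] points; a sketch of the idea, not the actual LocalKMeans.scala code):
    
    ```scala
    import scala.util.Random
    
    object KMeansPlusPlusSketch {
      // Plain squared Euclidean distance between two dense points.
      def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
        var sum = 0.0
        var i = 0
        while (i < a.length) {
          val d = a(i) - b(i)
          sum += d * d
          i += 1
        }
        sum
      }
    
      // After each new center is chosen, every point's cached cost is
      // compared against the new center only, instead of recomputing the
      // distances to all c centers picked so far (O(n) per step, not O(n*c)).
      def kMeansPlusPlus(points: Array[Array[Double]], k: Int,
                         rand: Random): Array[Array[Double]] = {
        val centers = new Array[Array[Double]](k)
        centers(0) = points(rand.nextInt(points.length))
        // costs(i) = squared distance from points(i) to its closest center.
        val costs = points.map(p => squaredDistance(p, centers(0)))
        for (c <- 1 until k) {
          // Sample the next center with probability proportional to cost.
          var cumulative = rand.nextDouble() * costs.sum
          var chosen = 0
          while (chosen < points.length - 1 && cumulative >= costs(chosen)) {
            cumulative -= costs(chosen)
            chosen += 1
          }
          centers(c) = points(chosen)
          // The key change: only the distance to the NEW center is computed;
          // the cached cost is updated in place when the new center is closer.
          var i = 0
          while (i < points.length) {
            costs(i) = math.min(costs(i), squaredDistance(points(i), centers(c)))
            i += 1
          }
        }
        centers
      }
    }
    ```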
    
    
    ## Test result
    
    I tested the KMeans++ method on a small dataset with 16k points, and the whole KMeans|| pipeline on a larger one with 240k points.
    The data has 4096 features, and I tuned K from 100 to 500.
    The tests ran on my 4-machine cluster. I also tested a 3M-point dataset on a larger cluster with 25 machines and got similar results, though I do not plot the detailed curves for that run. The results of the first two experiments are shown below.
    
    ### Local KMeans++ test
    
    | Dataset       | Data size | Dimension | Data type |
    |---------------|-----------|-----------|-----------|
    | 4m_ini_center | 16234     | 4096      | Dense     |
    
    Lloyd's iterations = 10
    The y-axis is time in seconds; the x-axis is the value of K.
    
    
![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png)
    
    
![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg)
    
    ### On a larger dataset
    
    An improvement that shows in the graphs but is not committed in this file: in this experiment I also improved the distance calculation for normalized data (where the distance is converted to cosine distance). When the data is normalized to unit norm, one optimization in the original version, util.MLUtils.fastSquaredDistance, has no effect: its precision bound, 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON), can never drop below the precision threshold in this case, since normDiff is 0 and sumSquaredNorm is 2 for unit-norm vectors, which makes the bound a constant far above the threshold. Therefore I designed an early-termination method for comparing two distances (used in findClosest). I do not include that improvement in this file, so please refer only to the curves without "normalize" when comparing the results.
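    
    For reference, a minimal sketch of what such an early-termination comparison could look like (my own illustration, assuming dense Array[Double] points; the actual improvement described above is not part of this PR): stop accumulating the squared distance as soon as it already exceeds the best distance seen so far, so findClosest can reject a distant center without scanning all 4096 features.
    
    ```scala
    object EarlyTerminationSketch {
      // Accumulate the squared distance but bail out once it can no longer
      // beat `bound`; the returned value is then only a lower bound, which
      // is still enough to reject the candidate in findClosest.
      def squaredDistanceWithBound(a: Array[Double], b: Array[Double],
                                   bound: Double): Double = {
        var sum = 0.0
        var i = 0
        while (i < a.length && sum < bound) {
          val d = a(i) - b(i)
          sum += d * d
          i += 1
        }
        sum
      }
    
      // Find the closest center; a center is rejected as soon as its partial
      // distance exceeds the running best, often after a few dimensions.
      def findClosest(centers: Array[Array[Double]],
                      point: Array[Double]): (Int, Double) = {
        var bestIndex = 0
        var bestDist = Double.PositiveInfinity
        var c = 0
        while (c < centers.length) {
          val d = squaredDistanceWithBound(centers(c), point, bestDist)
          // If the inner loop bailed out early, d >= bestDist and the test
          // below fails, which is exactly the correct outcome.
          if (d < bestDist) {
            bestDist = d
            bestIndex = c
          }
          c += 1
        }
        (bestIndex, bestDist)
      }
    }
    ```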
    
    
    
    | Dataset | Data size | Dimension | Data type |
    |---------|-----------|-----------|-----------|
    | 4k24    | 243960    | 4096      | Sparse    |
    
    
    | Normalize | Enlarge | Initialize | Lloyd's Iteration |
    |-----------|---------|------------|-------------------|
    | NO        | 1       | 3          | 5                 |
    | YES       | 10000   | 3          | 5                 |
    
    Notice: the normalized data is enlarged (scaled by the factor above) to preserve precision.
    
    Time cost: the x-axis is the value of K, the y-axis is time in seconds.
    
![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg)
    
    The SE for the unnormalized data between the two versions, to verify correctness:
    
![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg)
    
    Here is the SE for the normalized data, just for reference; it is also correct.
    
![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg)
    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mouendless/spark patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13047
    
----
commit 4e5bc1d8cbbce78d00c54c5b633c30f0799aabb5
Author: DLucky <[email protected]>
Date:   2016-05-11T09:03:50Z

    Update LocalKMeans.scala
    
    After picking each new initial center, it is unnecessary to recompute the distances between all the points and the previously chosen centers. Instead we can keep, for each point, the distance to its closest center, compare it with the distance to the newly picked center, and update the cached value only when the new center is closer.
    
    Experiment:
    I tested the KMeans method on 243960 data points with 4096 features, tuning K from 100 to 500, on my 4-machine cluster.
    The results show that when K is small the two versions perform roughly the same, but as K increases the original version becomes much slower: for K=500 it takes more than 30000 sec while the improved version takes less than 5000 sec.
    I printed the time breakdown, which shows that most of the old version's time is spent in LocalKMeans.

----

