GitHub user derrickburns opened a pull request:

    https://github.com/apache/spark/pull/2634

    [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means 
Clusterer

    This commit introduces a general distance function trait, `PointOps`, for 
the Spark K-Means clusterer. There are no public API changes*.  
    
    ### Issue - Data Capture Test Fails - NEED HELP
    
    The `org.apache.spark.mllib.clustering.KMeansClusterSuite` "task size 
should be small in both training and prediction" fails, suggesting that the RDD 
data is being captured in a closure.  This is quite puzzling. My efforts to 
solve this problem have failed. I *need help* to solve this problem. 
    
    ### Distance Function Trait
    
    The `PointOps` trait defines the distance function.  The `PointOps` trait 
is more than a simple distance function.  It also defines the types of Points 
and Centers for the clusterer.  Standard MLLIB `Vector`s are converted into 
Points and Centers.  In the case of the `FastEuclideanOps` implementation of 
`PointOps`, the Point and Center types includes vector norm members. In other 
distance functions such as the [Kullback-Leibler distance 
function](http://en.wikipedia.org/wiki/Bregman_divergence), the Point and 
Center types include different values that speed up the distance calculation in 
a similar way that caching vector norms speeds up the Euclidean distance 
function.  This addresses SPARK-3219.
    
    ### Refactoring
    
    To understand this original code, I found it useful to refactor the 
original implementation into components.  You may find it helpful to understand 
this pull request by looking at the new components and comparing them to their 
original implementation.  Unfortunately, GitHub diff does not help very much 
with this.
    
    This commit splits up the clusterer into a number of components which 
behave (largely) like their predecessors. `KMeansParallel` implements the 
K-Means || initialization algorithm.  `KMeansRandom` implements the K-Means 
Random initialization algorithm.  `MultiKMeans` implements the K-Means 
algorithm on multiple sets of cluster centers using a given distance function.  
Traits for the initializer, `KMeansInitializer`, and the general K-Means 
clusterer, `MultiKMeansClusterer`, are provided to highlight the salient 
interfaces with the intent that alternative implementations of these interfaces 
may be provided in the future.
    
    ### Performance
    
    This pull request is not focused on performance. Nevertheless,  the 
performance of the KMeans++ implementation was *dramatically* improved by NOT 
recomputing distances to clusters centers that were present in previous steps.  
This turns a quadratic implementation into a linear one. 
    
    Second, the KMeans++ implementation uses the general K-Means clusterer in 
the final step.  This parallelizes a step that was sequential.
    
    Together, these changes address SPARK-3424.
    
    ### Next Steps
    
    This pull request does not introduce new user-visible changes.  The next 
step is to make different distance functions available through a user-visible 
API.   I will provide other distance functions after this pull request has been 
accepted. Then, we can decide on an appropriate user-level API to access those 
functions.
    
    ### Compatibility
    
    While there are no user-level API changes, the behavior of the clusterer is 
*different* on some tests.  Specifically, the handling of empty clusters has 
changed.  Empty clusters are not filled with random points in this 
implementation.  The former behavior is undesirable for a number a reasons, not 
the least of which is that there is no reasonable use for duplicate cluster 
centers. To accommodate the change in behavior, the test cases were changed 
accordingly. This addresses SPARK-3261.
    
    The private K-Means constructor which was used by some test Java code and 
one example was replaced with a Scala constructor that is not Java friendly.  
Since the constructor was not user visible, I simply changed the Java test code 
and the example to use the higher level interface.
    
    ### Testing
    
    This code has been tested (albeit while packaged outside of Spark) and 
performance measured on data sets of millions of features each with hundreds of 
dimensions and on tens of thousands of clusters.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/derrickburns/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2634.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2634
    
----
commit fbfdcd8946d8681c062ba08d71eac961db50417d
Author: Derrick Burns <[email protected]>
Date:   2014-10-02T20:01:18Z

    This commit introduces a general distance function trait, PointOps, for the 
Spark K-Means clusterer. There are no public API changes*.
    
    Distance Function Trait
    
    The PointOps trait defines the distance function. The PointOps trait is 
more than a simple distance function. It also defines the types of Points and 
Centers for the clusterer. Standard MLLIB Vectors are converted into Points and 
Centers. In the case of the FastEuclideanOps implementation of PointOps, the 
Point and Center types includes vector norm members. In other distance 
functions such as the Kullback-Leibler distance function, the Point and Center 
types include different values that speed up the distance calculation in a 
similar way that caching vector norms speeds up the Euclidean distance 
function. This addresses SPARK-3219.
    
    Refactoring
    
    To understand this original code, I found it useful to refactor the 
original implementation into components. You may find it helpful to understand 
this pull request by looking at the new components and comparing them to their 
original implementation. Unfortunately, GitHub diff does not help very much 
with this.
    
    This commit splits up the clusterer into a number of components which 
behave (largely) like their predecessors. KMeansParallel implements the K-Means 
|| initialization algorithm. KMeansRandom implements the K-Means Random 
initialization algorithm. MultiKMeans implements the K-Means algorithm on 
multiple sets of cluster centers using a given distance function. Traits for 
the initializer, KMeansInitializer, and the general K-Means clusterer, 
MultiKMeansClusterer, are provided to highlight the salient interfaces with the 
intent that alternative implementations of these interfaces may be provided in 
the future.
    
    Performance
    
    This pull request is not focused on performance. Nevertheless, the 
performance of the KMeans++ implementation was dramatically improved by NOT 
recomputing distances to clusters centers that were present in previous steps. 
This turns a quadratic implementation into a linear one.
    
    Second, the KMeans++ implementation uses the general K-Means clusterer in 
the final step. This parallelizes a step that was sequential.
    
    Together, these changes address SPARK-3424.
    
    Next Steps
    
    This pull request does not introduce new user-visible changes. The next 
step is to make different distance functions available through a user-visible 
API. I will provide other distance functions after this pull request has been 
accepted. Then, we can decide on an appropriate user-level API to access those 
functions.
    
    Compatibility
    
    While there are no user-level API changes, the behavior of the clusterer is 
different on some tests. Specifically, the handling of empty clusters has 
changed. Empty clusters are not filled with random points in this 
implementation. The former behavior is undesirable for a number a reasons, not 
the least of which is that there is no reasonable use for duplicate cluster 
centers. To accommodate the change in behavior, the test cases were changed 
accordingly. This addresses SPARK-3261.
    
    The private K-Means constructor which was used by some test Java code and 
one example was replaced with a Scala constructor that is not Java friendly. 
Since the constructor was not user visible, I simply changed the Java test code 
and the example to use the higher level interface.
    
    Testing
    
    This code has been tested (albeit while packaged outside of Spark) and 
performance measured on data sets of millions of features each with hundreds of 
dimensions and on tens of thousands of clusters.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to