GitHub user derrickburns opened a pull request:

    https://github.com/apache/spark/pull/2419

    This commit addresses SPARK-3218, SPARK-3219, SPARK-3261, and SPARK-3424...

    ....
    
    This commit introduces a general distance function (PointOps) for the KMeans clusterer. There are no public API changes. However, the behavior of the clusterer differs on some tests. Specifically, the handling of empty clusters has changed: empty clusters are no longer filled with random points in this implementation. The former behavior is undesirable for a number of reasons, not the least of which is that there is no reasonable use for duplicate cluster centers. To accommodate the change in behavior, the test cases were updated accordingly. This addresses SPARK-3261.
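    The new behavior described above can be sketched roughly as follows. This is an illustrative outline only, not the PR's actual code, and the name `nonEmptyCenters` is hypothetical:

```scala
// Hypothetical sketch: after an assignment step, retain only the centers
// that received at least one point, rather than refilling empty clusters
// with random (and therefore duplicate-prone) points.
def nonEmptyCenters(
    centers: IndexedSeq[Array[Double]],
    assignments: Seq[Int]): IndexedSeq[Array[Double]] = {
  // Count how many points were assigned to each center index.
  val counts = assignments.groupBy(identity).map { case (k, v) => k -> v.size }
  // Keep only centers with a nonzero assignment count.
  centers.indices.collect {
    case i if counts.getOrElse(i, 0) > 0 => centers(i)
  }
}
```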
    
    The PointOps trait defines the distance function, but it is more than a simple distance function: it also defines the Point and Center types used by the clusterer. Standard MLlib Vectors are converted into Points and Centers. In the FastEuclideanOps implementation of PointOps, the Point and Center types include vector norm members. In other distance functions, such as the Kullback-Leibler distance function, the Point and Center types include different cached values that speed up the distance calculation, much as caching vector norms speeds up the Euclidean distance function. This addresses SPARK-3219.
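    A minimal sketch of the idea, assuming simplified types (the names `FastPoint` and the signatures below are illustrative, not the PR's exact API): the trait bundles the point and center representations with the distance function, so each distance function can cache whatever per-point values make it fast.

```scala
// A point enriched with its cached squared norm.
final case class FastPoint(coords: Array[Double], normSq: Double)

object FastPoint {
  def apply(coords: Array[Double]): FastPoint =
    FastPoint(coords, coords.map(x => x * x).sum)
}

// The ops trait: converts raw vectors and computes distances.
trait PointOps[P, C] {
  def toPoint(raw: Array[Double]): P
  def toCenter(raw: Array[Double]): C
  def distance(p: P, c: C): Double
}

// Fast Euclidean: |p - c|^2 = |p|^2 + |c|^2 - 2 p.c, reusing cached norms
// so only the dot product is computed per distance call.
object FastEuclideanOps extends PointOps[FastPoint, FastPoint] {
  def toPoint(raw: Array[Double]): FastPoint = FastPoint(raw)
  def toCenter(raw: Array[Double]): FastPoint = FastPoint(raw)
  def distance(p: FastPoint, c: FastPoint): Double = {
    val dot = p.coords.zip(c.coords).map { case (a, b) => a * b }.sum
    math.max(0.0, p.normSq + c.normSq - 2.0 * dot) // squared distance
  }
}
```

A Kullback-Leibler implementation would cache different per-point quantities (e.g. precomputed logarithms) behind the same trait.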
    
    This commit also splits the clusterer into a number of components that behave (largely) like their predecessors. KMeansParallel implements the K-Means || initialization algorithm. KMeansRandom implements the K-Means Random initialization algorithm. MultiKMeans implements the general K-Means algorithm on a given distance function. Traits for the initializer (KMeansInitializer) and the general algorithm (MultiKMeansClusterer) are provided to highlight the salient interfaces.
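    The interfaces named above might look roughly like the following sketch. The PR's real traits operate on RDDs; these simplified versions use plain Scala collections, and `RandomInit` is a toy stand-in for KMeansRandom:

```scala
import scala.util.Random

// Hypothetical shape of the initializer trait: produce k initial centers.
trait KMeansInitializer[P, C] {
  def init(points: Seq[P], k: Int): IndexedSeq[C]
}

// Hypothetical shape of the clusterer trait: returns the clustering cost
// together with the final centers.
trait MultiKMeansClusterer[P, C] {
  def cluster(points: Seq[P], initialCenters: IndexedSeq[C]): (Double, IndexedSeq[C])
}

// Toy stand-in for KMeansRandom: pick k distinct points at random.
final class RandomInit[P](seed: Long) extends KMeansInitializer[P, P] {
  def init(points: Seq[P], k: Int): IndexedSeq[P] =
    new Random(seed).shuffle(points.toIndexedSeq).take(k)
}
```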
    
    The KMeansPlusPlus implementation was sped up by not recomputing distances to cluster centers that were present in previous steps. This dramatically improves the performance of the K-Means Parallel run. This addresses SPARK-3424.
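    The optimization can be illustrated as follows (a sketch, not the PR's code; `updateMinDistances` is a hypothetical helper): each point carries its minimum squared distance to the centers chosen so far, so when a new center is added only that one center needs to be compared, rather than recomputing distances to every previously chosen center.

```scala
// Update each point's running minimum squared distance against one newly
// added center. Cost per added center is O(points * dims), independent of
// how many centers were chosen in previous steps.
def updateMinDistances(
    points: IndexedSeq[Array[Double]],
    minDistSq: Array[Double],
    newCenter: Array[Double]): Unit = {
  var i = 0
  while (i < points.length) {
    val p = points(i)
    var d = 0.0
    var j = 0
    while (j < p.length) {
      val diff = p(j) - newCenter(j)
      d += diff * diff
      j += 1
    }
    if (d < minDistSq(i)) minDistSq(i) = d // keep the smaller distance
    i += 1
  }
}
```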
    
    The next step is to make different distance functions available through the 
user-visible API.  This commit does NOT do that.  Rather, the purpose of this 
commit is to introduce an implementation of a generic K-Means clusterer that 
uses different distance functions.
    
    The private K-Means constructor that was used by some Java test code was replaced with a Scala constructor that is not Java friendly. Since the constructor was not user visible, I simply changed the Java test code to use the higher-level interface.
    
    This code has been tested (albeit while packaged outside of Spark), and its performance has been measured on data sets of millions of features, each with hundreds of dimensions, and on tens of thousands of clusters.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/derrickburns/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2419
    
----
commit 9889af70d9571bf8631b6f89a7f824c1d58cd3a4
Author: Derrick Burns <[email protected]>
Date:   2014-09-16T23:00:46Z

    This commit addresses SPARK-3218, SPARK-3219, SPARK-3261, and SPARK-3424.
    

----

