GitHub user derrickburns opened a pull request:
https://github.com/apache/spark/pull/2634
[SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means
Clusterer
This commit introduces a general distance function trait, `PointOps`, for
the Spark K-Means clusterer. There are no public API changes*.
### Issue - Data Capture Test Fails - NEED HELP
The `org.apache.spark.mllib.clustering.KMeansClusterSuite` "task size
should be small in both training and prediction" fails, suggesting that the RDD
data is being captured in a closure. This is quite puzzling. My efforts to
solve this problem have failed. I *need help* to solve this problem.
### Distance Function Trait
The `PointOps` trait defines the distance function. The `PointOps` trait
is more than a simple distance function. It also defines the types of Points
and Centers for the clusterer. Standard MLLIB `Vector`s are converted into
Points and Centers. In the case of the `FastEuclideanOps` implementation of
`PointOps`, the Point and Center types includes vector norm members. In other
distance functions such as the [Kullback-Leibler distance
function](http://en.wikipedia.org/wiki/Bregman_divergence), the Point and
Center types include different values that speed up the distance calculation in
a similar way that caching vector norms speeds up the Euclidean distance
function. This addresses SPARK-3219.
### Refactoring
To understand this original code, I found it useful to refactor the
original implementation into components. You may find it helpful to understand
this pull request by looking at the new components and comparing them to their
original implementation. Unfortunately, GitHub diff does not help very much
with this.
This commit splits up the clusterer into a number of components which
behave (largely) like their predecessors. `KMeansParallel` implements the
K-Means || initialization algorithm. `KMeansRandom` implements the K-Means
Random initialization algorithm. `MultiKMeans` implements the K-Means
algorithm on multiple sets of cluster centers using a given distance function.
Traits for the initializer, `KMeansInitializer`, and the general K-Means
clusterer, `MultiKMeansClusterer`, are provided to highlight the salient
interfaces with the intent that alternative implementations of these interfaces
may be provided in the future.
### Performance
This pull request is not focused on performance. Nevertheless, the
performance of the KMeans++ implementation was *dramatically* improved by NOT
recomputing distances to clusters centers that were present in previous steps.
This turns a quadratic implementation into a linear one.
Second, the KMeans++ implementation uses the general K-Means clusterer in
the final step. This parallelizes a step that was sequential.
Together, these changes address SPARK-3424.
### Next Steps
This pull request does not introduce new user-visible changes. The next
step is to make different distance functions available through a user-visible
API. I will provide other distance functions after this pull request has been
accepted. Then, we can decide on an appropriate user-level API to access those
functions.
### Compatibility
While there are no user-level API changes, the behavior of the clusterer is
*different* on some tests. Specifically, the handling of empty clusters has
changed. Empty clusters are not filled with random points in this
implementation. The former behavior is undesirable for a number a reasons, not
the least of which is that there is no reasonable use for duplicate cluster
centers. To accommodate the change in behavior, the test cases were changed
accordingly. This addresses SPARK-3261.
The private K-Means constructor which was used by some test Java code and
one example was replaced with a Scala constructor that is not Java friendly.
Since the constructor was not user visible, I simply changed the Java test code
and the example to use the higher level interface.
### Testing
This code has been tested (albeit while packaged outside of Spark) and
performance measured on data sets of millions of features each with hundreds of
dimensions and on tens of thousands of clusters.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/derrickburns/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2634.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2634
----
commit fbfdcd8946d8681c062ba08d71eac961db50417d
Author: Derrick Burns <[email protected]>
Date: 2014-10-02T20:01:18Z
This commit introduces a general distance function trait, PointOps, for the
Spark K-Means clusterer. There are no public API changes*.
Distance Function Trait
The PointOps trait defines the distance function. The PointOps trait is
more than a simple distance function. It also defines the types of Points and
Centers for the clusterer. Standard MLLIB Vectors are converted into Points and
Centers. In the case of the FastEuclideanOps implementation of PointOps, the
Point and Center types includes vector norm members. In other distance
functions such as the Kullback-Leibler distance function, the Point and Center
types include different values that speed up the distance calculation in a
similar way that caching vector norms speeds up the Euclidean distance
function. This addresses SPARK-3219.
Refactoring
To understand this original code, I found it useful to refactor the
original implementation into components. You may find it helpful to understand
this pull request by looking at the new components and comparing them to their
original implementation. Unfortunately, GitHub diff does not help very much
with this.
This commit splits up the clusterer into a number of components which
behave (largely) like their predecessors. KMeansParallel implements the K-Means
|| initialization algorithm. KMeansRandom implements the K-Means Random
initialization algorithm. MultiKMeans implements the K-Means algorithm on
multiple sets of cluster centers using a given distance function. Traits for
the initializer, KMeansInitializer, and the general K-Means clusterer,
MultiKMeansClusterer, are provided to highlight the salient interfaces with the
intent that alternative implementations of these interfaces may be provided in
the future.
Performance
This pull request is not focused on performance. Nevertheless, the
performance of the KMeans++ implementation was dramatically improved by NOT
recomputing distances to clusters centers that were present in previous steps.
This turns a quadratic implementation into a linear one.
Second, the KMeans++ implementation uses the general K-Means clusterer in
the final step. This parallelizes a step that was sequential.
Together, these changes address SPARK-3424.
Next Steps
This pull request does not introduce new user-visible changes. The next
step is to make different distance functions available through a user-visible
API. I will provide other distance functions after this pull request has been
accepted. Then, we can decide on an appropriate user-level API to access those
functions.
Compatibility
While there are no user-level API changes, the behavior of the clusterer is
different on some tests. Specifically, the handling of empty clusters has
changed. Empty clusters are not filled with random points in this
implementation. The former behavior is undesirable for a number a reasons, not
the least of which is that there is no reasonable use for duplicate cluster
centers. To accommodate the change in behavior, the test cases were changed
accordingly. This addresses SPARK-3261.
The private K-Means constructor which was used by some test Java code and
one example was replaced with a Scala constructor that is not Java friendly.
Since the constructor was not user visible, I simply changed the Java test code
and the example to use the higher level interface.
Testing
This code has been tested (albeit while packaged outside of Spark) and
performance measured on data sets of millions of features each with hundreds of
dimensions and on tens of thousands of clusters.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]