GitHub user derrickburns opened a pull request:
https://github.com/apache/spark/pull/2419
This commit addresses SPARK-3218, SPARK-3219, SPARK-3261, and SPARK-3424...
....
This commit introduces a general distance function (PointOps) for the
KMeans clusterer. There are no public API changes. However, the behavior of
the clusterer is different on some tests. Specifically, the handling of empty
clusters has changed. Emply clusters are not filled with random points in this
implementation. The former behavior is undesirable for a number a reasons, not
the least of which is that there is no reasonable use for duplicate cluster
centers. To accommodate the change in behaviour, the test cases were changed
accordingly. This addresses SPARK-3261.
The PointOps trait defines the distance function. The PointOps trait is
more than a simple distance function. It also defines the types of Points and
Centers for the clusterer. Standard MLLIB Vectors are converted into Points
and Centers. In the case of the FastEuclideanOps implementation of PointOps,
the Point and Center types includes vector norm members. In other distance
functions such as the Kullback-Leibler distance function, the Point and Center
types include different values that speed up the distance calculation in a
similar way that caching vector norms speeds up the Euclidean distance
function. This addresses SPARK-3219.
This commit also splits up the clusterer into a number of components which
behave (largely) like their predecessors. KMeansParallel implements the K-Means
|| initialization algorithm. KMeansRandom implements the K-Means Random
initialization algorithm. MultiKMeans implements the general K-Means algorithm
on a given distance function. Traits for the initializer (KMeansInitializer)
and the general algorithm (MultiKMeansClusterer) are provided to highlight the
salient interfaces.
The KMeansPlusPlus implementation was sped up by NOT recomputing distances
to clusters centers that were present in previous steps. This DRAMATICALLY
improves the performance of the K-Means Parallel run. This addresses SPARK-3424.
The next step is to make different distance functions available through the
user-visible API. This commit does NOT do that. Rather, the purpose of this
commit is to introduce an implementation of a generic K-Means clusterer that
uses different distance functions.
The private K-Means constructor which was used by some test Java code was
replaced with a Scala constructor that is not Java friendly. Since the
constructor was not user visible, I simply changed the Java test code to use
the higher level interface.
This code has been tested (albeit while packaged outside of Spark) and
performance measured on data sets of millions of features each with hundreds of
dimensions and on tens of thousands of clusters.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/derrickburns/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2419.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2419
----
commit 9889af70d9571bf8631b6f89a7f824c1d58cd3a4
Author: Derrick Burns <[email protected]>
Date: 2014-09-16T23:00:46Z
This commit addresses SPARK-3218, SPARK-3219, SPARK-3261, and SPARK-3424.
This commit introduces a general distance function (PointOps) for the
KMeans clusterer. There are no public API changes. However, the behavior of
the clusterer is different on some tests. Specifically, the handling of empty
clusters has changed. Emply clusters are not filled with random points in this
implementation. The former behavior is undesirable for a number a reasons, not
the least of which is that there is no reasonable use for duplicate cluster
centers. To accommodate the change in behaviour, the test cases were changed
accordingly. This addresses SPARK-3261.
The PointOps trait defines the distance function. The PointOps trait is
more than a simple distance function. It also defines the types of Points and
Centers for the clusterer. Standard MLLIB Vectors are converted into Points
and Centers. In the case of the FastEuclideanOps implementation of PointOps,
the Point and Center types includes vector norm members. In other distance
functions such as the Kullback-Leibler distance function, the Point and Center
types include different values that speed up the distance calculation in a
similar way that caching vector norms speeds up the Euclidean distance
function. This addresses SPARK-3219.
This commit also splits up the clusterer into a number of components which
behave (largely) like their predecessors. KMeansParallel implements the K-Means
|| initialization algorithm. KMeansRandom implements the K-Means Random
initialization algorithm. MultiKMeans implements the general K-Means algorithm
on a given distance function. Traits for the initializer (KMeansInitializer)
and the general algorithm (MultiKMeansClusterer) are provided to highlight the
salient interfaces.
The KMeansPlusPlus implementation was sped up by NOT recomputing distances
to clusters centers that were present in previous steps. This DRAMATICALLY
improves the performance of the K-Means Parallel run. This addresses SPARK-3424.
The next step is to make different distance functions available through the
user-visible API. This commit does NOT do that. Rather, the purpose of this
commit is to introduce an implementation of a generic K-Means clusterer that
uses different distance functions.
The private K-Means constructor which was used by some test Java code was
replaced with a Scala constructor that is not Java friendly. Since the
constructor was not user visible, I simply changed the Java test code to use
the higher level interface.
This code has been tested (albeit while packaged outside of Spark) and
performance measured on data sets of millions of features each with hundreds of
dimensions and on tens of thousands of clusters.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]