[
https://issues.apache.org/jira/browse/MATH-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058725#comment-17058725
]
Alex Herbert edited comment on MATH-1524 at 3/13/20, 1:45 PM:
--------------------------------------------------------------
{quote}Is the distance between a point and a cluster always defined as the
distance between that point and the cluster's centroid?
{quote}
Depends what you are doing. For hierarchical clustering you have lots of
options, see this:
[Linkage
criteria|https://en.wikipedia.org/wiki/Hierarchical_clustering#Linkage_criteria]
The choice of linkage criterion and the clustering algorithm are not always
codependent. The choice of linkage can affect results: centroid linkage works
well for n-spheres, but it is not suitable for other shapes such as toroids.
For DBSCAN, the linkage is a type of single-linkage clustering. The algorithm
can be extremely fast if built on top of efficient data structures for
neighbourhood search, such as [KD trees|https://en.wikipedia.org/wiki/K-d_tree]
or grid-based storage. This is only possible under certain assumptions about
the data, for example if each dimension is weighted equally and the distance
metric is the Euclidean norm. I note that the DBSCAN algorithm in Math is not
optimal. It could be improved a lot with different data structures for the
processed items, eliminating the use of Set and HashMap.
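To illustrate the grid idea (a hand-rolled 2D sketch using plain double[]
points, not the Math implementation): bin the points into cells of side eps;
an eps-range query then only needs to scan the 3x3 block of cells around the
query point.
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a 2D grid index for eps-neighbourhood queries (Euclidean norm). */
class Grid2D {
    private final double eps;
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    Grid2D(double eps, List<double[]> points) {
        this.eps = eps;
        for (double[] p : points) {
            cells.computeIfAbsent(key(cell(p[0]), cell(p[1])),
                k -> new ArrayList<>()).add(p);
        }
    }

    private int cell(double v) {
        return (int) Math.floor(v / eps);
    }

    private long key(int cx, int cy) {
        // Pack the two cell coordinates into one map key.
        return ((long) cx << 32) ^ (cy & 0xffffffffL);
    }

    /** All points within eps of q: only 9 cells around q are scanned. */
    List<double[]> neighbours(double[] q) {
        List<double[]> result = new ArrayList<>();
        int cx = cell(q[0]);
        int cy = cell(q[1]);
        for (int dx = -1; dx <= 1; dx++) {
            for (int dy = -1; dy <= 1; dy++) {
                for (double[] p : cells.getOrDefault(key(cx + dx, cy + dy), List.of())) {
                    double a = p[0] - q[0];
                    double b = p[1] - q[1];
                    if (a * a + b * b <= eps * eps) {
                        result.add(p);
                    }
                }
            }
        }
        return result;
    }
}{code}
For uniformly distributed data this makes each range query cost proportional
to the points in a few cells rather than the whole data set.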
With regard to your use case, the predict phase would involve searching
against all clusters. As such it would be more easily optimised if the result
of the training phase were not a list of clusters but a ClusteringResult
object. This would be free to contain not only the list of clusters but also
any optimised data structures created during the clustering algorithm. The
ClusteringResult should have a predict method that determines which cluster a
new point should be assigned to.
This would be a change to the entire Clusterer API to return something that
can predict membership of new points. And given that there are a lot of ways
to do this (e.g. centroid linkage, single linkage, etc.) the API would need a
bit of work.
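A minimal sketch of what such a result object could look like
(ClusteringResult is a hypothetical name, and a plain generic T stands in for
the Math types; the predict function captures whatever structures training
built):
{code:java}
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical training result: the clusters found, plus a predictor that
 * can reuse any internal structures built during clustering.
 */
class ClusteringResult<T> {
    private final List<List<T>> clusters;
    private final Function<T, List<T>> predictor;

    ClusteringResult(List<List<T>> clusters, Function<T, List<T>> predictor) {
        this.clusters = clusters;
        this.predictor = predictor;
    }

    /** The clusters found during training. */
    List<List<T>> getClusters() {
        return clusters;
    }

    /** Predict which cluster a new point should be assigned to. */
    List<T> predict(T point) {
        return predictor.apply(point);
    }
}{code}
Each Clusterer implementation could then supply a predictor appropriate to
its own linkage, without the caller seeing the internals.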
It may be simpler to continue to return the list of clusters and then have a
ClusterPredictor interface that can be built using the output of clustering.
You can then build this however you want. In this case you would require a
CentroidClusterPredictor:
{code:java}
import java.util.List;
import java.util.stream.Collectors;

@FunctionalInterface
public interface ClusterPredictor<T extends Clusterable> {
    /**
     * Predict the cluster membership of the given {@link Clusterable} instance.
     *
     * @param point the {@link Clusterable} instance
     * @return the predicted cluster
     */
    Cluster<T> predict(T point);

    // E.g. for centroids (this is not optimal but you get the idea)
    static <T extends Clusterable> ClusterPredictor<T> centroidPredictor(
            List<? extends Cluster<T>> clusters, DistanceMeasure measure) {
        // Extract the raw centroid coordinates once, up front.
        List<double[]> centroids = clusters.stream()
            .map(Cluster::centroid)
            .map(Clusterable::getPoint)
            .collect(Collectors.toList());
        return new ClusterPredictor<T>() {
            @Override
            public Cluster<T> predict(T point) {
                // Linear scan for the nearest centroid.
                double[] a = point.getPoint();
                double min = measure.compute(a, centroids.get(0));
                int mini = 0;
                for (int i = 1; i < centroids.size(); i++) {
                    double d = measure.compute(a, centroids.get(i));
                    if (d < min) {
                        min = d;
                        mini = i;
                    }
                }
                return clusters.get(mini);
            }
        };
    }
}{code}
However, once you have identified the cluster you still need an API to find
the closest points within it.
The suggested closest-point solution would fit your use case for centroid
linkage prediction if it was modified to find the N closest:
{code:java}
Clusterable pointToPredict = ...
int topN = ...
// Search all cluster centroids (topN = 1)
int[] bestClusterIndexes = aUtilClass.closestIndexes(clusters, measure,
    pointToPredict, 1);
Cluster<Clusterable> bestCluster = clusters.get(bestClusterIndexes[0]);
// Search within the best cluster
int[] closestPointsIndexes = aUtilClass.closestIndexes(
    bestCluster.getPoints(), measure, pointToPredict, topN);
// Note: mapToObj (not map) converts the IntStream indexes to objects
List<Clusterable> closestPoints = Arrays.stream(closestPointsIndexes)
    .mapToObj(i -> bestCluster.getPoints().get(i))
    .collect(Collectors.toList());
{code}
Becomes:
{code:java}
public interface Clusterable {
    /**
     * @param list List of points.
     * @param dist Distance measure.
     * @param count The number of closest points.
     * @return the points from the {@code list} that are closest to this point.
     */
    default <T extends Clusterable> List<T> closest(Collection<T> list,
                                                    DistanceMeasure dist,
                                                    int count) {
        // ...
    }
}

Clusterable pointToPredict = ...
List<Clusterable> centroids = ...
Clusterable bestCluster = pointToPredict.closest(centroids, measure, 1).get(0);
int topN = ...
List<Clusterable> closestClusterPoints =
    pointToPredict.closest(/* bestCluster members */, measure, topN);
{code}
You still have to be able to convert the best cluster centroid into a list of
cluster members. But given that you have to create the centroids for each
Cluster, you can make the centroid a custom class that implements Clusterable
and stores a reference to the cluster members.
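For example, such a centroid wrapper might look like this (CentroidWithMembers
is an invented name, and a minimal Clusterable stand-in replaces the Math
interface):
{code:java}
import java.util.List;

/** Minimal stand-in for the Commons Math Clusterable interface. */
interface Clusterable {
    double[] getPoint();
}

/** A centroid that keeps a reference to the members it summarises. */
class CentroidWithMembers implements Clusterable {
    private final double[] point;
    private final List<Clusterable> members;

    CentroidWithMembers(List<Clusterable> members) {
        this.members = members;
        // Component-wise mean of the member points.
        int dim = members.get(0).getPoint().length;
        point = new double[dim];
        for (Clusterable c : members) {
            double[] p = c.getPoint();
            for (int i = 0; i < dim; i++) {
                point[i] += p[i] / members.size();
            }
        }
    }

    @Override
    public double[] getPoint() {
        return point;
    }

    /** The cluster members behind this centroid. */
    List<Clusterable> getMembers() {
        return members;
    }
}{code}
Once the nearest centroid is found, getMembers() gives the member list to
search for the N closest points without any extra lookup structure.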
If the API returns the top N closest then you have to decide if the returned
collection is sorted. The top N can be stored in a
[Heap|https://en.wikipedia.org/wiki/Heap_(data_structure)] for efficient
search. If you forgo the sorted requirement you can return the heap contents
directly (the heap will not be strictly ordered). Otherwise you would have to
sort the heap at the end of the search.
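A sketch of the heap approach (plain double[] points and squared Euclidean
distance assumed instead of the Math API types); the returned list is the
raw, unsorted heap contents:
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopN {
    /** Squared Euclidean distance between two points. */
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /**
     * Return the n points in {@code list} closest to {@code target}.
     * A max-heap of size n holds the current candidates: the root is the
     * farthest candidate and is evicted when a closer point is found.
     * The result is NOT sorted (it is the heap contents in heap order).
     */
    static List<double[]> closest(List<double[]> list, double[] target, int n) {
        PriorityQueue<double[]> heap = new PriorityQueue<>(
            Comparator.comparingDouble((double[] p) -> dist2(p, target)).reversed());
        for (double[] p : list) {
            if (heap.size() < n) {
                heap.add(p);
            } else if (dist2(p, target) < dist2(heap.peek(), target)) {
                heap.poll();
                heap.add(p);
            }
        }
        return new ArrayList<>(heap);
    }
}{code}
(Distances are recomputed for brevity; a real implementation would cache them
alongside the points.) Sorting the returned list at the end recovers the
strict ordering if the API requires it.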
PS. The newly added ClusterEvaluator has no main javadoc.
> "chooseInitialCenters" should move out from KMeansPlusPlusClusterer
> -------------------------------------------------------------------
>
> Key: MATH-1524
> URL: https://issues.apache.org/jira/browse/MATH-1524
> Project: Commons Math
> Issue Type: Improvement
> Reporter: Chen Tao
> Priority: Major
> Attachments: centroid.png, getCenter.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> There are two reasons why "chooseInitialCenters" should be moved out of
> "KMeansPlusPlusClusterer":
> # The k-means++ clusterer is a special case of the k-means clusterer, in
> that k-means++ initializes the cluster centers with the k-means++ algorithm.
> Another option is to initialize the cluster centers with random points.
> # The mini batch k-means will reuse "chooseInitialCenters".
--
This message was sent by Atlassian Jira
(v8.3.4#803005)