[ https://issues.apache.org/jira/browse/MATH-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058725#comment-17058725 ]

Alex Herbert edited comment on MATH-1524 at 3/13/20, 1:46 PM:
--------------------------------------------------------------

{quote}Is the distance between a point and a cluster always defined as the 
distance between that point and the cluster's centroid?
{quote}
It depends on what you are doing. For hierarchical clustering there are many 
options; see:

[Linkage 
criteria|https://en.wikipedia.org/wiki/Hierarchical_clustering#Linkage_criteria]

The choice of linkage criterion is not always tied to the clustering 
algorithm, and the choice of linkage can affect the results. Centroid linkage 
works well for n-spheres, but it is not suitable for other shapes such as 
toroids.
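To make the distinction concrete, here is a minimal sketch (plain Java with made-up names, not the Commons Math API) of centroid linkage versus single linkage as a point-to-cluster distance:

```java
import java.util.List;

// Sketch only: illustrates two linkage criteria as point-to-cluster distances.
class LinkageDemo {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Centroid linkage: distance from the point to the mean of the members. */
    static double centroidDistance(double[] point, List<double[]> cluster) {
        double[] centroid = new double[point.length];
        for (double[] m : cluster) {
            for (int i = 0; i < centroid.length; i++) {
                centroid[i] += m[i] / cluster.size();
            }
        }
        return euclidean(point, centroid);
    }

    /** Single linkage: distance from the point to its nearest member. */
    static double singleLinkage(double[] point, List<double[]> cluster) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] m : cluster) {
            min = Math.min(min, euclidean(point, m));
        }
        return min;
    }
}
```

For an elongated or ring-shaped cluster the two criteria can disagree badly, which is why the choice matters.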

DBSCAN uses a form of single-linkage clustering. The algorithm can be 
extremely fast if built on top of efficient data structures for neighbourhood 
search such as [KD trees|https://en.wikipedia.org/wiki/K-d_tree] or grid-based 
storage. This is only possible under certain assumptions about the data, for 
example that each dimension is weighted equally and the distance metric is the 
Euclidean norm. Note that the DBSCAN algorithm in Math is not optimal. It 
could be improved a lot by replacing the Set and HashMap used to store the 
processed items with different data structures.
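As an illustration of the grid-based approach, here is a minimal 2D sketch (the class and method names are assumptions, not Commons Math code) that bins points into cells of side eps, so an eps-range query under the Euclidean norm only has to inspect the 3x3 block of cells around the query point rather than the whole data set:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: grid-based neighbourhood search for DBSCAN-style range queries.
class GridSearch {
    private final double eps;
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    GridSearch(double eps) {
        this.eps = eps;
    }

    /** Pack the two cell indices into one long for use as a map key. */
    private static long key(long ix, long iy) {
        return (ix << 32) | (iy & 0xffffffffL);
    }

    void add(double[] p) {
        long ix = (long) Math.floor(p[0] / eps);
        long iy = (long) Math.floor(p[1] / eps);
        cells.computeIfAbsent(key(ix, iy), k -> new ArrayList<>()).add(p);
    }

    /** All stored points within eps of the query (including the point itself). */
    List<double[]> neighbours(double[] p) {
        long cx = (long) Math.floor(p[0] / eps);
        long cy = (long) Math.floor(p[1] / eps);
        List<double[]> result = new ArrayList<>();
        for (long dx = -1; dx <= 1; dx++) {
            for (long dy = -1; dy <= 1; dy++) {
                for (double[] q : cells.getOrDefault(key(cx + dx, cy + dy),
                                                     Collections.emptyList())) {
                    double ex = p[0] - q[0];
                    double ey = p[1] - q[1];
                    if (ex * ex + ey * ey <= eps * eps) {
                        result.add(q);
                    }
                }
            }
        }
        return result;
    }
}
```

Because the cell side equals eps, any point within eps of the query must lie in one of the nine inspected cells; this is where the equal-weighting and Euclidean-norm assumptions come in.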

With regard to your use case, the predict phase would involve searching 
against all clusters. This would be easier to optimise if the result of the 
training phase were not a list of clusters but a ClusteringResult object. 
Such an object would be free to hold not only the list of clusters but also 
any optimised data structures created during the clustering algorithm, and it 
should have a predict method to determine which cluster a new point should be 
assigned to.
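A minimal sketch of what such a result type might look like (the names ClusteringResult, getClusters and predict are assumptions, and the two small interfaces are placeholders standing in for the Commons Math types):

```java
import java.util.List;

// Placeholder types standing in for the Commons Math interfaces.
interface Clusterable { double[] getPoint(); }
interface Cluster<T extends Clusterable> { List<T> getPoints(); }

/**
 * Hypothetical sketch: a result object returned by the training phase that
 * keeps the clusters together with any acceleration structures built during
 * clustering, and can assign previously unseen points.
 */
interface ClusteringResult<T extends Clusterable> {
    /** The clusters produced by the training phase. */
    List<Cluster<T>> getClusters();

    /** Predict which cluster a new point should be assigned to. */
    Cluster<T> predict(T point);
}
```

A concrete implementation returned by, say, a DBSCAN clusterer could back predict with the same grid or KD tree it used during training.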

This would be a change to the entire Clusterer API to return something that can 
predict membership of new points. And given that there are a lot of ways to do 
this (e.g. centroid linkage, single linkage, etc) the API would need a bit of 
work.

It may be simpler to continue to return the list of clusters and then have a 
ClusterPredictor interface that can be built using the output of clustering. 
You can then build this however you want. In this case you would require a 
CentroidClusterPredictor:
{code:java}
import java.util.List;
import java.util.stream.Collectors;

@FunctionalInterface
public interface ClusterPredictor<T extends Clusterable> {
    /**
     * Predict the cluster membership of the given {@link Clusterable} instance.
     *
     * @param point the {@link Clusterable} instance
     * @return the predicted cluster
     */
    Cluster<T> predict(T point);

    // E.g. for centroids (this is not optimal but you get the idea).
    // Assumes at least one cluster.
    static <T extends Clusterable> ClusterPredictor<T> centroidPredictor(
        List<? extends Cluster<T>> clusters, DistanceMeasure measure) {
        List<double[]> centroids = clusters.stream()
                                           .map(Cluster::centroid)
                                           .map(Clusterable::getPoint)
                                           .collect(Collectors.toList());
        return new ClusterPredictor<T>() {
            @Override
            public Cluster<T> predict(T point) {
                double[] a = point.getPoint();
                double min = measure.compute(a, centroids.get(0));
                int mini = 0;
                for (int i = 1; i < centroids.size(); i++) {
                    double d = measure.compute(a, centroids.get(i));
                    if (d < min) {
                        min = d;
                        mini = i;
                    }
                }
                return clusters.get(mini);
            }
        };
    }
}{code}
However once you have identified the cluster you still need an API to find the 
closest points in the cluster.

The suggested solution to find the closest point would fit your use case for 
centroid linkage prediction if it were modified to find the N closest:
{code:java}
    Clusterable pointToPredict = ...
    int topN = ...

    // Search across all cluster centroids (topN = 1)
    int[] bestClusterIndexes = aUtilClass.closestIndexes(clusters, measure, pointToPredict, 1);
    Cluster bestCluster = clusters.get(bestClusterIndexes[0]);

    // Search within the best cluster
    int[] closestPointsIndexes = aUtilClass.closestIndexes(bestCluster.getPoints(), measure, pointToPredict, topN);
    List<Clusterable> closestPoints = Arrays.stream(closestPointsIndexes)
                                            .mapToObj(i -> bestCluster.getPoints().get(i))
                                            .collect(Collectors.toList());
{code}
Becomes:
{code:java}
public interface Clusterable {
    /**
     * @param list List of points.
     * @param dist Distance measure.
     * @param count The number of closest points.
     * @return the points from the {@code list} that are closest to this point.
     */
    default <T extends Clusterable> List<T> closest(Collection<T> list,
                                                    DistanceMeasure dist,
                                                    int count) {
        // ...
    }
}

    Clusterable pointToPredict = ...
    List<Clusterable> centroids = ...
    Clusterable bestCluster = pointToPredict.closest(centroids, measure, 1).get(0);
    int topN = ...
    List<Clusterable> closestClusterPoints =
        pointToPredict.closest(/* bestCluster members */, measure, topN);
{code}
 
You still have to be able to convert the best cluster centroid back into a 
list of cluster members. But since you have to create a centroid for each 
Cluster anyway, you can make the centroid a custom class that implements 
Clusterable and stores a reference to its cluster members.
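A minimal sketch of such a centroid class (hypothetical names; the small Clusterable interface is a placeholder standing in for the Commons Math one):

```java
import java.util.List;

// Placeholder standing in for the Commons Math interface.
interface Clusterable { double[] getPoint(); }

/**
 * Sketch: a centroid that is itself Clusterable and remembers the cluster
 * members it was computed from, so the best centroid found by a closest()
 * search can be converted straight back into its member list.
 */
class CentroidPoint implements Clusterable {
    private final double[] centre;
    private final List<? extends Clusterable> members;

    CentroidPoint(List<? extends Clusterable> members) {
        this.members = members;
        int dim = members.get(0).getPoint().length;
        centre = new double[dim];
        for (Clusterable m : members) {
            double[] p = m.getPoint();
            for (int i = 0; i < dim; i++) {
                centre[i] += p[i] / members.size();
            }
        }
    }

    @Override
    public double[] getPoint() {
        return centre;
    }

    /** The cluster members this centroid was computed from. */
    List<? extends Clusterable> getMembers() {
        return members;
    }
}
```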

If the API returns the top N closest then you have to decide whether the 
returned collection is sorted. The top N can be stored in a 
[Heap|https://en.wikipedia.org/wiki/Heap_(data_structure)] for an efficient 
search. If you forgo the sorted requirement you can return the heap contents 
directly (a heap is not strictly ordered); otherwise you would have to sort 
the heap at the end of the search.
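A minimal sketch of the bounded-heap approach using java.util.PriorityQueue (the class and helper names are made up): the heap is ordered so its head is the farthest of the N points kept so far, so each candidate costs O(log N) to process.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: top-N closest points via a bounded max-heap on distance.
class TopN {

    /** Squared Euclidean distance: preserves ordering, avoids the sqrt. */
    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static List<double[]> closest(double[] query, Collection<double[]> points, int n) {
        // Max-heap: the farthest of the kept points sits at the head.
        PriorityQueue<double[]> heap = new PriorityQueue<>(
            Comparator.comparingDouble((double[] p) -> dist(query, p)).reversed());
        for (double[] p : points) {
            heap.add(p);
            if (heap.size() > n) {
                heap.poll(); // drop the current farthest
            }
        }
        List<double[]> result = new ArrayList<>(heap);
        // The heap contents are not strictly ordered; sort only if the
        // caller requires a sorted result.
        result.sort(Comparator.comparingDouble(p -> dist(query, p)));
        return result;
    }
}
```

Dropping the final sort gives exactly the "return the heap contents directly" option described above.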

 

PS. The newly added ClusterEvaluator has no main javadoc.

 



> "chooseInitialCenters" should move out from KMeansPlusPlusClusterer
> -------------------------------------------------------------------
>
>                 Key: MATH-1524
>                 URL: https://issues.apache.org/jira/browse/MATH-1524
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Chen Tao
>            Priority: Major
>         Attachments: centroid.png, getCenter.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are two reasons why "chooseInitialCenters" should be moved out of 
> "KMeansPlusPlusClusterer":
> # The k-means++ clusterer is a special case of the k-means clusterer that 
> initializes the cluster centers with the k-means++ algorithm. Another case is 
> initializing the cluster centers with random points.
> # The mini batch k-means will reuse "chooseInitialCenters". 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
