[ https://issues.apache.org/jira/browse/MATH-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruiqi Dong updated MATH-1683: ----------------------------- Priority: Minor (was: Major) > ElkanKMeansPlusPlusClusterer throws ArrayIndexOutOfBoundsException when all > input points are identical > ------------------------------------------------------------------------------------------------------ > > Key: MATH-1683 > URL: https://issues.apache.org/jira/browse/MATH-1683 > Project: Commons Math > Issue Type: Bug > Components: legacy > Affects Versions: 4.0-beta1 > Reporter: Ruiqi Dong > Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > The {{ElkanKMeansPlusPlusClusterer}} class throws an > {{ArrayIndexOutOfBoundsException}} when attempting to cluster a dataset where > all points are identical and k > 1. This occurs during the k-means++ > initialization phase in the {{seed()}} method. > *Test Case:* > > {code:java} > @Test > public void testAllPointsIdentical() { > // Create 4 identical points > List<DoublePoint> points = Arrays.asList( > new DoublePoint(new double[]{0.0, 0.0}), > new DoublePoint(new double[]{0.0, 0.0}), > new DoublePoint(new double[]{0.0, 0.0}), > new DoublePoint(new double[]{0.0, 0.0}) > ); > > // Attempt to cluster into 2 clusters > ElkanKMeansPlusPlusClusterer<DoublePoint> clusterer = > new ElkanKMeansPlusPlusClusterer<>(2); > > // This throws ArrayIndexOutOfBoundsException > clusterer.cluster(points); > }{code} > *Test Result:* > > > {code:java} > java.lang.ArrayIndexOutOfBoundsException: -1 > at java.util.ArrayList.elementData(ArrayList.java:424) > at java.util.ArrayList.get(ArrayList.java:437) > at > org.apache.commons.math4.legacy.ml.clustering.ElkanKMeansPlusPlusClusterer.seed(ElkanKMeansPlusPlusClusterer.java:244) > at > org.apache.commons.math4.legacy.ml.clustering.ElkanKMeansPlusPlusClusterer.cluster(ElkanKMeansPlusPlusClusterer.java:128) > {code} > *Root Cause Analysis:* > In the {{seed()}} method, the k-means++ initialization algorithm fails when > all points are identical: > # After selecting the first center, all distances to this center are 0 > (since all points are identical) > # {{sumSqDist}} becomes 0 as all {{minDistances[i] = 0}} > # In subsequent iterations, {{p = sumSqDist * random.nextDouble()}} equals 0 > # The loop {{for (double cdf = 0; cdf < p; next++)}} never executes because > {{0 < 0}} is false > # {{next}} remains 0, causing {{points.get(next - 1)}} to access index -1 > -- This message was sent by Atlassian Jira (v8.20.10#820010)