[
https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915814#action_12915814
]
Derek O'Callaghan commented on MAHOUT-513:
------------------------------------------
Hi Jeff,
In this case it appears that there are ~20 points in the cluster, and they're
all almost identical to each other. It's a text-clustering problem, using
reduced dimensionality, and these original 20 points have almost identical
terms. I'm not sure what the solution is either. This is an acceptable cluster
that just happens to be quite dense, so it'd be good to see it in the results.
Having said that, the average density will then be skewed, as you say, because
the remaining clusters in this case are nowhere near as dense. I need to think
about it a bit more.
I'm also getting a couple of strange values in the CDbwEvaluator. I suspect it
could be a similar issue, but I haven't had a chance to confirm it yet.
Thanks,
Derek
> ClusterEvaluator inter-cluster density returns NaN
> --------------------------------------------------
>
> Key: MAHOUT-513
> URL: https://issues.apache.org/jira/browse/MAHOUT-513
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.3
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Fix For: 0.4
>
>
> Hi Jeff,
> I've been trying out the ClusterEvaluator class today since your recent
> changes, and I'm running into a problem whereby the average intra-cluster
> density can be set to NaN. Looking into it, it seems to happen for clusters
> containing points which are very close to the centroid. For example, I have
> a cluster with:
> Centroid:
> {0:0.6075199543688895,1:-0.3165058387409551,2:0.2027106147825682,3:-21.246338574215706,4:-5.875047828899212,5:-0.9835694086952028,6:0.2794019939470805,7:-0.36402079609289717,8:0.5201946127074457,9:-0.47084217746293855,10:-0.14380397719670499,11:-0.10441028152861193,12:0.0698485086335405,13:0.014286758874801297}
> and one of the representative points (3 per cluster):
> [0.6075199543688894, -0.31650583874095506, 0.2027106147825682,
> -21.2463385742157, -5.875047828899212, -0.9835694086952026,
> 0.27940199394708054, -0.36402079609289706, 0.5201946127074457,
> -0.47084217746293855, -0.14380397719670499, -0.10441028152861194,
> 0.06984850863354047, 0.014286758874801297]
> As far as I can tell from debugging, the representative points look identical
> to the centroid of this cluster, but I'm assuming there's some small
> difference as "if (!vector.equals(clusterI.getCenter()))" in
> ClusterEvaluator.invalidCluster() is always returning false for these points,
> and so the cluster isn't pruned from the list.
> Later on, in ClusterEvaluator.intraClusterDensity(), the "min" and "max"
> distances are ending up with the same value, and the density from "double
> density = (sum / count - min) / (max - min);" is calculated as NaN, e.g. here
> are the values I'm getting:
> min = max = 1.5397509610616733E-7
> count = 3
> sum = 4.61925288318502E-7
> max - min: 0.0
> count - min: 2.9999998460249038
> (sum / count - min) = 0.0
> This then causes avgDensity to be calculated as NaN. I'm not sure what the
> solution is here. Should invalidCluster() check that the difference
> between the centroid and the candidate representative point is greater than a
> certain threshold, which would cause such a cluster to be pruned? Or is the
> fix in the intraClusterDensity() calculation to handle the case where min =
> max?
> BTW would you prefer that I create a Jira to record these issues, or is it
> okay to send them to the dev list as I've been doing?
> Thanks,
> Derek
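
To make the two options in the description concrete, here is a minimal Java
sketch. It is not the ClusterEvaluator code: the class and method names
(DensityGuardSketch, rawDensity, guardedDensity, nearlyEqual), the epsilon and
the fallback value are all made up for illustration. Only the density formula
and the sample numbers come from the description above, and the centroid and
representative point are truncated to their first three components.

```java
// Illustrative sketch only; not the Mahout ClusterEvaluator implementation.
class DensityGuardSketch {

    // The formula quoted in the description. The result is NaN when the
    // numerator and the denominator are both 0.0 (i.e. min == max == sum / count).
    static double rawDensity(double sum, int count, double min, double max) {
        return (sum / count - min) / (max - min);
    }

    // Option: handle min == max inside the density calculation.
    // The fallback value is a hypothetical parameter; which value (if any)
    // is appropriate for a very dense cluster is an open question.
    static double guardedDensity(double sum, int count, double min, double max,
                                 double fallback) {
        double range = max - min;
        if (range == 0.0) {
            return fallback;
        }
        return (sum / count - min) / range;
    }

    // Option: treat a representative point as equal to the centroid when the
    // Euclidean distance between them is within a threshold (epsilon is an
    // assumed, caller-chosen tolerance), instead of requiring exact equality.
    static boolean nearlyEqual(double[] a, double[] b, double epsilon) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sumSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sumSq += d * d;
        }
        return Math.sqrt(sumSq) <= epsilon;
    }

    public static void main(String[] args) {
        // Values reported in the description.
        double min = 1.5397509610616733E-7;
        double max = 1.5397509610616733E-7;
        double sum = 4.61925288318502E-7;
        int count = 3;

        System.out.println("raw density:     " + rawDensity(sum, count, min, max));
        System.out.println("guarded density: " + guardedDensity(sum, count, min, max, 0.0));

        // First three components of the centroid and representative point.
        double[] centroid = {0.6075199543688895, -0.3165058387409551, 0.2027106147825682};
        double[] point = {0.6075199543688894, -0.31650583874095506, 0.2027106147825682};
        System.out.println("nearly equal:    " + nearlyEqual(centroid, point, 1e-9));
    }
}
```

What the fallback should be when min = max is exactly the open question here,
so the sketch makes it an explicit parameter rather than picking an answer.
Pruning via a threshold would instead drop the dense cluster from the results.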