Hi Derek,

Let's consider why the intra-cluster density is being normalized by (max-min) in the first place. I confess I don't understand why the inter-cluster density is so normalized, but I copied the pattern from it out of blind faith.

Sean, Robin, Ted: One of you guys evidently wrote the inter-cluster density computation but did not include an intra-cluster computation in "Mahout in Action". The CDbwEvaluator calculates both using only the representative points (and may have been transcribed incorrectly from the paper to boot). Please chime in.

On 9/28/10 12:25 PM, Derek O'Callaghan (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915814#action_12915814 ]

Derek O'Callaghan commented on MAHOUT-513:
------------------------------------------

Hi Jeff,

In this case it appears that there are ~20 points in the cluster, and they're 
all almost identical to each other. It's a text-clustering problem, using 
reduced dimensionality, and these original 20 points have almost identical 
terms. I'm not sure what the solution is either. This is an acceptable cluster 
that happens to be quite dense, so it'd be good to see this in the results. 
Having said that, the average density will then be skewed as you say, as the 
remaining clusters in this case are nowhere near as dense. I need to think 
about it a bit more.

I'm also getting a couple of strange values in the CDbwEvaluator. I suspect it 
could be a similar issue, but I haven't had a chance to confirm it yet.

Thanks,

Derek

ClusterEvaluator inter-cluster density returns NaN
--------------------------------------------------

                 Key: MAHOUT-513
                 URL: https://issues.apache.org/jira/browse/MAHOUT-513
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.3
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman
             Fix For: 0.4


Hi Jeff,
I've been trying out the ClusterEvaluator class today since your recent 
changes, and I'm running into a problem whereby the average intra-cluster 
density can be set to NaN. Looking into it, it seems to happen for clusters 
containing points which are very close to the centroid.  For example, I have a 
cluster with:
Centroid:
{0:0.6075199543688895,1:-0.3165058387409551,2:0.2027106147825682,3:-21.246338574215706,4:-5.875047828899212,5:-0.9835694086952028,6:0.2794019939470805,7:-0.36402079609289717,8:0.5201946127074457,9:-0.47084217746293855,10:-0.14380397719670499,11:-0.10441028152861193,12:0.0698485086335405,13:0.014286758874801297}
and one of the representative points (3 per cluster):
[0.6075199543688894, -0.31650583874095506, 0.2027106147825682, 
-21.2463385742157, -5.875047828899212, -0.9835694086952026, 
0.27940199394708054, -0.36402079609289706, 0.5201946127074457, 
-0.47084217746293855, -0.14380397719670499, -0.10441028152861194, 
0.06984850863354047, 0.014286758874801297]
As far as I can tell from debugging, the representative points look identical to the 
centroid of this cluster, but I'm assuming there's some small difference as "if 
(!vector.equals(clusterI.getCenter()))" in ClusterEvaluator.invalidCluster() is 
always returning false for these points, and so the cluster isn't pruned from the list.
Later on, in ClusterEvaluator.intraClusterDensity(), the "min" and "max" distances are 
ending up with the same value, and the density from "double density = (sum / count - min) / (max - 
min);" is calculated as NaN, e.g. here are the values I'm getting:
min = max = 1.5397509610616733E-7
count = 3
sum = 4.61925288318502E-7
max - min: 0.0
count - min: 2.9999998460249038
(sum / count - min) = 0.0
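For illustration, the 0/0 case can be reproduced in isolation. This is only a minimal sketch and not the ClusterEvaluator code: the class name NanDensityDemo and the input values are assumptions, with the values chosen to be exactly representable so that the numerator and denominator are both exactly 0.0.

```java
public class NanDensityDemo {
    // The density expression quoted from intraClusterDensity() above.
    static double density(double sum, int count, double min, double max) {
        return (sum / count - min) / (max - min);
    }

    public static void main(String[] args) {
        // Hypothetical values: three distances of exactly 0.25, so min == max.
        double d = 0.25;
        double result = density(3 * d, 3, d, d);

        // Floating-point 0.0 / 0.0 yields NaN rather than throwing an exception.
        System.out.println(Double.isNaN(result));

        // A single NaN term makes an average over several clusters NaN as well.
        // The other two densities (0.5 and 0.7) are made up for the example.
        double avg = (0.5 + result + 0.7) / 3;
        System.out.println(Double.isNaN(avg));
    }
}
```

Because no exception is raised, the problem only shows up once the averaged value is inspected.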
This then causes avgDensity to be calculated as NaN. I'm not sure what the 
solution is here. Should invalidCluster() check that the difference between 
the centroid and the candidate representative point is greater than a certain 
threshold, which would cause such a cluster to be pruned? Or is the fix in the 
intraClusterDensity() calculation to handle the case where min = max?
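To make the two options concrete, here is a rough sketch of each in plain Java. It does not use the Mahout Vector or Cluster classes. The class DensityGuards, the methods coincides() and safeDensity(), the EPSILON value, and the choice of returning 0.0 for a degenerate cluster are all assumptions for illustration, not what ClusterEvaluator does.

```java
public class DensityGuards {
    // Hypothetical tolerance; a real value would depend on the data scale.
    static final double EPSILON = 1e-9;

    static double euclidean(double[] a, double[] b) {
        double sumSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSq += diff * diff;
        }
        return Math.sqrt(sumSq);
    }

    // Option 1: treat a representative point as coinciding with the centroid
    // when it lies within a tolerance, instead of requiring exact equality.
    static boolean coincides(double[] point, double[] centroid, double epsilon) {
        return euclidean(point, centroid) <= epsilon;
    }

    // Option 2: guard the division so a zero range never produces NaN.
    // Returning 0.0 here is an arbitrary placeholder; skipping the cluster
    // in the average would be another possible choice.
    static double safeDensity(double sum, int count, double min, double max) {
        double range = max - min;
        if (count == 0 || range == 0.0) {
            return 0.0;
        }
        return (sum / count - min) / range;
    }

    public static void main(String[] args) {
        // First and fourth components of the centroid and point quoted above.
        double[] centroid = {0.6075199543688895, -21.246338574215706};
        double[] point = {0.6075199543688894, -21.2463385742157};
        System.out.println(coincides(point, centroid, EPSILON));
        System.out.println(safeDensity(0.75, 3, 0.25, 0.25));
    }
}
```

Option 1 prunes the near-degenerate cluster before any density is computed, while option 2 keeps the cluster and only protects the arithmetic, so the two lead to different averages.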
BTW would you prefer that I create a Jira to record these issues, or is it okay 
to send them to the dev list as I've been doing?
Thanks,
Derek
