[
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415813#comment-13415813
]
Pat Ferrel edited comment on MAHOUT-1045 at 7/17/12 12:30 AM:
--------------------------------------------------------------
I get:
Inter-Cluster Density: 0.9464271269766443
Intra-Cluster Density: 0.593190786304747
CDbw Inter-Cluster Density: 0.0
CDbw Intra-Cluster Density: 1050.0723680608382
CDbw Separation: 187792.32137017616
along with lots of NaNs for individual clusters. I finally got the data into my
UI, so I can see why we are getting NaNs: lots of clusters where the pages are
nearly or exactly identical. The NaNs are a red flag, and for most applications
I expect they will be of use in making clustering output have a gooey cream
center--yum.
I think this line may be wrong. You want to divide by the number of
valid/non-NaN clusters, don't you?
avgDensity = clusters.isEmpty() ? 0 : avgDensity / clusters.size();
The Intra-Cluster Density: 0.593190786304747 looks skewed low if you look at
the per cluster output.
The actual effect of the singularity clusters is that they are super dense,
which leads one to wonder if this shouldn't be reflected in the average somehow.
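A minimal sketch of the averaging suggested above, assuming the NaN densities
are already left out of the running sum and only the divisor needs to change.
The class and method names (DensityAverage, averageIgnoringNaN) and the sample
values are hypothetical and not taken from the Mahout source:

```java
import java.util.Arrays;
import java.util.List;

public class DensityAverage {

  // Hypothetical helper, not part of Mahout: averages per-cluster densities,
  // counting only the clusters whose density is a real number.
  static double averageIgnoringNaN(List<Double> densities) {
    double sum = 0.0;
    int valid = 0;
    for (double d : densities) {
      if (!Double.isNaN(d)) {
        sum += d;
        valid++;
      }
    }
    // Same empty-input convention as the quoted line: return 0 when there
    // is nothing to average.
    return valid == 0 ? 0.0 : sum / valid;
  }

  public static void main(String[] args) {
    // Made-up values: two of the four clusters have NaN density.
    List<Double> densities = Arrays.asList(0.5, Double.NaN, 0.75, Double.NaN);
    System.out.println(averageIgnoringNaN(densities));
  }
}
```

With these made-up values, dividing by all four clusters would halve the
average relative to dividing by the two valid ones, which is the kind of
low skew described above.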
> Cluster evaluators returning bad results
> ----------------------------------------
>
> Key: MAHOUT-1045
> URL: https://issues.apache.org/jira/browse/MAHOUT-1045
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6, 0.7, 0.8
> Environment: Several environments and data sets
> Reporter: Pat Ferrel
> Fix For: 0.8
>
> Attachments: MAHOUT-1045.patch, MAHOUT-1045.patch,
> first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have
> also seen several cases where CDbw fails to return any results but have not
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff
> Eastman.