[
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425091#comment-13425091
]
Pat Ferrel commented on MAHOUT-1045:
------------------------------------
I've had a chance to run this on several data sets and see no problems. Below
are the results I get for a couple of sets. For two completely different crawls
I see a steadily increasing CDbw validity index as k increases. This doesn't
tell me much. I was hoping to get an indication of an optimal k for a given
data set, but the index doesn't seem to provide that. I may need to try much
smaller increments of k, although that isn't very practical.
If the data has some clumps at a given scale and other clumps at another scale
(which may well be the case) then using the per cluster measures and linking
different scales together may give better results. The per cluster value for
intra-cluster density does vary around a mean in what seems to be a normal or
log-normal distribution. So tossing clusters based on a sigma test (dropping
those too many standard deviations from the mean) might be a good idea.
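A minimal sketch of such a sigma test, assuming the per-cluster intra-cluster
densities have already been collected into a mapping from cluster id to value.
The function name, the two-sided 2-sigma threshold, and the sample values are
all hypothetical illustrations, not Mahout code or measured results. The
log_scale flag covers the log-normal case by testing the logs of the values.

```python
import math
import statistics


def sigma_filter(densities, n_sigma=2.0, log_scale=False):
    """Split cluster ids into (kept, tossed) with a two-sided sigma test.

    densities: dict mapping cluster id -> intra-cluster density.
    log_scale: test log(density) instead, for a log-normal distribution
               (requires every density to be > 0).
    """
    if log_scale:
        values = {cid: math.log(d) for cid, d in densities.items()}
    else:
        values = dict(densities)

    data = list(values.values())
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)  # population standard deviation

    kept, tossed = [], []
    for cid, value in values.items():
        # A cluster is tossed when it lies more than n_sigma deviations out.
        if std > 0 and abs(value - mean) > n_sigma * std:
            tossed.append(cid)
        else:
            kept.append(cid)
    return kept, tossed


# Hypothetical densities: five similar clusters and one very sparse one.
sample = {0: 0.66, 1: 0.64, 2: 0.67, 3: 0.65, 4: 0.63, 5: 0.05}
kept, tossed = sigma_filter(sample)
print("kept:", kept, "tossed:", tossed)
```

Whether to toss only the low-density tail or both tails is a judgment call; the
sketch uses both tails only because that is the simplest form of the test.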
I vote to close this bug.
I also vote to have the per cluster data and new scaled inter-cluster density
put into the ClusterDumpDriver's output file instead of leaving them in the
logger output.
--------------------------------------
Cluster Eval for small crawl for 34487 pages, 76156 terms

clusters  CDbw Inter-Cluster Density  CDbw Intra-Cluster Density  CDbw Separation   CDbw Validity Index
500       0                           1050.07236806084            187792.321370176  1.97E+08
1000      0                           2224.68724332853            463618.902327256  1.03E+09
2000      0                           3129.61404306957            1863976.83410554  5.83E+09

clusters  Average Inter-cluster Density  Average Intra-cluster Density  Scaled Inter-cluster Density
500       0.928988162001239              0.666506501466008              0.946427126976625
1000      0.945889416532285              0.643526550057345              0.945889416532285
2000      0.947064274614474              0.616175031765541              0.947064274614474

Cluster Eval for small crawl for 9686 docs and 27305 terms

clusters  CDbw Inter-Cluster Density  CDbw Intra-Cluster Density  CDbw Separation   CDbw Validity Index
300       0                           1377.66044402807            37674.9918329275  5.19E+07
400       0                           1317.72330604304            70756.173673846   9.32E+07
500       0                           1386.28300980349            112213.199968639  1.56E+08
1000      0                           2485.55318093317            466293.691473308  1.16E+09
2000      0                           4930.2878999304             1892114.62345989  9.33E+09
3000      0.00E+00

clusters  Average Inter-cluster Density  Average Intra-cluster Density
300       0.953138888958119              0.662978461499837
400       0.949625717325956              0.653041070121705
500       0.952490341799314              0.639917990128385
1000      0.953935715416424              0.587833106832454
2000      0.954298464279533              0.55663214080861
3000      0.955378051490676              0.537693888785647

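The numbers for the 9686-doc crawl can be checked with a short script. The
sketch below tests two things: that each reported CDbw validity index agrees,
to the three significant digits shown, with the product of CDbw intra-cluster
density and CDbw separation, and that the index rises strictly with k. The
product relation is an inference from these figures, not something stated in
this comment, and the helper names are hypothetical. The k=3000 row is left
out because its CDbw values are incomplete.

```python
# (k, CDbw intra-cluster density, CDbw separation, reported CDbw validity index)
# Values copied from the 9686-doc results above.
ROWS = [
    (300, 1377.66044402807, 37674.9918329275, 5.19e07),
    (400, 1317.72330604304, 70756.173673846, 9.32e07),
    (500, 1386.28300980349, 112213.199968639, 1.56e08),
    (1000, 2485.55318093317, 466293.691473308, 1.16e09),
    (2000, 4930.2878999304, 1892114.62345989, 9.33e09),
]


def product_matches(intra, separation, reported, rel_tol=0.005):
    """True if intra * separation equals the reported index within rel_tol.

    rel_tol=0.005 allows for the reported index being rounded to three
    significant digits.
    """
    return abs(intra * separation - reported) <= rel_tol * abs(reported)


def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


consistent = all(product_matches(i, s, v) for _, i, s, v in ROWS)
increasing = is_strictly_increasing([v for _, _, _, v in ROWS])
print("validity == intra * separation (to rounding):", consistent)
print("validity strictly increasing with k:", increasing)
```

If the product relation holds in general, the index will keep growing for as
long as separation grows with k, which would explain why it gives no sign of
an optimal k here.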
> Cluster evaluators returning bad results
> ----------------------------------------
>
> Key: MAHOUT-1045
> URL: https://issues.apache.org/jira/browse/MAHOUT-1045
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6, 0.7, 0.8
> Environment: Several environments and data sets
> Reporter: Pat Ferrel
> Fix For: 0.8
>
> Attachments: MAHOUT-1045.patch, MAHOUT-1045.patch, MAHOUT-1045.patch,
> MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have
> also seen several cases where CDbw fails to return any results but have not
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff
> Eastman.