[ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414774#comment-13414774 ]
Jeff Eastman commented on MAHOUT-1045:
--------------------------------------
I also get the first NaN with cluster 33465. It has 6 representative points:
5 identical vectors like this:
{3077:0.09790894164219517,4973:0.1884642719752401,6340:0.11433534252102742,7096:0.26375115412812483,7729:0.1579332024718107,8266:0.2592449794079855,9311:0.10461220497464459,11472:0.06164575325915021,13427:0.1747753834897376,13438:0.06393054441982463,14399:0.16359365494209394,14692:0.06929109554243788,15186:0.17966648450303982,15780:0.046454420041688316,15791:0.0731677731970443,21692:0.2244867856209188,22814:0.16150377853136402,23483:0.130108231430041,25323:0.08123791103459937,31633:0.266528727390838,32172:0.17767387631551967,32522:0.08487072539355776,33136:0.1370203379603993,33815:0.2873453848941226,39598:0.07758306663660308,48009:0.1279477350859634,50625:0.24661162653957963,52392:0.1555548032973563,53378:0.08022117855049148,54994:0.11022622928504641,59960:0.10656817176360436,60167:0.112475120915112,60808:0.19365247390752108,61246:0.12696983521304098,62779:0.15779491042479035,68657:0.12754867891049163,68768:0.15258151446035362,70703:0.1207780185942059,70936:0.0663956144057571,71349:0.07686518411114775,71912:0.10194260052056149,73137:0.20056701730214407,75223:0.06801198254209008}
and one (the cluster center) like this:
{13438:0.06393054441982461,6340:0.11433534252102741,33815:0.2873453848941226,39598:0.07758306663660308,54994:0.11022622928504643,11472:0.06164575325915022,4973:0.1884642719752401,13427:0.17477538348973762,23483:0.130108231430041,7729:0.15793320247181072,8266:0.2592449794079856,50625:0.24661162653957966,48009:0.12794773508596344,14399:0.16359365494209396,73137:0.20056701730214407,53378:0.08022117855049148,7096:0.26375115412812483,59960:0.10656817176360435,52392:0.1555548032973563,15186:0.17966648450303982,9311:0.1046122049746446,3077:0.09790894164219519,25323:0.08123791103459939,32172:0.17767387631551967,71349:0.07686518411114775,15791:0.0731677731970443,32522:0.08487072539355776,21692:0.22448678562091878,62779:0.15779491042479035,60167:0.11247512091511201,22814:0.16150377853136402,33136:0.13702033796039934,15780:0.046454420041688316,68657:0.12754867891049163,31633:0.26652872739083794,68768:0.15258151446035365,60808:0.1936524739075211,75223:0.06801198254209008,61246:0.12696983521304098,70703:0.1207780185942059,14692:0.06929109554243788,70936:0.0663956144057571,71912:0.10194260052056149}
Somehow, the CosineDistanceMeasure computes the distance between these two as
0. It seems like it should be closer to 1, but I don't know why it isn't. In
this computation, lengthSquaredV1 = lengthSquaredV2 = dotProduct =
1.0000000000000002.
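To make the arithmetic concrete, here is a minimal sketch of a cosine distance
computed from the three quantities quoted above. It assumes the usual
definition, distance = 1 - dotProduct / (|v1| * |v2|), plus a clamp for
floating-point rounding. The class and method names are hypothetical, and this
is not presented as the actual CosineDistanceMeasure source. Under that
definition, two vectors that differ only in the last digits have a cosine
similarity of about 1, so the distance comes out as 0.

```java
// Hypothetical sketch; not the Mahout CosineDistanceMeasure implementation.
class CosineDistanceSketch {

    // Cosine distance from precomputed squared lengths and dot product.
    static double cosineDistance(double lengthSquaredV1,
                                 double lengthSquaredV2,
                                 double dotProduct) {
        double denominator = Math.sqrt(lengthSquaredV1) * Math.sqrt(lengthSquaredV2);
        // Assumed guard: rounding can leave the denominator slightly below the
        // dot product, which would make the distance slightly negative.
        if (denominator < dotProduct) {
            denominator = dotProduct;
        }
        // Assumed guard for zero vectors, to avoid dividing by zero here.
        if (denominator == 0.0) {
            return 0.0;
        }
        return 1.0 - dotProduct / denominator;
    }

    public static void main(String[] args) {
        // Values reported in the comment for cluster 33465.
        double v = 1.0000000000000002;
        double distance = cosineDistance(v, v, v);
        System.out.println("distance = " + distance); // prints "distance = 0.0"

        // Illustration only: if a downstream statistic divides one zero
        // distance by another, IEEE 754 arithmetic yields NaN.
        System.out.println("0.0 / 0.0 = " + (0.0 / 0.0)); // prints "0.0 / 0.0 = NaN"
    }
}
```

Whether the evaluator actually reaches such a 0/0 division is an assumption
here; the sketch only shows that a distance of 0 follows directly from the
reported values.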
> Cluster evaluators returning bad results
> ----------------------------------------
>
> Key: MAHOUT-1045
> URL: https://issues.apache.org/jira/browse/MAHOUT-1045
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6, 0.7, 0.8
> Environment: Several environments and data sets
> Reporter: Pat Ferrel
> Fix For: 0.8
>
> Attachments: MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real-world crawl data, the intra-cluster density from ClusterEvaluator is
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have
> also seen several cases where CDbw fails to return any results, but I have not
> yet tracked down why.
> I have sent Jeff Eastman a link to an 8G data set that reproduces these
> errors.