[ 
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415359#comment-13415359
 ] 

Pat Ferrel commented on MAHOUT-1045:
------------------------------------

The evaluator code does no error checking, so it assumes all input is valid. My 
style would be to put checks for edge conditions in the evaluators, for example 
making sure the denominator is never 0. This might hide some deeper problems, 
though. 
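
As a rough illustration of the kind of check I mean, here is a minimal sketch. 
It assumes a density of the form (mean - min) / (max - min) over the pairwise 
distances between representative points; that formula, the class name 
GuardedDensity, and the method name are assumptions for illustration, not the 
actual Mahout evaluator code.

```java
import java.util.OptionalDouble;

/** Hypothetical helper, not part of Mahout. */
final class GuardedDensity {
  private GuardedDensity() {}

  /**
   * Assumed density: (mean - min) / (max - min) over pairwise distances.
   * Returns an empty result instead of NaN when the value is undefined.
   */
  static OptionalDouble density(double[] pairwiseDistances) {
    if (pairwiseDistances.length == 0) {
      return OptionalDouble.empty(); // nothing to measure
    }
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    for (double d : pairwiseDistances) {
      min = Math.min(min, d); // Math.min/max propagate NaN
      max = Math.max(max, d);
      sum += d;
    }
    double range = max - min;
    // A zero range means every distance is equal (for example, all points
    // coincide), so the division would be 0/0. A non-finite range means
    // the input contained NaN or infinite distances.
    if (range == 0.0 || !Double.isFinite(range)) {
      return OptionalDouble.empty();
    }
    double mean = sum / pairwiseDistances.length;
    return OptionalDouble.of((mean - min) / range);
  }
}
```

Returning an empty result rather than silently substituting 0 keeps the 
deeper problem visible: the caller has to decide whether to skip the cluster 
or report it.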

I assume what you are saying about the same doc name means the same item was 
chosen five times? I strongly suspect that there will be cases where an 
identical weighted vector will have n different names, so you can't get away 
with checking for uniqueness of representative points alone; you will still 
have the problem of a singularity (borrowing a physics term) cluster. The 
clustering algorithm may even accidentally make the centroid the same as the 
rest of the points, and I suspect that would cause different problems. I think 
these cases are all fairly likely to come up in large crawls.
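
One way to catch a singularity cluster regardless of names is to compare the 
vectors by value. The sketch below is only an illustration under assumptions: 
points are plain double[] arrays rather than Mahout vectors, distance is 
Euclidean, and the names DegenerateClusters, isSingularity and epsilon are 
made up for this example.

```java
/** Hypothetical helper, not part of Mahout. */
final class DegenerateClusters {
  private DegenerateClusters() {}

  /**
   * Returns true when the cluster has fewer than two points, or when every
   * point lies within epsilon of the first point. Names are never consulted,
   * so identical vectors stored under different names are still detected.
   */
  static boolean isSingularity(double[][] points, double epsilon) {
    if (points.length < 2) {
      return true; // no pairwise distances can be formed
    }
    double[] first = points[0];
    for (int i = 1; i < points.length; i++) {
      if (distance(first, points[i]) > epsilon) {
        return false; // at least one point is meaningfully different
      }
    }
    return true;
  }

  /** Euclidean distance between two vectors of equal length. */
  static double distance(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors differ in length");
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}
```

The same check could include the centroid as one of the points, which would 
also flag the case where the centroid coincides with the rest of the cluster.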

I'm not sure what else the pruning process is for, but in this case I'd toss the 
cluster from the intra-cluster evaluation but not necessarily from the 
inter-cluster density eval (though it might break some math there too). That 
leans us towards scrapping the pruning for evaluation, because it removes the 
cluster from both calculations and maybe others too? 

If pruning is supposed to catch all undesirable conditions for all evaluations, 
it seems like a lot of coupling with the evaluation algorithms, and therefore 
fragile with respect to changes in algorithm and data conditions.

So I guess I agree with your last statement.
                
> Cluster evaluators returning bad results
> ----------------------------------------
>
>                 Key: MAHOUT-1045
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>         Environment: Several environments and data sets
>            Reporter: Pat Ferrel
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is 
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have 
> also seen several cases where CDbw fails to return any results but have not 
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff 
> Eastman.
