[ 
https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754590#action_12754590
 ] 

Grant Ingersoll commented on MAHOUT-163:
----------------------------------------

Hmm, deleting the out-of-cluster docs from the index seems pretty harsh for a 
class that is just supposed to print out labels, even if we do undelete them 
afterwards.  If an error occurred between those two steps, it could leave the 
index with those docs still marked as deleted.  We should probably generate a 
DocIdSet of the out-of-cluster docs and use it in conjunction with a 
FilterIndexReader to skip the docs that are not in the cluster.
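
To make the idea concrete, here is a rough sketch of the masking approach in 
plain Java. It does not use the Lucene API: java.util.BitSet stands in for the 
DocIdSet, and the class and method names (ClusterDocMask, outOfCluster, 
visibleDocs) are made up for illustration. The point is only that the 
out-of-cluster docs are skipped at read time and nothing is ever deleted.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Lucene or Mahout.
public class ClusterDocMask {

    /** Returns a bit set with one bit set for every doc id that is NOT in the cluster. */
    static BitSet outOfCluster(int maxDoc, int[] clusterDocIds) {
        BitSet mask = new BitSet(maxDoc);
        mask.set(0, maxDoc);              // start with every doc marked as outside
        for (int id : clusterDocIds) {
            mask.clear(id);               // docs in the cluster are not masked
        }
        return mask;
    }

    /** Lists the doc ids a masked reader would expose; the index itself is untouched. */
    static List<Integer> visibleDocs(int maxDoc, BitSet skip) {
        List<Integer> visible = new ArrayList<>();
        for (int doc = skip.nextClearBit(0); doc < maxDoc; doc = skip.nextClearBit(doc + 1)) {
            visible.add(doc);
        }
        return visible;
    }
}
```

Since the mask lives only in memory, a failure halfway through label printing 
leaves the index exactly as it was.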


> Get (better) cluster labels using Log Likelihood Ratio
> ------------------------------------------------------
>
>                 Key: MAHOUT-163
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-163
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-163.patch, mahout-163.patch, 
> mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique for identifying cluster 
> labels than taking the top features of the centroid vector. LLR finds 
> terms/phrases that are common in the cluster but rare outside it. 
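
For reference, the LLR score for one term can be sketched as below. This is 
the entropy form of Dunning's log-likelihood ratio over a 2x2 table of 
document counts, not necessarily what the attached patches do. The class name 
LlrSketch and the reading of the four counts are assumptions: k11 = docs in 
the cluster containing the term, k12 = docs in the cluster without it, k21 = 
docs outside the cluster containing it, k22 = docs outside without it.

```java
// Hypothetical sketch, not taken from the MAHOUT-163 patches.
public class LlrSketch {

    /** x * ln(x), with the usual convention that 0 * ln(0) = 0. */
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized Shannon entropy of a list of counts. */
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    /** Log-likelihood ratio for the 2x2 contingency table (k11, k12, k21, k22). */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

The score is symmetric, so a term that is rare in the cluster but common 
outside also scores high. A labeler would presumably keep only terms whose 
in-cluster frequency exceeds their out-of-cluster frequency.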

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
