[ https://issues.apache.org/jira/browse/MAHOUT-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182312#comment-13182312 ]
Ted Dunning commented on MAHOUT-939:
------------------------------------

Are these results with held-out data? Or are they what is reported by the cross fold learners?

The cross fold stuff has a known issue when there is item-to-item correlation (as with quoted text or other conversational structure). This amounts to a target leak that causes the ALR to stop learning too early and to kill regularization.

I think that the long-term solution is to give the ALR training and test sets and not worry about the cross validation. Cross validation helps with small data sets and hurts with large ones, which seems to contradict our basic mission. Also, going with this option would radically decrease the memory required.

> ASF Email Classification Examples don't always produce good results
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-939
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-939
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>             Fix For: 0.7
>
>         Attachments: MAHOUT-939.patch, MAHOUT-939.patch, MAHOUT-939.patch,
> strip_reject.patch
>
>
> The classification examples for the ASF email don't work all that well
> currently in terms of quality when it comes to more than a few labels. Also,
> need to determine how much memory is required for vectors of cardinality size
> 100K.
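
To illustrate the held-out evaluation suggested in the comment above, here is a minimal sketch of a split that keeps correlated items together. It is not Mahout code and does not reflect any patch on this issue. The function name `split_by_thread`, the `(thread_id, label, text)` record shape, and the MD5-based bucketing are all assumptions made for the example. The idea is that every message in a thread is assigned to the same side, so text quoted in a reply cannot show up in both the training and the test data.

```python
import hashlib


def split_by_thread(messages, test_fraction=0.2):
    """Split (thread_id, label, text) records into train and test lists.

    Hypothetical helper: all messages that share a thread_id land on the
    same side of the split, which avoids leaking quoted text across it.
    """
    train, test = [], []
    for thread_id, label, text in messages:
        # Hash the thread id so the assignment is deterministic and depends
        # only on the thread, never on the individual message.
        digest = hashlib.md5(thread_id.encode("utf-8")).hexdigest()
        bucket = (int(digest, 16) % 1000) / 1000.0  # value in [0.0, 0.999]
        if bucket < test_fraction:
            test.append((thread_id, label, text))
        else:
            train.append((thread_id, label, text))
    return train, test
```

The actual test share depends on how many messages each thread holds, so `test_fraction` is only approximate. Since the split is a single pass that keeps no per-fold state, it also fits the point about memory: nothing beyond the two output lists is retained.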