On Jan 4, 2012, at 5:29 PM, Ted Dunning wrote:

> Stripping quoted text is very important.

Should be easy enough to add to the MailProcessor. I'll take a look at adding it.

>
> Otherwise, you get a failure mode where the cross-validation in the
> CrossFoldLearner gives you an unrealistically optimistic view of things.
> This happens because successive documents look too much alike.
>
> The result is that performance appears to get good (to the CFL) so the
> evolutionary process clamps down on the learning rate way too soon. You
> get bad results on held out data because of this.
>
> On Wed, Jan 4, 2012 at 2:24 PM, Lance Norskog <goks...@gmail.com> wrote:
>
>> Sorry, cocoon v.s. commons.
>>
>> On Wed, Jan 4, 2012 at 2:24 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> I have a separate solution: strip the quoted text. Quoted text in the
>>> emails spams the term vectors; just plain TF-IDF is not enough to
>>> combat this. Lucene has a lot of tools besides TF-IDF.
>>>
>>> I have a patch, gotta start the JIRA. Also added more measurements to
>>> the confusion matrix. I want to get a good measurement of the
>>> performance on each producer and consumer, not just a global ratio.
>>> 'testnb' gives 80% but one of the false boxes has a 1. This is bogus.
>>> (I'm using your complete corpus of commons v.s. cocoon, classifying
>>> dev v.s. user.)
>>>
>>> On Wed, Jan 4, 2012 at 6:57 AM, Grant Ingersoll (Updated) (JIRA)
>>> <j...@apache.org> wrote:
>>>>
>>>> [https://issues.apache.org/jira/browse/MAHOUT-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>>>
>>>> Grant Ingersoll updated MAHOUT-939:
>>>> -----------------------------------
>>>>
>>>> Attachment: MAHOUT-939.patch
>>>>
>>>> Here's a start on this. Added some more construction options to the
>>>> AdaptiveLogisticRegression class. Still testing what values to use in
>>>> TrainASFEmail, but thought I would put this up for now.
>>>>
>>>>> ASF Email SGD Examples don't produce good results
>>>>> -------------------------------------------------
>>>>>
>>>>>                 Key: MAHOUT-939
>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-939
>>>>>             Project: Mahout
>>>>>          Issue Type: Bug
>>>>>    Affects Versions: 0.6
>>>>>            Reporter: Grant Ingersoll
>>>>>            Assignee: Grant Ingersoll
>>>>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>>>>>             Fix For: 0.7
>>>>>
>>>>>         Attachments: MAHOUT-939.patch
>>>>>
>>>>>
>>>>> The SGD examples for the ASF email don't work all that well currently
>>>>> in terms of quality. Also, need to determine how much memory is required
>>>>> for vectors of cardinality size 100K.
>>>>
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> If you think it was sent incorrectly, please contact your JIRA administrators:
>>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
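
For reference, the quote stripping discussed in this thread could look roughly like the sketch below. It is only an illustration of the idea, written in Python for brevity: the function name `strip_quoted_text` and both regular expressions are assumptions made here, not part of Mahout's MailProcessor or of any patch attached to MAHOUT-939. It drops lines that start with `>` and single-line "On ... wrote:" attributions, so that replies in the same thread do not share large blocks of identical text.

```python
import re

# Hypothetical helper for illustration only; not Mahout's MailProcessor API.
# Assumption: an attribution fits on one line, e.g. "On <date>, <name> wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")
# Assumption: quoted lines start with optional whitespace followed by ">".
QUOTED = re.compile(r"^\s*>")


def strip_quoted_text(body: str) -> str:
    """Return the message body without quoted lines or attribution lines."""
    kept = []
    for line in body.splitlines():
        if QUOTED.match(line) or ATTRIBUTION.match(line):
            continue  # skip text copied from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()


message = """On Jan 4, 2012, at 5:29 PM, Ted Dunning wrote:

> Stripping quoted text is very important.

Should be easy enough to add to the MailProcessor.
"""

print(strip_quoted_text(message))
```

This sketch does not handle attributions that wrap onto a second line, signatures, or the JIRA footer; a real implementation would probably need extra rules for those.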