On Jan 4, 2012, at 5:29 PM, Ted Dunning wrote:

> Stripping quoted text is very important.

Should be easy enough to add to the MailProcessor. I'll take a look at adding 
it.
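
A minimal sketch of what the stripping could look like, assuming quoted lines 
start with '>' and attribution lines have the single-line "On ... wrote:" shape. 
QuoteStripper and stripQuoted are hypothetical names for illustration, not 
existing Mahout classes; the real change would live in the MailProcessor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: drops quoted lines from a mail body.
final class QuoteStripper {
  // Assumption: any line whose first non-blank character is '>' is quoted text.
  private static final Pattern QUOTED = Pattern.compile("^\\s*>.*");
  // Assumption: attribution lines look like "On <date>, <name> wrote:" on one line.
  private static final Pattern ATTRIBUTION = Pattern.compile("^On .*wrote:\\s*$");

  private QuoteStripper() {}

  // Returns the body with quoted and attribution lines removed, trimmed.
  static String stripQuoted(String body) {
    List<String> kept = new ArrayList<>();
    for (String line : body.split("\\r?\\n")) {
      if (QUOTED.matcher(line).matches() || ATTRIBUTION.matcher(line).matches()) {
        continue; // skip text that was copied from an earlier message
      }
      kept.add(line);
    }
    return String.join("\n", kept).trim();
  }
}
```

Calling QuoteStripper.stripQuoted(body) before tokenizing would keep only the 
new text of each message. Attribution lines that wrap across two lines and 
signature blocks are not handled by this sketch.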


> 
> Otherwise, you get a failure mode where the cross-validation in the
> CrossFoldLearner gives you an unrealistically optimistic view of things.
> This happens because successive documents look too much alike.
> 
> The result is that performance appears good (to the CFL), so the
> evolutionary process clamps down on the learning rate far too soon.  You
> get bad results on held-out data because of this.
> 
> On Wed, Jan 4, 2012 at 2:24 PM, Lance Norskog <goks...@gmail.com> wrote:
> 
>> Sorry, cocoon vs. commons.
>> 
>> On Wed, Jan 4, 2012 at 2:24 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> I have a separate solution: strip the quoted text. Quoted text in the
>>> emails spams the term vectors; just plain TF-IDF is not enough to
>>> combat this. Lucene has a lot of tools besides TF-IDF.
>>> 
>>> I have a patch, gotta start the JIRA. Also added more measurements to
>>> the confusion matrix. I want to get a good measurement of the
>>> performance on each producer and consumer, not just a global ratio.
>>> 'testnb' gives 80% but one of the false boxes has a 1. This is bogus.
>>> (I'm using your complete corpus of commons vs. cocoon, classifying
>>> dev vs. user.)
>>> 
>>> On Wed, Jan 4, 2012 at 6:57 AM, Grant Ingersoll (Updated) (JIRA)
>>> <j...@apache.org> wrote:
>>>> 
>>>>    [https://issues.apache.org/jira/browse/MAHOUT-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>>> 
>>>> Grant Ingersoll updated MAHOUT-939:
>>>> -----------------------------------
>>>> 
>>>>   Attachment: MAHOUT-939.patch
>>>> 
>>>> Here's a start on this.  Added some more construction options to the
>>>> AdaptiveLogisticRegression class.  Still testing what values to use in
>>>> TrainASFEmail, but thought I would put this up for now.
>>>> 
>>>>> ASF Email SGD Examples don't produce good results
>>>>> -------------------------------------------------
>>>>> 
>>>>>                Key: MAHOUT-939
>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-939
>>>>>            Project: Mahout
>>>>>         Issue Type: Bug
>>>>>   Affects Versions: 0.6
>>>>>           Reporter: Grant Ingersoll
>>>>>           Assignee: Grant Ingersoll
>>>>>             Labels: MAHOUT_INTRO_CONTRIBUTE
>>>>>            Fix For: 0.7
>>>>> 
>>>>>        Attachments: MAHOUT-939.patch
>>>>> 
>>>>> 
>>>>> The SGD examples for the ASF email don't work all that well currently
>>>>> in terms of quality.  Also, need to determine how much memory is required
>>>>> for vectors of cardinality 100K.
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>> 
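
On the point above about measuring each class rather than a single global 
ratio: a rough sketch of per-class recall and precision computed from a 
confusion matrix. PerClassMetrics is a hypothetical class, not Mahout's 
ConfusionMatrix API, and the layout confusion[actual][predicted] is an 
assumption.

```java
// Hypothetical helper: per-class measurements from a square confusion matrix.
// Assumed layout: confusion[actual][predicted] holds document counts.
final class PerClassMetrics {
  private PerClassMetrics() {}

  // Fraction of all documents that landed on the diagonal (the global ratio).
  static double accuracy(int[][] confusion) {
    long correct = 0;
    long total = 0;
    for (int i = 0; i < confusion.length; i++) {
      for (int j = 0; j < confusion[i].length; j++) {
        total += confusion[i][j];
        if (i == j) {
          correct += confusion[i][j];
        }
      }
    }
    return total == 0 ? 0.0 : (double) correct / total;
  }

  // Recall for one class: correct predictions divided by the row total.
  static double recall(int[][] confusion, int cls) {
    long rowSum = 0;
    for (int v : confusion[cls]) {
      rowSum += v;
    }
    return rowSum == 0 ? 0.0 : (double) confusion[cls][cls] / rowSum;
  }

  // Precision for one class: correct predictions divided by the column total.
  static double precision(int[][] confusion, int cls) {
    long colSum = 0;
    for (int[] row : confusion) {
      colSum += row[cls];
    }
    return colSum == 0 ? 0.0 : (double) confusion[cls][cls] / colSum;
  }
}
```

With a made-up matrix such as {{80, 0}, {20, 0}}, where every document is 
assigned to the first class, the global accuracy looks respectable while 
recall for the second class is zero. That is the kind of gap a single 
overall number hides.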

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


