Thanks. I will look at this. I'm on a fairly active witch-hunt for bugs. My guess is that the culprit will turn out to be the configuration change for message boundaries.

On 8/12/2012 10:38 AM, John Hardin wrote:
On Sun, 12 Aug 2012, Kevin A. McGrail wrote:

On 8/12/2012 12:34 AM, John Hardin wrote:
 It's still vastly underreporting my corpora.

Is this what it is reporting?

ls -al *jhar* | grep Aug  | grep -v \~ | awk '{print $9;}' | xargs wc -l
     241 ham-bb-jhardin.log
       7 ham-bb-jhardin_fraud.log
     243 ham-net-bb-jhardin.log
       7 ham-net-bb-jhardin_fraud.log
     104 spam-bb-jhardin.log
      23 spam-bb-jhardin_fraud.log
      99 spam-net-bb-jhardin.log
      28 spam-net-bb-jhardin_fraud.log
     752 total

Close, but not exact, and the spam corpus counts in the "set 0, broken down by contributor" section differ from the counts in the "corpus quality" section.

"set 0":
    ham-bb-jhardin: 235
    ham-bb-jhardin_fraud: 1
    ham-net-bb-jhardin: 237
    ham-net-bb-jhardin_fraud: 1
    spam-bb-jhardin: 65
    spam-bb-jhardin_fraud: 17
    spam-net-bb-jhardin: 63
    spam-net-bb-jhardin_fraud: 22

"corpus quality":
    ham-bb-jhardin: 235
    ham-bb-jhardin_fraud: 1
    ham-net-bb-jhardin: 237
    ham-net-bb-jhardin_fraud: 1
    spam-bb-jhardin: 98
    spam-bb-jhardin_fraud: 17
    spam-net-bb-jhardin: 93
    spam-net-bb-jhardin_fraud: 22
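
To make the discrepancies easier to see, here is a rough sketch that lines up the three sets of numbers quoted in this thread: the `wc -l` line count of each log, the "set 0" count, and the "corpus quality" count. The function name `compare_counts` and the column layout are made up for illustration; only the numbers come from the messages above.

```shell
# Input columns: log name, wc -l line count, "set 0" count, "corpus quality" count.
# All numbers are copied from this thread; compare_counts is an illustrative name.
compare_counts() {
    awk '{ printf "%s: wc-quality=%d quality-set0=%d\n", $1, $2 - $4, $4 - $3 }'
}

compare_counts <<'EOF'
ham-bb-jhardin 241 235 235
ham-bb-jhardin_fraud 7 1 1
ham-net-bb-jhardin 243 237 237
ham-net-bb-jhardin_fraud 7 1 1
spam-bb-jhardin 104 65 98
spam-bb-jhardin_fraud 23 17 17
spam-net-bb-jhardin 99 63 93
spam-net-bb-jhardin_fraud 28 22 22
EOF
```

With these numbers, every log's line count exceeds its "corpus quality" count by exactly 6, and the only rows where "set 0" and "corpus quality" disagree are spam-bb-jhardin (33 fewer in "set 0") and spam-net-bb-jhardin (30 fewer). What the six extra lines per log are is not established in this thread.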

Here are the message counts from the master copies of my uploaded corpus mailboxes, counted by matching /^From\s/:

    fraud/corpus_ham_fraud.mbox: 25
    fraud/spam: 5628
    public/ham: 6092
    public/spam: 7197
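
For anyone who wants to reproduce counts like these, a minimal sketch follows. It assumes the files are mbox-format mailboxes in which every line starting with "From" plus whitespace is a message separator (that is, body lines are "From "-escaped). The function name `count_mbox_messages` is hypothetical, and the POSIX class `[[:space:]]` stands in for `\s`, so it only roughly mirrors the /^From\s/ pattern.

```shell
# Count lines that begin with "From" followed by whitespace.
# Assumption: each such line marks the start of one message in an mbox file.
count_mbox_messages() {
    grep -c -E '^From[[:space:]]' -- "$1"
}

# Print "path: count" for each mailbox given as an argument.
report_mbox_counts() {
    for mbox in "$@"; do
        printf '%s: %s\n' "$mbox" "$(count_mbox_messages "$mbox")"
    done
}
```

Calling `report_mbox_counts public/ham public/spam` from the corpus directory would print one line per mailbox in the same form as the list above. A body line that starts with an unescaped "From " would be counted as an extra message, so the result is only as reliable as the mailbox's escaping.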




--
Kevin A. McGrail
President

Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422

http://www.pccc.com/

703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-359-8451 (fax)
[email protected]
