On Sun, 12 Aug 2012, Kevin A. McGrail wrote:

> On 8/12/2012 12:34 AM, John Hardin wrote:
>> It's still vastly underreporting my corpora.
>
> Is this what it is reporting?
>
> ls -al *jhar* | grep Aug  | grep -v \~ | awk '{print $9;}' | xargs wc -l
>      241 ham-bb-jhardin.log
>        7 ham-bb-jhardin_fraud.log
>      243 ham-net-bb-jhardin.log
>        7 ham-net-bb-jhardin_fraud.log
>      104 spam-bb-jhardin.log
>       23 spam-bb-jhardin_fraud.log
>       99 spam-net-bb-jhardin.log
>       28 spam-net-bb-jhardin_fraud.log
>      752 total

Close, but not exact, and the spam corpus counts in the "set 0, broken down by contributor" section differ from the counts in the "corpus quality" section.

"set 0":
        ham-bb-jhardin: 235
        ham-bb-jhardin_fraud: 1
        ham-net-bb-jhardin: 237
        ham-net-bb-jhardin_fraud: 1
        spam-bb-jhardin: 65
        spam-bb-jhardin_fraud: 17
        spam-net-bb-jhardin: 63
        spam-net-bb-jhardin_fraud: 22

"corpus quality":
        ham-bb-jhardin: 235
        ham-bb-jhardin_fraud: 1
        ham-net-bb-jhardin: 237
        ham-net-bb-jhardin_fraud: 1
        spam-bb-jhardin: 98
        spam-bb-jhardin_fraud: 17
        spam-net-bb-jhardin: 93
        spam-net-bb-jhardin_fraud: 22
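
The three sets of numbers in this thread can be compared mechanically. The following is a minimal Python sketch that only re-does the arithmetic on the figures quoted above; the dictionary and variable names are illustrative. One possible reading of the constant offset it finds is that each log carries a fixed number of non-result lines, but that is an assumption, not something the thread confirms.

```python
# Line counts from the `wc -l` output quoted above.
wc_lines = {
    "ham-bb-jhardin": 241,
    "ham-bb-jhardin_fraud": 7,
    "ham-net-bb-jhardin": 243,
    "ham-net-bb-jhardin_fraud": 7,
    "spam-bb-jhardin": 104,
    "spam-bb-jhardin_fraud": 23,
    "spam-net-bb-jhardin": 99,
    "spam-net-bb-jhardin_fraud": 28,
}

# Counts from the "corpus quality" section.
corpus_quality = {
    "ham-bb-jhardin": 235,
    "ham-bb-jhardin_fraud": 1,
    "ham-net-bb-jhardin": 237,
    "ham-net-bb-jhardin_fraud": 1,
    "spam-bb-jhardin": 98,
    "spam-bb-jhardin_fraud": 17,
    "spam-net-bb-jhardin": 93,
    "spam-net-bb-jhardin_fraud": 22,
}

# Counts from the "set 0, broken down by contributor" section.
set0 = dict(corpus_quality)
set0["spam-bb-jhardin"] = 65
set0["spam-net-bb-jhardin"] = 63

# How many more lines `wc -l` sees than "corpus quality" reports, per log.
overhead = {name: wc_lines[name] - corpus_quality[name] for name in wc_lines}

# Logs where "set 0" and "corpus quality" disagree, and by how much.
set0_gap = {
    name: corpus_quality[name] - set0[name]
    for name in set0
    if corpus_quality[name] != set0[name]
}

for name in wc_lines:
    print(f"{name}: wc-l minus corpus quality = {overhead[name]}")
for name, gap in set0_gap.items():
    print(f"{name}: corpus quality minus set 0 = {gap}")
```

With these figures, every log differs from its `wc -l` count by the same amount, and only the two non-fraud spam logs differ between the two report sections.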

Here are the message counts from the master copies of my uploaded corpus mailboxes, counted by matching /^From\s/:

        fraud/corpus_ham_fraud.mbox: 25
        fraud/spam: 5628
        public/ham: 6092
        public/spam: 7197
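
Counting mbox messages by matching /^From\s/ can be sketched as below. This is a minimal Python illustration, not the command actually used for the numbers above; the function names are hypothetical. It assumes the mailboxes are in mbox format, where any body line starting with "From " has been escaped (for example as ">From "); an unescaped body line would be overcounted.

```python
import re

# Same pattern as in the message: "From" at line start followed by whitespace.
FROM_LINE = re.compile(r"^From\s")


def count_messages(lines):
    """Count lines that look like mbox message separators."""
    return sum(1 for line in lines if FROM_LINE.match(line))


def count_mbox_file(path):
    """Count separator lines in one mbox file (hypothetical helper).

    latin-1 decodes any byte sequence, so mixed-charset mail cannot
    raise a decoding error while we scan for separators.
    """
    with open(path, "r", encoding="latin-1") as fh:
        return count_messages(fh)
```

For example, `count_mbox_file("public/ham")` would return the number of separator lines in that mailbox, which can then be set against the counts in the report.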


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [email protected]    FALaholic #11174     pgpk -a [email protected]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 3 days until the 67th anniversary of the end of World War II
