On Sun, 12 Aug 2012, Kevin A. McGrail wrote:

> On 8/12/2012 12:34 AM, John Hardin wrote:
>> It's still vastly underreporting my corpora.
>
> Is this what it is reporting?
>
> ls -al *jhar* | grep Aug | grep -v \~ | awk '{print $9;}' | xargs wc -l
> 241 ham-bb-jhardin.log
> 7 ham-bb-jhardin_fraud.log
> 243 ham-net-bb-jhardin.log
> 7 ham-net-bb-jhardin_fraud.log
> 104 spam-bb-jhardin.log
> 23 spam-bb-jhardin_fraud.log
> 99 spam-net-bb-jhardin.log
> 28 spam-net-bb-jhardin_fraud.log
> 752 total

Close, but not exact. Also, the spam corpus counts in the "set 0, broken
down by contributor" section differ from the counts in the "corpus
quality" section.
"set 0":
ham-bb-jhardin: 235
ham-bb-jhardin_fraud: 1
ham-net-bb-jhardin: 237
ham-net-bb-jhardin_fraud: 1
spam-bb-jhardin: 65
spam-bb-jhardin_fraud: 17
spam-net-bb-jhardin: 63
spam-net-bb-jhardin_fraud: 22
"corpus quality":
ham-bb-jhardin: 235
ham-bb-jhardin_fraud: 1
ham-net-bb-jhardin: 237
ham-net-bb-jhardin_fraud: 1
spam-bb-jhardin: 98
spam-bb-jhardin_fraud: 17
spam-net-bb-jhardin: 93
spam-net-bb-jhardin_fraud: 22
Here are the message counts from the master copies of my uploaded corpus
mailboxes, counted by matching /^From\s/:
fraud/corpus_ham_fraud.mbox: 25
fraud/spam: 5628
public/ham: 6092
public/spam: 7197
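
For anyone who wants to reproduce counts like these, here is a minimal sketch of counting messages per mbox by matching the "From " separator line. It is not necessarily the exact command used for the numbers above. The function name count_mbox_messages is hypothetical, and the sketch assumes body lines beginning with "From " have been escaped (e.g. as ">From "), otherwise they would be counted as messages too.

```shell
#!/bin/sh
# Count messages in an mbox file by counting "From " separator lines.
# Assumption: each message starts with a line beginning "From" followed by
# whitespace, and such lines inside message bodies are escaped (">From ").
count_mbox_messages() {
    # [[:space:]] is the POSIX spelling of \s; grep -c prints the match count.
    grep -c '^From[[:space:]]' -- "$1"
}

# Print "path: count" for each mbox file named on the command line.
for mbox in "$@"; do
    printf '%s: %s\n' "$mbox" "$(count_mbox_messages "$mbox")"
done
```

Saved under a hypothetical name such as count-mbox.sh, it could be run as "sh count-mbox.sh public/ham public/spam" to print one "path: count" line per mailbox.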
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
[email protected] FALaholic #11174 pgpk -a [email protected]
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
3 days until the 67th anniversary of the end of World War II