Thanks. I will look at this. I'm on a fairly active witch-hunt for bugs.
I'm guessing it's going to have to be the configuration change for
message boundaries.
On 8/12/2012 10:38 AM, John Hardin wrote:
On Sun, 12 Aug 2012, Kevin A. McGrail wrote:
On 8/12/2012 12:34 AM, John Hardin wrote:
It's still vastly underreporting my corpora.
Is this what it is reporting?
ls -al *jhar* | grep Aug | grep -v \~ | awk '{print $9;}' | xargs wc -l
241 ham-bb-jhardin.log
7 ham-bb-jhardin_fraud.log
243 ham-net-bb-jhardin.log
7 ham-net-bb-jhardin_fraud.log
104 spam-bb-jhardin.log
23 spam-bb-jhardin_fraud.log
99 spam-net-bb-jhardin.log
28 spam-net-bb-jhardin_fraud.log
752 total
Close, but not exact, and the spam corpus counts in the "set 0,
broken down by contributor" section differ from the counts in the
"corpus quality" section.
"set 0":
ham-bb-jhardin: 235
ham-bb-jhardin_fraud: 1
ham-net-bb-jhardin: 237
ham-net-bb-jhardin_fraud: 1
spam-bb-jhardin: 65
spam-bb-jhardin_fraud: 17
spam-net-bb-jhardin: 63
spam-net-bb-jhardin_fraud: 22
"corpus quality":
ham-bb-jhardin: 235
ham-bb-jhardin_fraud: 1
ham-net-bb-jhardin: 237
ham-net-bb-jhardin_fraud: 1
spam-bb-jhardin: 98
spam-bb-jhardin_fraud: 17
spam-net-bb-jhardin: 93
spam-net-bb-jhardin_fraud: 22
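To pin down exactly which entries disagree between the two sections, the listings can be compared mechanically. What follows is only a minimal sketch, assuming each section has been saved by hand to a file of "name: count" lines; the function name diff_counts and the file names are hypothetical and not part of the mass-check tooling.

```shell
# Hypothetical helper: report names whose counts differ between two
# files of "name: count" lines (for example the "set 0" and
# "corpus quality" listings, saved by hand).
diff_counts() {
    awk -F': *' '
        NR == FNR { first[$1] = $2; next }      # remember counts from file 1
        ($1 in first) && first[$1] != $2 {      # same name, different count
            print $1 ": " first[$1] " vs " $2
        }
    ' "$1" "$2"
}
```

Called as diff_counts set0.txt quality.txt on the two listings above, this would flag only spam-bb-jhardin (65 vs 98) and spam-net-bb-jhardin (63 vs 93); the ham and fraud counts agree.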
Here are the message counts from the master copies of my uploaded
corpora mailboxes based on /^From\s/:
fraud/corpus_ham_fraud.mbox: 25
fraud/spam: 5628
public/ham: 6092
public/spam: 7197
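For anyone wanting to reproduce counts like these, here is a rough sketch of counting mbox messages by their "From " separator lines, i.e. the /^From\s/ match mentioned above. The function name count_mbox_messages is made up for illustration, and this is not necessarily the exact command used to get the numbers above.

```shell
# Hypothetical helper: count messages in an mbox file by counting
# separator lines that match /^From\s/.
count_mbox_messages() {
    # grep -c prints the number of matching lines. It exits non-zero
    # when that number is 0, so "|| true" keeps "set -e" scripts alive.
    grep -c '^From[[:space:]]' "$1" || true
}
```

For example, count_mbox_messages public/ham. Note that a body line beginning with "From " that was not escaped as ">From " when the mailbox was written would be counted as an extra message, so the result is only as good as the mailbox's quoting.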
--
Kevin A. McGrail
President
Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422
http://www.pccc.com/
703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-359-8451 (fax)
[email protected]