On Wed, 8 Aug 2012, Kevin A. McGrail wrote:
On 8/7/2012 10:14 AM, John Hardin wrote:
On Tue, 7 Aug 2012, Kevin A. McGrail wrote:
> Anyone else seeing missing corpora?
>
> Is this possibly a problem where corpora are not being included?
My uploaded corpora are not _missing_, but the number of messages reported
for them in the corpora report on the masscheck results pages is far
lower than what is being uploaded. I've started an rsync back down to verify,
and it's apparently not a matter of the upload failing. And I do filter by
date before uploading so it's not a matter of my counting ten thousand
messages from 2002.
Can you point me to the masscheck page where you are seeing the
difference?
On any masscheck report, it's listed in two places:
(1) in the "set 0, broken down by contributor" you can hover over the hits
for spam and ham for every corpus/result set and see the hits and total
messages used to calculate the percentage
(2) at the bottom if you expand the "Corpus quality" report and see a more
detailed breakdown of the corpus/results contents
Here are my corpora counts at my end (by the number of '^From\s'):
fraud/spam: 5613
fraud/ham: 0
public/spam: 7173
public/ham: 6069
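
For anyone who wants to reproduce the local count, here is a minimal sketch of counting messages in an mbox file by lines matching '^From\s'. The function name and the file path in the usage note are hypothetical, and this is only the simple line-based count described above, not how masscheck itself counts messages.

```python
import re

# Matches the mbox "From " separator at the start of a line.
# Quoted body lines such as ">From ..." do not match.
FROM_LINE = re.compile(rb"^From\s")


def count_mbox_messages(path):
    """Return the number of lines in the file that start with 'From' + whitespace."""
    count = 0
    with open(path, "rb") as fh:  # bytes mode avoids charset problems in spam
        for line in fh:
            if FROM_LINE.match(line):
                count += 1
    return count
```

Usage would be something like count_mbox_messages("public/spam.mbox"), where the path is a placeholder for wherever the corpus mailbox lives.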
Here are the numbers from the Corpus Quality report:
bb-jhardin_fraud   Spam messages   Ham messages
TOTAL:             17 (0%)         1 (0%)

bb-jhardin         Spam messages   Ham messages
TOTAL:             100 (0%)        235 (0%)
I don't know where the single message in the fraud/ham corpus is from; I
may have uploaded a single dummy and forgotten about it.
You can see the other corpora are either being counted/parsed incorrectly
or are being filtered somehow.
Strangely enough, the count for the public/spam corpus is different
between the "set 0" count and the "Corpus quality" report: 67 vs. 100.
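
To put the gap in proportion, here is a rough sketch that compares the local counts with the Corpus Quality totals quoted above. It assumes that bb-jhardin_fraud corresponds to the fraud corpus and bb-jhardin to the public corpus; that mapping is inferred from the names, not confirmed. The function and variable names are made up for illustration.

```python
# Local counts (by '^From\s') and the Corpus Quality totals, as quoted above.
# ASSUMPTION: bb-jhardin_fraud == fraud corpus, bb-jhardin == public corpus.
local_counts = {
    "fraud/spam": 5613,
    "fraud/ham": 0,
    "public/spam": 7173,
    "public/ham": 6069,
}
reported_counts = {
    "fraud/spam": 17,
    "fraud/ham": 1,
    "public/spam": 100,
    "public/ham": 235,
}


def reported_fraction(local, reported):
    """Fraction of locally counted messages that show up in the report.

    Returns None when the local count is zero, since no ratio is defined.
    """
    if local == 0:
        return None
    return reported / local


for name, local in local_counts.items():
    frac = reported_fraction(local, reported_counts[name])
    if frac is None:
        print(f"{name}: local count is 0, report shows {reported_counts[name]}")
    else:
        print(f"{name}: {reported_counts[name]} of {local} reported ({frac:.2%})")
```

Under that assumption, every non-empty corpus shows only a few percent or less of its messages in the report.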
How are the messages being counted?
I'm trying to figure that out.
Might this be related somehow to the message boundary RE config issue I
reported to you privately a few months back?
I can't see how, since you aren't uploading messages, just logs.
I'm not uploading logs, I'm uploading the message corpora for centralized
masschecks.
Can you remind me of the issue so I can respond intelligently?
When I run masschecks locally against an up-to-date repo, it is not
setting the message boundary RE properly and gets scads of uninitialized
variable errors trying to parse the corpus mailbox files. Last I looked, I
added some warn() output and it was setting the default RE properly but
then appeared to be resetting it later somewhere.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
[email protected] FALaholic #11174 pgpk -a [email protected]
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
So Microsoft's invented the ASCII equivalent to ugly ink spots that
appear on your letter when your pen is malfunctioning.
-- Greg Andrews, about Microsoft's way to encode apostrophes
-----------------------------------------------------------------------
7 days until the 67th anniversary of the end of World War II