On Wed, 8 Aug 2012, Kevin A. McGrail wrote:

On 8/7/2012 10:14 AM, John Hardin wrote:
 On Tue, 7 Aug 2012, Kevin A. McGrail wrote:

>  Anyone else seeing missing corpora?
> > Is this possibly a problem where corpora are not being included?

 My uploaded corpora are not _missing_, but the number of messages reported
 for them in the corpora report on the masscheck results pages are far
 lower than what is being uploaded. I've started rsync back down to verify
 and it's apparently not a matter of the upload failing. And I do filter by
 date before uploading so it's not a matter of my counting ten thousand
 messages from 2002.

Can you point out the masscheck page where you are seeing the difference?

On any masscheck report, it's listed in two places:

(1) in the "set 0, broken down by contributor" you can hover over the hits for spam and ham for every corpus/result set and see the hits and total messages used to calculate the percentage

(2) at the bottom if you expand the "Corpus quality" report and see a more detailed breakdown of the corpus/results contents

Here are my corpora counts at my end (by the number of '^From\s'):

fraud/spam: 5613
fraud/ham: 0
public/spam: 7173
public/ham: 6069
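
For anyone who wants to reproduce a count like the ones above, here is a minimal sketch of counting '^From\s' separator lines in an mbox file. The function names are hypothetical and this is not the masscheck code; it only mirrors the counting rule stated above.

```python
import re

# Same rule as the counts above: a line that starts with "From" followed by
# whitespace marks the start of a message. "From:" headers and ">From"
# escaped body lines do not match.
FROM_LINE = re.compile(rb"^From\s")


def count_messages(lines):
    """Count mbox separator lines in an iterable of byte strings."""
    return sum(1 for line in lines if FROM_LINE.match(line))


def count_mbox_file(path):
    """Count messages in one mbox file (hypothetical helper)."""
    # Binary mode avoids decode errors on spam with odd character sets.
    with open(path, "rb") as fh:
        return count_messages(fh)
```

Calling count_mbox_file() on each corpus file and comparing the result with the TOTAL shown in the Corpus quality report would show how large the gap is per corpus.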

Here are the numbers from the Corpus Quality report:

bb-jhardin_fraud Spam messages    Ham messages
  TOTAL:              17   (0%)   1   (0%)

bb-jhardin       Spam messages    Ham messages
  TOTAL:             100   (0%)   235   (0%)

I don't know where the single message in the fraud/ham corpus is from; I may have uploaded a single dummy and forgotten about it.

You can see the other corpora are either being counted/parsed incorrectly or are being filtered somehow.

Strangely enough, the count for the public/spam corpus is different
between the "set 0" count and the "Corpus quality" report: 67 vs. 100.

 How are the messages being counted?

I'm trying to figure that out.

 Might this be related somehow to the message boundary RE config issue I
 reported to you privately a few months back?

I can't see how, since you aren't uploading messages, just logs.

I'm not uploading logs, I'm uploading the message corpora for centralized masschecks.

Can you remind me of the issue so I can respond intelligently?

When I run masschecks locally against an up-to-date repo, it is not setting the message boundary RE properly and gets scads of uninitialized variable errors trying to parse the corpus mailbox files. Last I looked, I added some warn() output and it was setting the default RE properly but then appeared to be resetting it later somewhere.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [email protected]    FALaholic #11174     pgpk -a [email protected]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  So Microsoft's invented the ASCII equivalent to ugly ink spots that
  appear on your letter when your pen is malfunctioning.
         -- Greg Andrews, about Microsoft's way to encode apostrophes
-----------------------------------------------------------------------
 7 days until the 67th anniversary of the end of World War II
