On 8/8/2012 10:17 AM, John Hardin wrote:
On Wed, 8 Aug 2012, Kevin A. McGrail wrote:

On 8/7/2012 10:14 AM, John Hardin wrote:
 On Tue, 7 Aug 2012, Kevin A. McGrail wrote:

>  Anyone else seeing missing corpora?
> >  Is this possibly a problem where corpora are not being included?

My uploaded corpora are not _missing_, but the number of messages reported
 for them in the corpora report on the masscheck results pages are far
lower than what is being uploaded. I've started rsyncing them back down to verify, and it's apparently not a matter of the upload failing. And I do filter by
 date before uploading so it's not a matter of my counting ten thousand
 messages from 2002.

Can you point me to the masscheck page where you are seeing the difference?

On any masscheck report, it's listed in two places:

(1) in the "set 0, broken down by contributor" you can hover over the hits for spam and ham for every corpus/result set and see the hits and total messages used to calculate the percentage

(2) at the bottom, if you expand the "Corpus quality" report, you can see a more detailed breakdown of the corpus/results contents

Here are my corpora counts at my end (by the number of '^From\s'):

fraud/spam: 5613
fraud/ham: 0
public/spam: 7173
public/ham: 6069
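
For anyone wanting to reproduce a count like the one above, here is a minimal sketch in Python. It assumes, as the '^From\s' pattern implies, that every line starting with "From" followed by whitespace begins a new message. The function names and the latin-1 decoding are illustrative choices only, not part of the masscheck tooling.

```python
import re

# Assumption: a message starts at any line matching '^From\s'
# (the mbox "From " separator line). Header lines such as "From: ..."
# and quoted ">From " body lines do not match this pattern.
FROM_LINE = re.compile(r"^From\s")


def count_messages(lines):
    """Count the lines that look like mbox message separators."""
    return sum(1 for line in lines if FROM_LINE.match(line))


def count_mbox_file(path):
    """Count messages in one mbox file (path is a hypothetical example)."""
    # latin-1 maps every byte to a character, so odd encodings in
    # spam samples cannot raise a decode error while counting.
    with open(path, "r", encoding="latin-1") as fh:
        return count_messages(fh)
```

Note that a body line that begins with an unescaped "From " would be counted as a separator, so the result is only as reliable as the mbox escaping.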

Here are the numbers from the Corpus Quality report:

bb-jhardin_fraud Spam messages    Ham messages
  TOTAL:              17   (0%)   1   (0%)

bb-jhardin       Spam messages    Ham messages
  TOTAL:             100   (0%)   235   (0%)

I don't know where the single message in the fraud/ham corpus is from; I may have uploaded a single dummy and forgotten about it.

You can see the other corpora are either being counted/parsed incorrectly or are being filtered somehow.

Strangely enough, the count for the public/spam corpus is different
between the "set 0" count and the "Corpus quality" report: 67 vs. 100.

Thanks. Can you confirm the exact URL you are visiting for this report? I want to remove all assumptions from the mix.

I'm not uploading logs, I'm uploading the message corpora for centralized masschecks.

Are you sure? Are you uploading anything other than the logs?

I show masscheck logs like these because you aren't actually uploading the emails (which is correct, I believe):

-rw-r--r--   1 rsync    rsync     391675 Aug  8 09:15 ham-bb-jhardin.log
-rw-r--r--   1 rsync    rsync     391679 Aug  7 09:16 ham-bb-jhardin.log~
-rw-r--r--   1 rsync    rsync       1145 Aug  8 09:17 ham-bb-jhardin_fraud.log
-rw-r--r--   1 rsync    rsync       1145 Aug  7 09:19 ham-bb-jhardin_fraud.log~
-rw-r--r--   1 rsync    rsync     419449 Aug  4 09:07 ham-net-bb-jhardin.log
-rw-r--r--   1 rsync    rsync     420618 Jul 28 09:06 ham-net-bb-jhardin.log~
-rw-r--r--   1 rsync    rsync       1220 Aug  4 09:09 ham-net-bb-jhardin_fraud.log
-rw-r--r--   1 rsync    rsync       1220 Jul 28 09:08 ham-net-bb-jhardin_fraud.log~
-rw-r--r--   1 rsync    root     4639820 Oct  1  2009 ham-rescore-bb-jhardin.log
-rw-r--r--   1 rsync    rsync     222982 Aug  8 09:15 spam-bb-jhardin.log
-rw-r--r--   1 rsync    rsync     226858 Aug  7 09:16 spam-bb-jhardin.log~
-rw-r--r--   1 rsync    rsync      67181 Aug  8 09:17 spam-bb-jhardin_fraud.log
-rw-r--r--   1 rsync    rsync      67181 Aug  7 09:19 spam-bb-jhardin_fraud.log~
-rw-r--r--   1 rsync    rsync     226058 Aug  4 09:07 spam-net-bb-jhardin.log
-rw-r--r--   1 rsync    rsync     232278 Jul 28 09:06 spam-net-bb-jhardin.log~
-rw-r--r--   1 rsync    rsync      37934 Aug  4 09:09 spam-net-bb-jhardin_fraud.log
-rw-r--r--   1 rsync    rsync      25983 Jul 28 09:08 spam-net-bb-jhardin_fraud.log~
-rw-r--r--   1 rsync    root     2491637 Oct  1  2009 spam-rescore-bb-jhardin.log



Can you remind me of the issue so I can respond intelligently?

When I run masschecks locally against an up-to-date repo, it is not setting the message boundary RE properly and gets scads of uninitialized variable errors trying to parse the corpus mailbox files. Last I looked, I added some warn() output and it was setting the default RE properly but then appeared to be resetting it later somewhere.

Sorry about that. I've reopened the bug. I believe I thought that was resolved by the conf changes Mark Martinec made so I dropped it.

Regards,
KAM
