On 8/8/2012 10:17 AM, John Hardin wrote:
On Wed, 8 Aug 2012, Kevin A. McGrail wrote:
On 8/7/2012 10:14 AM, John Hardin wrote:
On Tue, 7 Aug 2012, Kevin A. McGrail wrote:
> Anyone else seeing missing corpora?
> > Is this possibly a problem where corpora are not being included?
My uploaded corpora are not _missing_, but the number of messages reported
for them in the corpora report on the masscheck results pages is far
lower than what is being uploaded. I've started an rsync back down to verify,
and it's apparently not a matter of the upload failing. And I do filter by
date before uploading, so it's not a matter of my counting ten thousand
messages from 2002.
Can you point out the masscheck page where you are seeing the
difference?
On any masscheck report, it's listed in two places:
(1) in the "set 0, broken down by contributor" you can hover over the
hits for spam and ham for every corpus/result set and see the hits and
total messages used to calculate the percentage
(2) at the bottom, if you expand the "Corpus quality" report, you can see a
more detailed breakdown of the corpus/results contents
Here are my corpora counts at my end (by the number of '^From\s'):
fraud/spam: 5613
fraud/ham: 0
public/spam: 7173
public/ham: 6069
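
For anyone who wants to reproduce a count like this, here is a minimal sketch in Python. It assumes the corpus is stored as plain mbox files and simply counts lines matching '^From\s', as described above. The function name count_messages is hypothetical and is not part of the masscheck tooling.

```python
import re

# Same boundary test as the manual count: a line starting with "From"
# followed by whitespace. "From:" headers and ">From " escaped body
# lines do not match.
FROM_LINE = re.compile(rb"^From\s")


def count_messages(mbox_path):
    """Return the number of mbox separator lines in the file at mbox_path."""
    count = 0
    # Read as bytes so odd encodings in spam do not raise decode errors.
    with open(mbox_path, "rb") as fh:
        for line in fh:
            if FROM_LINE.match(line):
                count += 1
    return count
```

Note that an unescaped "From " at the start of a body line would be counted as a boundary, so the result is only as accurate as the mbox escaping.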
Here are the numbers from the Corpus Quality report:
bb-jhardin_fraud   Spam messages   Ham messages
TOTAL:             17 (0%)         1 (0%)

bb-jhardin         Spam messages   Ham messages
TOTAL:             100 (0%)        235 (0%)
I don't know where the single message in the fraud/ham corpus is from;
I may have uploaded a single dummy and forgotten about it.
You can see the other corpora are either being counted/parsed
incorrectly or are being filtered somehow.
Strangely enough, the count for the public/spam corpus is different
between the "set 0" count and the "Corpus quality" report: 67 vs. 100.
Thanks. Can you confirm the exact URL you are visiting for this
report? I want to remove all assumptions from the mix.
I'm not uploading logs, I'm uploading the message corpora for
centralized masschecks.
Are you sure? Are you uploading anything other than the logs?
I show masscheck logs like these because you aren't actually uploading
the emails (which is correct, I believe):
-rw-r--r-- 1 rsync rsync 391675 Aug 8 09:15 ham-bb-jhardin.log
-rw-r--r-- 1 rsync rsync 391679 Aug 7 09:16 ham-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync 1145 Aug 8 09:17 ham-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync 1145 Aug 7 09:19 ham-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync rsync 419449 Aug 4 09:07 ham-net-bb-jhardin.log
-rw-r--r-- 1 rsync rsync 420618 Jul 28 09:06 ham-net-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync 1220 Aug 4 09:09 ham-net-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync 1220 Jul 28 09:08 ham-net-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync root 4639820 Oct 1 2009 ham-rescore-bb-jhardin.log
-rw-r--r-- 1 rsync rsync 222982 Aug 8 09:15 spam-bb-jhardin.log
-rw-r--r-- 1 rsync rsync 226858 Aug 7 09:16 spam-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync 67181 Aug 8 09:17 spam-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync 67181 Aug 7 09:19 spam-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync rsync 226058 Aug 4 09:07 spam-net-bb-jhardin.log
-rw-r--r-- 1 rsync rsync 232278 Jul 28 09:06 spam-net-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync 37934 Aug 4 09:09 spam-net-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync 25983 Jul 28 09:08 spam-net-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync root 2491637 Oct 1 2009 spam-rescore-bb-jhardin.log
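
One way to compare these logs against the numbers on the report pages is to count the result lines in each log. The Python sketch below is only an illustration under an assumption that has not been checked against these files: that each scanned message produces exactly one non-blank line in the log and that comment lines begin with '#'. The function name count_log_results is hypothetical.

```python
def count_log_results(log_path):
    """Count non-blank lines in log_path that do not start with '#'.

    Assumption: one such line per scanned message. Verify this against
    a real log before trusting the number.
    """
    count = 0
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the count.
    with open(log_path, "r", errors="replace") as fh:
        for line in fh:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

If the assumption holds, calling count_log_results("spam-bb-jhardin.log") should give a figure that can be compared directly with the spam total shown for bb-jhardin in the report.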
Can you remind me of the issue so I can respond intelligently?
When I run masschecks locally against an up-to-date repo, it is not
setting the message boundary RE properly and gets scads of
uninitialized variable errors trying to parse the corpus mailbox
files. Last I looked, I added some warn() output and it was setting
the default RE properly but then appeared to be resetting it later
somewhere.
Sorry about that. I've reopened the bug. I believe I thought that was
resolved by the conf changes Mark Martinec made so I dropped it.
Regards,
KAM