On 12/11/2012 09:43 AM, Marc Andre Selig wrote:
On Mon, Dec 10, 2012 at 09:47:37PM -0500, Kevin A. McGrail wrote:
HAM: 120245 (150000 required)
SPAM: 118080 (150000 required)
Insufficient ham corpus to generate scores; aborting.
Exit Status 8 is not zero for do-nightly-rescore-example
Same issue as before with the ham and spam counts unfortunately!
Could you help me understand?
I used to think that the problem was that many masscheck
submitters don't clean out old messages (spam must be younger
than 6 months and ham younger than 18 months according to
<https://wiki.apache.org/spamassassin/CorpusCleaning>), so the numbers
reported by ruleqa.spamassassin.org overestimate the number of messages
available. However, this does not seem to be the only explanation.
For example,
<http://ruleqa.spamassassin.org/20121210-r1419267-n/HK_RANDOM_FROM/detail?s_corpus=1>
shows 286388 spam messages in corpus axb-foo from month 2012-11.
This alone is much more than the minimum number required. (I hope they
are messages collected from many different recipients so as not to bias
things, but that's a different matter.)
So is the problem that axb's messages are reported too late?
I don't think so. I aborted all my mass-checks, and others weren't
finished within the time frame.
In that case, and if the premise holds that overaged messages are not
to be used, it might help for axb to simply delete messages that are
too old anyway, just so that mass-check can finish earlier.
Too old? My spam corpus isn't older than 90 days. I just have too much of
the crap.
I understand that the alternative approach of having mass-check
verify that the age of a message is acceptable before actually
processing it would be a lot of work as mass-check currently uses
Mail::SpamAssassin::parse to find out the age of the message.
For example, I run
find /data/archive/generic -type f -mtime +14 -exec rm -f {} \;
find /data/archive/fraud -type f -mtime +90 -exec rm -f {} \;
BEFORE a mass-check.
The foo corpus is not older than 30 days.
deletmail purges everything older than 30 days.
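
The purge step above can be wrapped in a small helper. This is only a
sketch: the function name purge_older_than and the --dry-run argument are
made up for illustration, and it goes by file mtime, which is not
necessarily the age mass-check derives from the message itself.

```shell
#!/bin/sh
# Hypothetical helper: purge_older_than DIR DAYS [--dry-run]
# Deletes regular files under DIR whose mtime is more than DAYS days old.
# Note: find's "-mtime +N" matches files at least N+1 full days old.
purge_older_than() {
    dir=$1
    days=$2
    mode=${3:-delete}
    if [ "$mode" = "--dry-run" ]; then
        # Only list what would be removed.
        find "$dir" -type f -mtime +"$days" -print
    else
        find "$dir" -type f -mtime +"$days" -exec rm -f {} +
    fi
}
```

For example, "purge_older_than /data/archive/generic 14 --dry-run" lists
the candidates first, and the same call without --dry-run removes them.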
A simpler option would be to modify the auto-mass-check.sh script to
use incremental uploads, instead of uploading all log files after all
corpora have been checked. To that end, it should suffice to add
the -t flag to rsync (so that files are not transferred twice) and add
invocations of upload_results to ~/.auto-mass-check.cf.
Logs aren't cumulative/incremental. They're rewritten on every
mass-check run.
Or do you mean something else?
Axb