On Mon, Dec 10, 2012 at 09:47:37PM -0500, Kevin A. McGrail wrote:
>  HAM: 120245 (150000 required)
> SPAM: 118080 (150000 required)
> Insufficient ham corpus to generate scores; aborting.
> Exit Status 8 is not zero for do-nightly-rescore-example
> 
> Same issue as before with the ham and spam counts unfortunately!

Could you help me understand?

I used to think that the problem was that many mass-check
submitters don't clean out old messages (spam must be younger
than 6 months and ham younger than 18 months according to
<https://wiki.apache.org/spamassassin/CorpusCleaning>), so the numbers
reported by ruleqa.spamassassin.org overestimate the number of messages
that are actually usable.  However, this does not seem to be the only
explanation.

For example,
<http://ruleqa.spamassassin.org/20121210-r1419267-n/HK_RANDOM_FROM/detail?s_corpus=1>
shows 286388 spam messages in corpus axb-foo from month 2012-11.
This alone is much more than the minimum number required.  (I hope they
are messages collected from many different recipients so as not to bias
things, but that's a different matter.)

So is the problem that axb's messages are reported too late?

In that case, and if the premise holds that messages past the age limit
are not used, it might help for axb to simply delete the messages that
are too old anyway, just so that mass-check can finish earlier.

I understand that the alternative approach of having mass-check
verify that the age of a message is acceptable before actually
processing it would be a lot of work as mass-check currently uses
Mail::SpamAssassin::parse to find out the age of the message.
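
To illustrate the kind of cheap pre-check I have in mind, here is a
rough Python sketch (not Perl, and not how mass-check works today).
It reads only the header block and compares the Date header against
the age limits.  The function names, the choice of the Date header,
and the day counts standing in for 6 and 18 months are all my own
assumptions; mass-check may well derive the age differently, and a
Date header in spam can of course be forged.

```python
from datetime import datetime, timedelta, timezone
from email.parser import HeaderParser
from email.utils import parsedate_to_datetime

# Assumed approximations of the CorpusCleaning limits (6 and 18 months).
MAX_AGE = {"spam": timedelta(days=183), "ham": timedelta(days=548)}


def message_date(raw_headers):
    """Return the Date header as an aware datetime, or None if unusable."""
    value = HeaderParser().parsestr(raw_headers).get("Date")
    if value is None:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable date: treat the message as having no usable date.
        return None
    if parsed.tzinfo is None:
        # A "-0000" zone yields a naive datetime; interpret it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_young_enough(raw_headers, kind, now):
    """True if the message is within the age limit for 'spam' or 'ham'."""
    date = message_date(raw_headers)
    if date is None:
        return False
    return now - date <= MAX_AGE[kind]
```

With something like this, a message dated 2012-03-01 would still pass
as ham on 2012-12-11 but be skipped as spam, without running the full
parser on it.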

A simpler option would be to modify the auto-mass-check.sh script to
upload incrementally, instead of uploading all log files only after all
corpora have been checked.  To that end, it should suffice to add
the -t flag to rsync (so that files are not transferred twice) and add
invocations of upload_results to ~/.auto-mass-check.cf.
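
The control flow I am suggesting looks roughly like the Python sketch
below.  It is only a model of the idea, not the actual shell script:
check_all, check_corpus, rsync_command and the destination argument
are hypothetical, and upload_results here is merely a stand-in named
after the script's function.  The one thing it relies on is that
rsync -t preserves modification times, so that rsync's default quick
check (same size and mtime) can skip logs uploaded in an earlier round.

```python
import subprocess


def rsync_command(log_files, destination):
    # -t preserves modification times, so unchanged logs that were
    # already uploaded are skipped by rsync's size+mtime quick check.
    return ["rsync", "-t", *log_files, destination]


def upload_results(log_files, destination, run=subprocess.run):
    # Hypothetical stand-in for the script's upload step.
    return run(rsync_command(log_files, destination), check=True)


def check_all(corpora, check_corpus, destination, run=subprocess.run):
    """Check each corpus in turn and upload after every one of them."""
    finished = []
    for corpus in corpora:
        # check_corpus is assumed to return the path of the log it wrote.
        finished.append(check_corpus(corpus))
        upload_results(finished, destination, run=run)
    return finished
```

Each round passes all finished logs to rsync again, but only the
newest one should actually be transferred.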

(However, <http://rsync.spamassassin.org> shows that spam-axb-foo.log
has a timestamp of 12:05, just 3 hours 15 minutes after
nightly-versions.txt.  As my version of auto-mass-check.sh does not use
the -t or -a options with rsync, this seems to suggest that axb already
uses some such modification, in which case I still don't understand
where exactly the problem is.)

Thanks for any insight. ;)

Regards
Marc
