On Mon, Dec 10, 2012 at 09:47:37PM -0500, Kevin A. McGrail wrote:
> HAM: 120245 (150000 required)
> SPAM: 118080 (150000 required)
> Insufficient ham corpus to generate scores; aborting.
> Exit Status 8 is not zero for do-nightly-rescore-example
>
> Same issue as before with the ham and spam counts unfortunately!

Could you help me understand? I used to think that the problem was that many masscheck submitters don't clean out old messages (spam must be younger than 6 months and ham younger than 18 months according to <https://wiki.apache.org/spamassassin/CorpusCleaning>), so the numbers reported by ruleqa.spamassassin.org overestimate the number of messages available.

However, this does not seem to be the only explanation. For example, <http://ruleqa.spamassassin.org/20121210-r1419267-n/HK_RANDOM_FROM/detail?s_corpus=1> shows 286388 spam messages in corpus axb-foo from month 2012-11. This alone is far more than the minimum number required. (I hope they are messages collected from many different recipients so as not to bias things, but that's a different matter.)

So is the problem that axb's results are reported too late? In that case, and if the premise holds that over-age messages are not to be used, it might help for axb to simply delete messages that are too old anyway, just so that mass-check can finish earlier. I understand that the alternative approach, having mass-check verify that the age of a message is acceptable before actually processing it, would be a lot of work, as mass-check currently uses Mail::SpamAssassin::parse to find out the age of the message.

A simpler option would be to modify the auto-mass-check.sh script to use incremental uploads, instead of uploading all log files only after all corpora have been checked. To that end, it should suffice to add the -t flag to rsync (so that modification times are preserved and unchanged files are not transferred twice) and add invocations of upload_results to ~/.auto-mass-check.cf.

(However, <http://rsync.spamassassin.org> shows that spam-axb-foo.log has a timestamp of 12:05, just 3 hours 15 minutes after nightly-versions.txt. As my version of auto-mass-check.sh does not use the -t or -a options with rsync, this seems to suggest that axb already uses some such modification, in which case I still don't understand where exactly the problem is.)

Thanks for any insight. ;)

Regards
Marc
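
P.S. To make the age rule above concrete, here is a minimal Python sketch of the check I have in mind. It is not how mass-check does it: the function names are hypothetical, it reads only the Date header (which spam often forges, so a real check would probably need the Received headers), and counting "months" as whole calendar months is my assumption about what the wiki page means.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Limits taken from the CorpusCleaning wiki page: spam must be younger
# than 6 months, ham younger than 18 months.
MAX_AGE_MONTHS = {"spam": 6, "ham": 18}


def months_between(earlier, later):
    """Whole calendar months elapsed from `earlier` to `later` (assumed definition)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        # The last month is not complete yet.
        months -= 1
    return months


def is_young_enough(date_header, kind, now):
    """Return True if a message with this Date header is still usable.

    `kind` is "spam" or "ham"; `now` is a timezone-aware datetime.
    Hypothetical helper, not part of mass-check.
    """
    sent = parsedate_to_datetime(date_header).astimezone(timezone.utc)
    return months_between(sent, now) < MAX_AGE_MONTHS[kind]


if __name__ == "__main__":
    now = datetime(2012, 12, 10, tzinfo=timezone.utc)
    for kind, header in [
        ("spam", "Mon, 12 Nov 2012 10:00:00 +0000"),
        ("spam", "Fri, 01 Jun 2012 10:00:00 +0000"),
        ("ham", "Fri, 01 Jun 2012 10:00:00 +0000"),
    ]:
        print(kind, header, is_young_enough(header, kind, now))
```

A submitter could run something along these lines over a corpus before mass-check starts, so that messages which will be discarded anyway are never scanned.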
