On Wed, Jul 06, 2005 at 07:48:36AM -0700, Peter Fritz wrote:
> Wondering if a reuse percentage would be useful? Maybe have a
> parameter that specifies reuse should only be used n% of the time,
> e.g. "--reuse 0.90" would cause 90% of messages with X-Spam-Status
> headers to be reused, while a random 10% would have full net checks
> run. A value of 1.0 would be the default. Would be interesting to
> see how rescanning a percentage of messages impacts final score
> generation, the idea being that some messages that slipped under the
> radar initially may hit more net rules. I recognise that we still
> want to score based on what actually hit at the time, so this may
> not offer much. Will have to do some testing. Trying to find a
> balance between recycling information already available, mass-check
> network load, and ideal scoring.
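In other words, a per-message coin flip. A minimal sketch of what the
proposed flag would do (reuse_fraction is hypothetical -- mass-check
has no such knob today):

    import random

    def should_reuse(reuse_fraction):
        """Trust the stored X-Spam-Status result for this message?

        reuse_fraction=1.0 matches the current all-or-nothing --reuse;
        0.90 would put a random 10% of messages through full net
        checks again.
        """
        return random.random() < reuse_fraction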
Interesting idea, but the point is to avoid hindsight: messages
scanned now may hit blocklists that they wouldn't have hit at the
time. Generally, if we're not reusing all messages it's because we
can't, and in that case we can't reuse any of them. So I don't think
this would be useful.

> Secondly, for a larger corpus, wondering if there is much difference
> between perceptron scores for the last 6 months and the last 1
> month. If you already have the ham/spam.log for the last 6 months,
> complete with the "time" field, how much do the perceptron scores
> differ between the two windows? The thinking behind this is to move
> towards more regular rule score updates (at least locally), based on
> the current flavour of spam. It may be a self-defeating exercise
> though, if spam and scores are both moving targets.

You're welcome to try this once the mass-checks are submitted; I'd be
interested in your results. Splitting the logs on the "time" field is
straightforward -- see the first sketch at the end of this mail.

> Some observations about mass-checks. Not sure if the instructions
> for CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are
> as applicable to mass-check --reuse runs as to full runs. My
> understanding with --reuse is that if a network rule previously hit
> on a message, it will be listed in the rules hit (spam.log) during
> mass-checks but won't contribute to the recorded score of the
> message. Hence, messages may have hit many network tests, but appear
> in the spam.log with a low score, and therefore float to the top
> when reviewing low-scoring spam, even though the original spam got a
> high score because of network tests. Makes it hard to find false
> positives in the noise. One solution to this would be to have the
> reuse flag record a more accurate score in ham/spam.log for network
> tests, rather than zeroing them out, but I don't have a robust way
> of doing this yet.
>
> http://wiki.apache.org/spamassassin/CorpusCleaning

--reuse is a dirty hack, as much as Dan might claim otherwise. :-)
That actually isn't a problem I had thought of (more obvious ones
come to mind). A rough review-time workaround is the second sketch at
the end of this mail.

> Finally, some observations from some limited mass-checks locally.
> Running with 3.1.0-pre2, I end up with a badrules file of 4153
> lines, which seems quite a lot. Also, my perceptron.scores file does
> not appear to contain scores for BAYES_* rules, despite their being
> listed in freqs. I suspect I need to modify a mutable flag or
> similar somewhere (tmp/rules.pl?), but just wondering why they don't
> get rescored by default? In practice my hit/miss rate with SA is
> very good, but the generated scores seem to be quite poor (probably
> need to double-check my corpus too). Info from freqs and
> perceptron.scores below.

Make sure you're generating scoreset 3 results; BAYES_* rules only
get non-zero scores in the Bayes-enabled scoresets (2 and 3).
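Two sketches, as promised. First, splitting a mass-check log into a
recent slice by its "time" field, so the perceptron can be rerun on
the last month alone. Untested against real logs; it assumes each log
line carries a "time=<epoch>" token and that "#" lines are headers
worth keeping -- check what your mass-check actually wrote.

    import re
    import sys
    import time

    CUTOFF = time.time() - 30 * 86400   # keep roughly the last month

    def recent(line, cutoff=CUTOFF):
        # Look for a mass-check "time=<epoch>" token on the line.
        m = re.search(r"\btime=(\d+)", line)
        return m is not None and int(m.group(1)) >= cutoff

    with open(sys.argv[1]) as src, open("spam-1month.log", "w") as dst:
        for line in src:
            if line.startswith("#") or recent(line):
                dst.write(line)

Run the score generation once on the full log and once on the
filtered log, then diff the resulting score files.

Second, the review-time workaround for --reuse: re-add the current
scores of the network rules that hit, so reused spam that originally
scored high on net tests no longer sorts to the bottom of your
review. The log layout ("Y <score> <path> <rule,rule,...> ..."), the
score-file layout ("score RULE_NAME n [n n n]", first value taken),
and the name-prefix test for network rules are all assumptions --
adjust to your setup.

    import sys

    # Crude heuristic for spotting network rules by name prefix.
    NET_PREFIXES = ("RCVD_IN_", "URIBL_", "DNS_", "RAZOR2_",
                    "PYZOR_", "DCC_")

    def load_scores(path):
        scores = {}
        for line in open(path):
            parts = line.split()
            if len(parts) >= 3 and parts[0] == "score":
                try:
                    scores[parts[1]] = float(parts[2])
                except ValueError:
                    pass
        return scores

    def adjusted_score(logline, scores):
        fields = logline.split()
        orig = float(fields[1])
        if len(fields) < 4:
            return orig
        rules = fields[3].split(",")
        bump = sum(scores.get(r, 0.0) for r in rules
                   if r.startswith(NET_PREFIXES))
        return orig + bump

    scores = load_scores(sys.argv[1])     # e.g. a 50_scores.cf
    for line in open(sys.argv[2]):        # e.g. spam.log
        if line.startswith(("Y", ".")):
            print("%6.1f  %s" % (adjusted_score(line, scores),
                                 line.rstrip()))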
-- 
Duncan Findlay