On Sun, 26 Feb 2017 19:35:11 +1000 James Birkett wrote:

> Hi,
>
> Bug 6108 says is that the pyzor plugin ignores whitelisting entirely,
> and a comment on that bug suggests updating the Pyzor plugin to use
> Wilson Score formula described here
> http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

SpamAssassin allows easy reporting of whole mailboxes of spam, but to
report ham you have to go to the website and type in captchas (I think
there are a few trusted admins that can bypass this). There's liable to
be a substantial bias in reporting, and the bias is likely to vary
strongly from hash to hash. I don't know for sure, but I suspect that a
lot of whitelisting counts may be historic.

Emails where the body contains only links or addresses get predigested
down to a null string and all have the same hash, with a 26.6:1
reporting ratio (a proportion of about 0.964). There are numerous
variants with additional boilerplate text. To eliminate this class of
FP you need a threshold of at least 0.97.

All in all I don't think it makes sense to use anything other than very
high thresholds. If you want to pursue this I would suggest using a
threshold of at least 0.99 and varying the z parameter instead.

> I have tested this using a spreadsheet and our own corpus of spam and
> ham email and got good results,

I think a lot of the benefit you are seeing comes from the fact that
you are rescanning corpus mail rather than using the pyzor counts at
delivery time.

Using Wilson doesn't just affect whitelisted emails; it also sets the
minimum number of reports needed for a pyzor hit. For your three rules
that rises from 5 to 381, 22 and 12 (assuming that the very low z value
in the first rule is an error). In production you will lose pyzor hits
and, for new hashes, the most valuable early hits.

I think pyzor's own defaults are sensible - just ignore anything that's
whitelisted.

> I wanted to make it possible to have multiple spamassassin rules with
> different wilson score parameters giving different spamassassin
> scores, but obviously we don't want to query pyzor multiple times on
> the same mail so I've changed the actual pyzor lookup to be done
> during extract_metadata() instead
> of when the eval-rule is run.

The usual way this would be done is to have both functions look for cached counts and run pyzor if they aren't there - see the BAYES_* rules for an example.
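
To illustrate the cached-lookup pattern, here is a minimal sketch in Python rather than the plugin's Perl. All the names (`Message`, `get_counts`, `rule_min_reports`, the `lookup` callable) are made up for the example and are not SpamAssassin or Pyzor APIs. The point is only that whichever rule runs first pays for the query and the others reuse the result.

```python
from typing import Callable, Dict, Tuple

# (report count, whitelist count) for one message digest
Counts = Tuple[int, int]


class Message:
    """Hypothetical per-message state shared by all rules."""

    def __init__(self, digest: str) -> None:
        self.digest = digest
        self.cache: Dict[str, Counts] = {}


def get_counts(msg: Message, lookup: Callable[[str], Counts]) -> Counts:
    """Return cached counts, running the (expensive) lookup only once."""
    if "pyzor" not in msg.cache:
        msg.cache["pyzor"] = lookup(msg.digest)
    return msg.cache["pyzor"]


def rule_min_reports(msg: Message, lookup: Callable[[str], Counts],
                     min_reports: int) -> bool:
    """Hypothetical rule: hit if not whitelisted and reported often enough."""
    reports, whitelisted = get_counts(msg, lookup)
    return whitelisted == 0 and reports >= min_reports
```

Several rules with different parameters can then call `rule_min_reports` on the same message, and if no rule runs, no query is made at all.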

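
Returning to the minimum report counts mentioned earlier, a rough sketch of the arithmetic follows, using the Wilson lower bound from the linked article. The function names are made up, and the parameters (z = 1.96 with thresholds 0.99, 0.85 and 0.75) are illustrative assumptions; the actual rule parameters are not shown in this message. With no whitelistings the bound reduces to n / (n + z^2), so a higher threshold quickly demands many more reports.

```python
import math


def wilson_lower_bound(reports: int, whitelists: int, z: float) -> float:
    """Lower bound of the Wilson score interval for the spam proportion."""
    n = reports + whitelists
    if n == 0:
        return 0.0
    p = reports / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


def min_reports(threshold: float, z: float) -> int:
    """Smallest report count that reaches the threshold with no whitelistings."""
    n = 1
    while wilson_lower_bound(n, 0, z) < threshold:
        n += 1
    return n


if __name__ == "__main__":
    for threshold in (0.99, 0.85, 0.75):  # assumed, not from the original rules
        print(threshold, min_reports(threshold, 1.96))
```

Those assumed values happen to give 381, 22 and 12, the same figures as above, which shows how a new hash with only a handful of reports can never reach a high threshold however clean its record.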