On Sun, 26 Feb 2017 19:35:11 +1000
James Birkett wrote:

> Hi,
> 
> Bug 6108 says is that the pyzor plugin ignores whitelisting entirely,
> and a comment on that bug suggests updating the Pyzor plugin to use
> Wilson Score formula described here
> http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

SpamAssassin allows easy reporting of whole mailboxes of spam, but to
report ham you have to go to the website and type in captchas (I think
there are a few trusted admins that can bypass this). There's liable to
be a substantial bias in reporting, and the bias is likely to vary
strongly from hash to hash. I don't know for sure, but I suspect that a
lot of whitelisting counts may be historic.

Emails where the body contains only links or addresses get
predigested down to a null string, so they all share one hash, which
has a 26.6:1 ratio of spam reports to whitelistings. There are numerous
variants with additional
boilerplate text. To eliminate this class of FP you need a threshold of
at least 0.97.
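
To illustrate the arithmetic: a 26.6:1 ratio is a raw spam fraction of
26.6/27.6, about 0.964, and the Wilson lower bound always sits below
the raw fraction, so a 0.97 threshold can never be reached however many
reports accumulate. A minimal sketch, assuming the lower-bound formula
from the linked article and the conventional z = 1.96; the function name
and the report volumes are made up for the example:

```python
import math

def wilson_lower_bound(reports, whitelists, z=1.96):
    """Lower bound of the Wilson score interval for the spam fraction."""
    n = reports + whitelists
    if n == 0:
        return 0.0
    p = reports / n                      # raw spam fraction
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)

# The same 26.6:1 ratio at three hypothetical report volumes.
for reports, whitelists in [(266, 10), (2660, 100), (26600, 1000)]:
    bound = wilson_lower_bound(reports, whitelists)
    print(reports, whitelists, round(bound, 4), bound >= 0.97)
```

The bound creeps up towards 0.964 as the counts grow but stays under it,
so the last column is False at every volume.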

All in all I don't think it makes sense to use anything other than very
high thresholds. If you want to pursue this I would suggest using a
threshold of at least 0.99 and varying the z parameter instead.


> I have tested this using a spreadsheet and our own corpus of spam and
> ham email and got good results,

I think a lot of the benefit you are seeing comes from the fact that
you are rescanning corpus mail rather than using the pyzor counts at
delivery time.

Using Wilson doesn't just affect whitelisted emails; it also sets the
minimum number of reports needed for a pyzor hit. For your three rules
that rises from 5 to 381, 22 and 12 (assuming that the very low z value
in the first rule is an error). In production you will lose pyzor hits,
including the early hits on new hashes, which are the most valuable.

I think pyzor's own defaults are sensible - just ignore anything
that's whitelisted.  


> I wanted to make it possible to have multiple spamassassin rules with
> different wilson score parameters giving different spamassassin
> scores, but obviously we don't want to query pyzor multiple times on
> the same mail so I've changed the actual pyzor lookup to be done
> during extract_metadata() instead
> of when the eval-rule is run. 

The usual way this would be done is to have both functions look for
cached counts and run pyzor if they aren't there - see the BAYES_*
rules for an example.
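
The shape of that pattern, as a language-neutral sketch in Python rather
than SpamAssassin's Perl. Every name here (Message, run_pyzor,
get_counts and the two check functions) is hypothetical, and run_pyzor
is a stub that returns fixed counts instead of querying a server:

```python
class Message:
    """Hypothetical stand-in for the per-message scan state."""
    def __init__(self, body):
        self.body = body
        self.cache = {}

def run_pyzor(msg):
    """Stub for the real lookup; returns (reports, whitelists)."""
    run_pyzor.calls += 1
    return (120, 3)

run_pyzor.calls = 0

def get_counts(msg):
    """Return cached counts, doing the lookup only on first use."""
    if "pyzor_counts" not in msg.cache:
        msg.cache["pyzor_counts"] = run_pyzor(msg)
    return msg.cache["pyzor_counts"]

def check_ignore_whitelisted(msg, min_reports=5):
    reports, whitelists = get_counts(msg)
    return reports >= min_reports and whitelists == 0

def check_ratio(msg, min_ratio=0.95):
    reports, whitelists = get_counts(msg)
    total = reports + whitelists
    return total > 0 and reports / total >= min_ratio

msg = Message("example body")
print(check_ignore_whitelisted(msg), check_ratio(msg), run_pyzor.calls)
```

Whichever rule runs first pays for the lookup and the rest reuse it, so
the query happens at most once per message and not at all if no pyzor
rule is evaluated.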
