Thoughts about Future Rescoring
  ===============================
  Before that rescoring, we may want to have a serious discussion about
  reducing score pile-up in the case where multiple production DNSBLs
  all hit at the same time.  Adam Katz's approach is one possibility,
  albeit confusing to users because users see subtractions in the score
  reports.  There may be other, better approaches to this.

Do you mean rules like KHOP_DNSBL_BUMP and KHOP_DNSBL_ADJ?

The current score-setting algorithm seems to assume orthogonal rules, or
rather a set of rules that test independent properties.  DNSBLs (and
DNSWLs) are fundamentally different, because they are different entities'
estimates of the same underlying property.

Consider a world where 100K IP addresses send spam, and there are 8
DNSBLs.  Some list 80K of those addresses, some only 10K, and some also
list non-spammy addresses.  Absent concerns about training on noise, one
could take all 256 combinations of listed/not-listed, treat each
combination as a separate situation, and assign each its own score.  The
problem with this approach is that as the number of blacklists k grows,
2^k becomes big and the number of messages in many bins is too small for
reliable scoring.
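To make the bin-thinning concrete, here is a small illustrative sketch
(not SpamAssassin code; the `combo_key` encoding and the sample data are
invented) that bins messages by the exact combination of lists that hit:

```python
# Hypothetical sketch: bin messages by their exact combination of DNSBL
# hits, to see how thin the 2^k bins get.  All names and data here are
# illustrative, not SpamAssassin internals.
from collections import Counter

def combo_key(hits, k):
    """Encode which of the k lists hit as a bitmask, e.g. lists 0 and 2 -> 0b101."""
    return sum(1 << i for i in hits if i < k)

k = 3
# Each message is represented only by the set of DNSBLs (0..k-1) that
# listed its sending IP.
messages = [set(), {0}, {0, 1}, {0, 1}, {0, 2}, {0, 1, 2}, set(), {1}]

bins = Counter(combo_key(m, k) for m in messages)
for mask in range(2 ** k):
    print(f"{mask:0{k}b}: {bins.get(mask, 0)} messages")
```

Even with k=3 and 8 messages, several of the 8 bins are empty; with k=8
lists and 256 bins, many real-corpus bins would be similarly starved.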

If we force "not listed in any" to zero, much as rules that don't hit
contribute zero score, then for 2 BLs we have 3 rules: A, B, and A+B.
If A gets 2 points and B gets 1, and they largely overlap, then it seems
very likely that A+B deserves 2.2ish rather than 3.  If one accepts the
"score the overall situation" premise, letting all 3 scores float, then
the current method amounts to forcing the 3 scores into a particular
relationship that may not make sense.
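The arithmetic above, spelled out (the 2/1/2.2 numbers are the example
from the text; the "correction" framing corresponds to the subtraction
users would see in score reports):

```python
# Illustrative arithmetic only: under independent scoring, a message hit
# by both A and B gets score(A) + score(B); the suggestion is to let the
# joint case float to its own value instead.
score_A, score_B = 2.0, 1.0
joint_target = 2.2          # what "A and B together" arguably deserves

independent_sum = score_A + score_B            # 3.0 under the current method
correction_AB = joint_target - independent_sum # -0.8: the confusing negative
                                               # adjustment users would see
print(independent_sum, correction_AB)
```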

I suggest adding infrastructure to declare a set of k scoring rules as
non-independent, which has the effect of adding 2^k-k-1 joint-situation
rules that can then be assigned scores different from the sum of the
individual scores.  For k=3, one would need 7 rules total, and thus 4
more (AB, AC, BC, ABC).
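A sketch of what that declaration could generate (rule names and the
underscore-joining convention are invented for illustration):

```python
# Hypothetical sketch: given k rules declared non-independent, generate
# the 2^k - k - 1 joint-situation rules, i.e. all subsets of size >= 2.
from itertools import combinations

def joint_rules(rules):
    k = len(rules)
    joints = []
    for size in range(2, k + 1):
        for combo in combinations(rules, size):
            joints.append("_".join(combo))
    # Sanity check: k singletons plus these joints plus the all-clear
    # case account for all 2^k combinations.
    assert len(joints) == 2 ** k - k - 1
    return joints

print(joint_rules(["A", "B", "C"]))  # ['A_B', 'A_C', 'B_C', 'A_B_C']
```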

Then, when we find sets of rules that have high overlap in the corpus
and that additionally are differing evidence of the same underlying
property, we could add a grouped-rule declaration for them.
Arguably almost all rules are correlated.  But the real problem is that
ham coming from blacklisted IP addresses is given multiple penalties
calculated under an incorrect assumption of independence.  (The same
problem exists for spam from whitelisted IP addresses.)  So perhaps we
need a way to search for correlations in need of addressing, which I'd
define as those where the A+B score is significantly different from the
sum of the A and B scores.
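One way such a search might look, under loud assumptions: the scores
below are naive smoothed log-odds estimates from per-situation spam/ham
counts (not SpamAssassin's actual rescoring machinery), and the counts
are made up purely to illustrate the comparison:

```python
# Sketch of the proposed correlation search: estimate a score for each
# situation (A only, B only, both) from corpus counts, then flag the
# pair when the joint score deviates from the sum of the singles.
import math

def score(spam, ham):
    """Naive log-odds score for one situation, with +1 smoothing."""
    return math.log((spam + 1) / (ham + 1))

# Made-up (spam, ham) counts for messages hit by A only, B only, and both.
counts = {"A": (400, 60), "B": (300, 100), "A+B": (900, 30)}

s_a, s_b = score(*counts["A"]), score(*counts["B"])
s_joint = score(*counts["A+B"])
gap = s_joint - (s_a + s_b)   # "significantly different" = |gap| is large

print(f"A={s_a:.2f} B={s_b:.2f} A+B={s_joint:.2f} gap={gap:.2f}")
```

A large |gap| over enough messages would mark the pair as a candidate
for a grouped-rule declaration.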
