Thoughts about Future Rescoring
===============================

Before that rescoring, we may want to have a serious discussion about reducing score pile-up when multiple production DNSBLs all hit at the same time. Adam Katz's approach is one possibility, albeit confusing because users see subtractions in the score reports. There may be other, better approaches.
Do you mean rules like KHOP_DNSBL_BUMP and KHOP_DNSBL_ADJ? The current score-setting algorithm seems to assume orthogonal rules, or rather a set of rules that test independent properties. DNSBLs (and DNSWLs) are fundamentally different, because they are different entities' estimates of a single property.

Consider a world where 100K IP addresses send spam and there are 8 DNSBLs. Some list 80K, some only 10K, and some also list non-spammy addresses. Absent concerns about training on noise, one could take all 256 combinations of listed/not-listed, treat each combination as a separate situation, and assign each combination a score. The problem with this approach is that as you get k blacklists, 2^k becomes big and the number of messages in many bins is too small.

If we force "not listed in any" to zero, much as rules that don't hit score zero, then for 2 BLs we have 3 rules: A, B, and A+B. If A gets 2 points and B gets 1 and they largely overlap, then it seems very likely that A+B deserves about 2.2 rather than 3. If one accepts the "score the overall situation" premise, letting all 3 scores float, then the current method amounts to forcing the 3 scores into a particular relationship that may not make sense.

I suggest adding infrastructure to declare a set of k scoring rules as non-independent, which has the effect of adding 2^k - k - 1 joint-situation rules that can then be assigned scores different from the sum of the individual scores. For k=3, one would need 7 rules total, and thus 4 more (AB, AC, BC, ABC). Then, when we find sets of rules that have high overlap in the corpus, with the additional property that the rules are differing evidence of the same underlying property, we could add a grouped-rule declaration.

Arguably almost all rules are correlated. But the real problem is that ham coming from blacklisted IP addresses is given multiple penalties calculated under an incorrect assumption of independence.
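To make the joint-situation bookkeeping concrete, the extra rules for a grouped declaration can be generated mechanically: every subset of size 2 or more of the k rules, which is exactly 2^k - k - 1 rules. This is a toy sketch, not existing SpamAssassin infrastructure; the function and naming scheme are hypothetical:

```python
from itertools import combinations

def joint_rules(rules):
    """Generate the 2^k - k - 1 joint-situation rules for a group of
    k non-independent rules: one rule per subset of size >= 2."""
    joint = []
    for size in range(2, len(rules) + 1):
        for combo in combinations(rules, size):
            joint.append("_".join(combo))
    return joint

print(joint_rules(["A", "B", "C"]))
# ['A_B', 'A_C', 'B_C', 'A_B_C']
```

For k=3 this yields the 4 extra rules (AB, AC, BC, ABC) mentioned above; together with the 3 individual rules that makes 7 total.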
(The same problem exists for spam from whitelisted IP addresses.) So perhaps there is a way to search for correlations in need of addressing, which I'd define as cases where the A+B score is significantly different from the sum of the A and B scores.
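One toy way to run that search, assuming a corpus of (rules-hit, is-spam) records: estimate a smoothed log spam/ham ratio for each rule and for each pair, and flag pairs whose joint score deviates from what independence would predict. This is a rough heuristic sketch, not the actual rescorer; all names and the threshold are hypothetical:

```python
from itertools import combinations
from math import log

def score(msgs, cond):
    """Smoothed log spam/ham ratio among messages satisfying cond.
    msgs is a list of (hits, is_spam) pairs, hits a set of rule names."""
    spam = sum(1 for hits, is_spam in msgs if is_spam and cond(hits))
    ham = sum(1 for hits, is_spam in msgs if not is_spam and cond(hits))
    return log((spam + 0.5) / (ham + 0.5))

def correlated_pairs(msgs, rules, threshold=0.7):
    """Flag rule pairs whose joint score deviates from the sum of the
    individual scores (relative to the corpus baseline) by more than
    threshold -- i.e. pairs the independence assumption mis-scores."""
    base = score(msgs, lambda h: True)
    flagged = []
    for a, b in combinations(rules, 2):
        s_a = score(msgs, lambda h: a in h)
        s_b = score(msgs, lambda h: b in h)
        s_ab = score(msgs, lambda h: a in h and b in h)
        # Under independence, s_ab should be close to s_a + s_b - base;
        # a large negative deviation indicates score pile-up.
        deviation = s_ab - (s_a + s_b - base)
        if abs(deviation) > threshold:
            flagged.append((a, b, round(deviation, 2)))
    return flagged
```

On a corpus where A and B largely overlap, the deviation for (A, B) comes out strongly negative, matching the intuition that A+B deserves less than the sum of the individual scores.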
