http://bugzilla.spamassassin.org/show_bug.cgi?id=4052
------- Additional Comments From [EMAIL PROTECTED] 2005-01-03 11:04 -------

Hi -- these look very interesting, and I like the methodology! (I also notice that the recall/error rates have improved from the figures quoted in the presentations, according to the .cf file's comments; the current figures look very useful!)

Would it be possible for you to sign and fax an Apache CLA so that we can incorporate these (or at least test them)? Details are at:

http://www.apache.org/licenses/#clas

OK, a few questions:

1. In our experience, patterns which span 4 or more words each catch only a small set of spam, but do so with very low false positive rates, and so are often more effective than patterns which match only 1 or 2 words. Have you tried modifying the generator so that it generates longer patterns from the corpus? It would increase memory use in the generator, but should generate a smaller number of more-reliable rules that can supplement the shorter rules. This small set of long rules would then possibly warrant higher score values than the larger set of short rules.

2. We have poor support for decoding between character sets (e.g. converting all text strings in mails to UTF-8 where possible). Has this proved to be a noticeable issue for this ruleset? (Just wondering!)

3. Our default ruleset is not very good against Chinese mail in general, apparently missing a lot of spam and causing false positives on ham messages. It would be *very* useful if we could set up nightly mass-checks against a good Chinese-ham corpus, in order to avoid future FPs. There are two ways to do that -- either one of the existing developers obtains a (confidential) copy of the corpus and adds that to their collection, if that's permissible, or your group sets up a nightly mass-check as described here:

http://wiki.apache.org/spamassassin/NightlyMassCheck

If either would be possible, that would be really great ;)

--j.
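
To make question 1 concrete, here is a minimal sketch of the kind of longer-pattern extraction I have in mind. It is not the generator discussed in this bug; the function names (`word_ngrams`, `candidate_phrases`) and the `min_spam_hits` threshold are made up for illustration, and it assumes messages are already decoded to text and can be split into words with a simple regex. Chinese text would need a real word segmenter (or character n-grams) instead.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return the set of distinct n-word phrases in one message (hypothetical helper)."""
    # Assumption: a simple \w+ tokenizer is good enough; unsegmented
    # scripts such as Chinese would need a proper segmenter here.
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def candidate_phrases(spam_msgs, ham_msgs, n=4, min_spam_hits=2):
    """List n-word phrases seen in >= min_spam_hits spam messages and in no ham."""
    # Count each phrase once per spam message (document frequency).
    spam_counts = Counter()
    for msg in spam_msgs:
        spam_counts.update(word_ngrams(msg, n))

    # Holding every ham phrase in memory is the cost mentioned above:
    # longer patterns mean many more distinct phrases to track.
    ham_seen = set()
    for msg in ham_msgs:
        ham_seen |= word_ngrams(msg, n)

    hits = [p for p, c in spam_counts.items()
            if c >= min_spam_hits and p not in ham_seen]
    # Most frequent first, ties broken alphabetically for stable output.
    return sorted(hits, key=lambda p: (-spam_counts[p], p))
```

Each surviving phrase would be a candidate for one long rule; whether it deserves a higher score than the short rules is something only a mass-check against real corpora can tell.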
