https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114
--- Comment #12 from Justin Mason <j...@jmason.org>  2009-07-17 02:09:21 PST ---
(In reply to comment #11)
> Regexp::Assemble looks like the more interesting of the two, even if it's
> easier for me to split() the regexp into pieces and then add() them to the RA
> object. Sure enough, it was a quick edit (only 8 lines of code, and I think
> the resulting code is cleaner anyway). My main worry is that the optimization
> is more for regexp size than for performance.

well, it's testable; take a small selection of "test" Received IPs from your
corpus, put them in a perl script, then use Benchmark:

  $_ = "1.2.8.9";
  use Benchmark qw(:all);
  timethese(-2, {
    'R:A' => sub { /\b(?:1\.(?:2\.(?:3\.(?:4|5|6)|7\.(?:8|9))))\b/ },
    'plain' => sub { /\b(?:1\.2\.3\.4|1\.2\.3\.5|1\.2\.3\.6|1\.2\.7\.8|1\.2\.7\.9)\b/ },
  });

that'll produce a nice little chart telling you which one is faster.

(in that really basic example, it's "plain" if the test IP appears in the
list, or "R:A" if it doesn't, giving a demo of why you want to use better test
data if possible.)

> I've also merged TOP10+TOP20+TOP100+TOP200 into TOP200, which makes its
> definition 2751 characters with a slew of nesting after reduction via
> Regexp::Assemble, which is a thousand more than when it was just a list of
> SpamCop's 101-200 top offenders.
>
> I'm going to sit on it for a few days before pushing it here just in case it
> doesn't work well (though it's live on my sa-update channel).

Sounds good. +1

> Any comments on my conclusions when I said this?
> > Additionally, recall that I assigned a very small number of points to the
> > CIDR8 rules as I was fully expecting some FPs. I've even scored them a
> > little lower just in case, clocking in at 0.6 for TOP_CIDR8 and 0.2 for
> > CIDR8. Perhaps I'm not reading the score-map right, but 95.77% of the ham
> > hits scored under 3.999 (84.14% scored under 0.999), so a small bump won't
> > make a difference. Given the current data, T_KHOP_SC_CIDR8 would only add
> > points to ONE false positive hit (0.21% of the ham) and even if scored at
> > 2.0, it would create 23 FPs (4.87% of the 0.8152% of the hams, which is to
> > say 0.0397% of the ham). Scoring it 1.0 or less wouldn't actually have
> > added any FPs.

I'm not sure.

http://ruleqa.spamassassin.org/20090714-r793817-n/T_KHOP_SC_CIDR8/detail

  scoremap ham:  0  80.68%  668 ********************************
  scoremap ham:  1   8.09%   67 ***
  scoremap ham:  2   5.56%   46 **
  scoremap ham:  3   1.69%   14
  scoremap ham:  4   3.02%   25 *
  scoremap ham:  6   0.85%    7
  scoremap ham:  8   0.12%    1

http://ruleqa.spamassassin.org/20090714-r793817-n/KHOP_SC_TOP_CIDR8/detail

  scoremap ham:  0  66.49%  252 **************************
  scoremap ham:  1  21.90%   83 ********
  scoremap ham:  2   2.37%    9
  scoremap ham:  3   1.58%    6
  scoremap ham:  4   6.33%   24 **
  scoremap ham:  5   0.53%    2
  scoremap ham:  6   0.79%    3

The danger is those hits around 4. They may be _just_ under 5 points, in
which case those will be tipped over the FP threshold very easily.

In addition, that kind of "damned by association" rule will be very
contentious with people who find their mail is being marked as spam; there's
not much they can do about being in the same /8 as a bad spammer. I'd prefer
to "lock them down" at low values.

We could wait and see what Daryl's rescoring code makes of it... although that
doesn't seem to be running at the moment.

--j.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
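
For reference, here is a slightly extended sketch of the benchmark suggested in the comment. It runs against both a listed and an unlisted IP, since the comment notes that the winner depends on which case is tested. The two patterns are copied from the comment. The variable names and the "hit" address 1.2.7.9 are illustrative assumptions, not taken from any corpus. Only the core Benchmark module is used, and the nested pattern is hard-coded rather than generated with Regexp::Assemble.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The two patterns from the comment: the nested form (as Regexp::Assemble
# would emit it) and the flat alternation.
my $nested = qr/\b(?:1\.(?:2\.(?:3\.(?:4|5|6)|7\.(?:8|9))))\b/;
my $flat   = qr/\b(?:1\.2\.3\.4|1\.2\.3\.5|1\.2\.3\.6|1\.2\.7\.8|1\.2\.7\.9)\b/;

# Assumed test inputs: one IP that is in the list, one that is not.
my %inputs = (hit => '1.2.7.9', miss => '1.2.8.9');

# Sanity check first: both patterns must agree on every input,
# otherwise the timing comparison is meaningless.
for my $label (sort keys %inputs) {
    my $ip = $inputs{$label};
    my $n  = $ip =~ $nested ? 1 : 0;
    my $f  = $ip =~ $flat   ? 1 : 0;
    die "patterns disagree on $ip\n" unless $n == $f;
}

# Run each case for at least 1 CPU second and print a comparison chart.
for my $label (sort keys %inputs) {
    my $ip = $inputs{$label};
    print "--- $label ($ip) ---\n";
    cmpthese(-1, {
        'R:A'   => sub { $ip =~ $nested },
        'plain' => sub { $ip =~ $flat },
    });
}
```

Matching through precompiled qr// objects adds a small constant overhead to both variants, so the relative ordering is the thing to read from the chart rather than the absolute rates. Real Received IPs from a corpus would make better input than these two addresses.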