http://bugzilla.spamassassin.org/show_bug.cgi?id=3439

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED



------- Additional Comments From [EMAIL PROTECTED]  2004-11-07 22:48 -------
hrm.  well, the problem was twofold, but after fixing the code bugs and
making sure all the hits were valid FPs ...  the results really suck.

There are 4 different ways to get URIs in from HTML parsing: src=,
background=, href=, and action= (see HTML::html_uri for more details).  I
set up some test rules for each type, and one for the total.  src is the
best spam sign by S/O, but has a very low hit rate.  Everything else hits
more on ham.  I have no idea why they do it, but there are newsletters that
do this for no apparent reason:  '<a href="">Copyright</a>' (that was CNET,
BTW...)  I'm guessing that whatever macro/rewrite/text-vs-HTML editors they
use don't pay attention to when blank URIs are used.
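
For anyone who wants to check their own newsletters for this, here is a
rough sketch of the kind of check involved.  It is NOT the SpamAssassin
code (that lives in the Perl HTML parser); it is a standalone Python
illustration, and the names EmptyUriCounter and count_empty_uris are made
up for this example.  It also assumes a bare attribute with no value
(e.g. '<a href>') should count as empty, which may not match what the real
rules do.

```python
from html.parser import HTMLParser

# The four attribute types named in the comment above.
URI_ATTRS = ("src", "background", "href", "action")


class EmptyUriCounter(HTMLParser):
    """Hypothetical helper: counts empty URI attributes per attribute type."""

    def __init__(self):
        super().__init__()
        self.counts = {name: 0 for name in URI_ATTRS}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names; a bare attribute has value None.
        for name, value in attrs:
            if name in self.counts and (value is None or value.strip() == ""):
                self.counts[name] += 1


def count_empty_uris(html):
    """Return a dict mapping each URI attribute to its number of empty uses."""
    parser = EmptyUriCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    print(count_empty_uris('<a href="">Copyright</a>'))
```

Run against the CNET snippet quoted above, this reports one empty href and
nothing for the other three types.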

Results from last 90 days, ~120k mails:

OVERALL%   SPAM%     HAM%      S/O   RANK   SCORE  NAME
  0.085   0.0912   0.0151    0.858   1.00    0.01  T_EMPTY_URI_SRC
  0.293   0.2596   0.6653    0.281   0.33    0.01  T_EMPTY_URI
  0.157   0.1221   0.5443    0.183   0.33    0.01  T_EMPTY_URI_BG
  0.055   0.0537   0.0756    0.415   0.00    0.01  T_EMPTY_URI_HREF
  0.010   0.0087   0.0302    0.224   0.00    0.01  T_EMPTY_URI_ACTION
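
As a sanity check on the figures above: the fourth number in each row (S/O)
works out to the spam hit percentage divided by the sum of the spam and ham
hit percentages (second and third numbers).  A minimal sketch, with a
made-up function name, that reproduces the values in the table:

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O ratio: spam hit rate over combined spam + ham hit rate.

    Returns None when the rule hit nothing at all.
    """
    total = spam_pct + ham_pct
    if total == 0:
        return None
    return spam_pct / total


if __name__ == "__main__":
    # (spam%, ham%) pairs copied from the results table.
    rows = {
        "T_EMPTY_URI_SRC": (0.0912, 0.0151),
        "T_EMPTY_URI": (0.2596, 0.6653),
        "T_EMPTY_URI_BG": (0.1221, 0.5443),
        "T_EMPTY_URI_HREF": (0.0537, 0.0756),
        "T_EMPTY_URI_ACTION": (0.0087, 0.0302),
    }
    for name, (spam, ham) in rows.items():
        print(f"{spam_over_overall(spam, ham):.3f}  {name}")
```

Anything below 0.5 hits proportionally more ham than spam, which is the
case for every rule here except T_EMPTY_URI_SRC.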

The new fix and rules are committed, r56908.  We can see how it works for
everyone else, but judging from my results, this really sucks as a spam sign
due to the large number of legit newsletters which do this.



------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
