[Bug 3976] [RFE] Invisible URIs should tend to be ignored

bugzilla-daemon 23 Feb 2005 17:11:37 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=3976






------- Additional Comments From [EMAIL PROTECTED]  2005-02-23 09:11 -------
Subject: Re:  [RFE] Invisible URIs should tend to be ignored

> Could we possibly ignore invisible URIs instead of processing them, on the
> theory that they're pretty much useless?  Recent spams have been cited by
> Dave Funk as having 50-100 invisible URIs like:
> <A href="http://garbage.sitename.tld"></A>.
> Because we are testing every URI, even invisible ones, the actual clickable
> spam payload site URI is missed.

That's pretty much what my patch does.  Any anchor tags which have
nothing visible in the text part are "ignored".  Right now, as Quinlan
pointed out, we pick randomly from the list.

To clarify, right now we have 1 way of getting the URIs out of the message:
PMS::get_uri_list().  Everything calls this, including the urirbl plugin.
So currently, there's no way to know, after calling the function, how a URI
was found.  You just know it existed somewhere/somehow in the message.

My patch right now just ignores (via M::SA::HTML) "blank" URIs (anything our
renderer says is blank, i.e. nothing, comments, etc.).
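
As a rough illustration of what "blank" means here, the sketch below splits
anchor hrefs into those with some visible text and those without.  It is
written in Python with the standard html.parser module, not with
M::SA::HTML, and the names AnchorCollector and split_anchors are made up
for this example.  It only looks at text, so an anchor wrapping just an
image would also count as blank, which a real renderer may well treat
differently.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Split anchor hrefs into those with visible text and those without."""

    def __init__(self):
        super().__init__()
        self.visible = []   # hrefs whose anchor contains some text
        self.blank = []     # hrefs whose anchor renders as nothing
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._close_anchor()          # tolerate unclosed <a> tags
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_anchor()

    def _close_anchor(self):
        if self._href is None:
            return
        # Comments never reach handle_data, so they count as blank.
        target = self.visible if "".join(self._text).strip() else self.blank
        target.append(self._href)
        self._href = None

    def close(self):
        super().close()
        self._close_anchor()


def split_anchors(html):
    """Return (visible_hrefs, blank_hrefs) for an HTML fragment."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.blank
```

Calling split_anchors() on a message body gives two lists, so a caller can
skip the blank ones or merely count them.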

What I'm thinking of doing is making a new API for the whole thing.
Basically, have M::SA::HTML create different arrays for the different
ways to get a URI (there's ~10 IIRC).  get_uri_list() will simply
concat all these arrays together and return it as it does now.
get_detailed_uri_list() will return a hash of:

{
        'a' => { 'uri1' => 'uri1 text', 'uri2' => 'uri2 text', ... },
        'a_blank' => { 'uri3' => 'uri3 text', 'uri4' => 'uri4 text', ... },
        'form' => [ 'uri5' ],
        ...
}

Which can then be used as desired.  (things like 'img' may want to get
parameters about the image, but...)
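
To make the relationship between the two calls concrete, here is a small
Python sketch (not the actual Perl API, which does not exist yet).  The
dictionary mirrors the proposed return value of get_detailed_uri_list();
the URIs are placeholders and flatten_uri_list is a hypothetical helper
standing in for what get_uri_list() would keep doing.

```python
from itertools import chain

# Hypothetical result of the proposed get_detailed_uri_list(); the keys
# follow the example in the proposal, the URIs are placeholders.
detailed = {
    "a": {"http://uri1.example/": "uri1 text",
          "http://uri2.example/": "uri2 text"},
    "a_blank": {"http://uri3.example/": "uri3 text"},
    "form": ["http://uri5.example/"],
}


def flatten_uri_list(detailed):
    """Concatenate every per-source collection into one flat list.

    Iterating a dict yields its keys (the URIs) and iterating a list
    yields its items, so both shapes flatten the same way.
    """
    return list(chain.from_iterable(detailed.values()))


# A caller that wants to know where a URI came from uses the detailed
# form instead, e.g. to leave out the blank anchors:
non_blank = [uri for source, uris in detailed.items()
             if source != "a_blank" for uri in uris]
```

Existing callers would see no change from the flat list, while a plugin
such as urirbl could switch to the detailed form when it cares about the
source.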

> Jeff, I think the code does do that.  It also exposes a rule for blank
> hrefs, as a bonus -- it's that which is FPing.

Right.

> Theo -- it occurred to me that matching (# of blank hrefs) / (message length)
> might work better, since most of these spams are short and the hams long,
> I think.

Hrm.  Have to check that out. :)
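
For reference, the suggested measure is just a ratio.  A minimal Python
sketch follows; the function name is hypothetical, measuring length in
characters is an assumption, and the figures are invented for illustration
rather than taken from any corpus.

```python
def blank_href_density(blank_href_count, message_length):
    """Blank hrefs per character of message; 0.0 for an empty message."""
    if message_length <= 0:
        return 0.0
    return blank_href_count / message_length


# Invented figures: the same number of blank hrefs weighs far more in a
# short message than in a long one.
short_message = blank_href_density(50, 2_000)
long_message = blank_href_density(50, 200_000)
```

Any cutoff for such a rule would have to come from checking it against
real ham and spam.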

> Right now, we do the fairly safe thing of choosing the maximum number of
> domains randomly from the list.  Eliminating invisible ones would be a
> potential disaster since spammers might be able to figure out a way to
> make domains appear to be invisible to our renderer, but be VISIBLE in
> some mail programs.

True, but that'll more than likely cause us other issues.

> I'd favor an approach that would give slightly better odds (like 2-to-1)
> for more visible URLs over less visible URLs, but I think a larger shift
> would be too juicy a reward and possibly lead to a lesser disaster.

In the new scheme (above), this is possible since the plugin can simply get
the list of all URIs and then do whatever it wants based on where it came
from.  Basically:

@list = @a;
while @list > $max
  <remove 1 randomly from @list>
while @list < $max and "everything else but a and a_blank" is not empty
  <move 1 randomly from "everything else but a and a_blank" to @list>
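
The same idea as a runnable Python sketch, assuming the detailed
structure proposed earlier in this comment (source name mapped to a
collection of URIs).  select_uris and its parameters are hypothetical
names, and the loop stops once the other sources run out so it cannot
spin forever.

```python
import random


def select_uris(detailed, max_uris, rng=random):
    """Prefer visible anchor URIs, then top up from the other sources."""
    # Start from the visible anchor URIs.
    chosen = list(detailed.get("a", ()))

    # Too many: remove random entries until the list fits.
    while len(chosen) > max_uris:
        chosen.pop(rng.randrange(len(chosen)))

    # Too few: top up at random from every source except a and a_blank.
    others = [uri for source, uris in detailed.items()
              if source not in ("a", "a_blank") for uri in uris]
    while len(chosen) < max_uris and others:
        chosen.append(others.pop(rng.randrange(len(others))))

    return chosen
```

As in the outline above, blank anchors are never drawn from, even when
fewer than max_uris URIs are available; a plugin wanting softer odds
could weight the sources instead of excluding one outright.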




