At 16:08 2001-08-31 -0600, Rodney Wines wrote:
>[...]The problem is, the spammer uses really convoluted URL's that
>Samspade can't make any sense of.  Here is an example.
>
><a href=3D"http://www.signal.boost.net.co.fr&#20;&#2=
>;&#20;&#5;&#20;|https.intba.bzah.com/cell/booster/order.html">

I think there are five possible levels of obfuscation here:

1) MIME encoding (quoted-printable).  That's what turned the "="
in "href=" into "href=3D", and what broke "&#2;" into
"&#2=[newline];".  Ideally this is all resolved by a MIME-aware
mailreader.

2) use of entity references (&foo;, &#num;, &#xhexnum;).  These
can be resolved with HTML::Entities.

3) use of Javascript convolution (none here).  Infinite
potential for convolution here, and good luck getting around it.
I think the only thing you can bank on is the near-total laziness
and stupidity of spammers.

4) whatever this weirdness is with the &#20;&#2;&#20;&#5;&#20;
-- I've no idea how these characters are supposed to have any
special meaning, but apparently they do.  Or do they?

5) URL-encoding.  There's no URL-encoding here, but I've seen
plenty of it before.  It's pretty much resolvable with URI.pm:

use strict;
use warnings;
use URI;

# Percent-encode every character except "/" as a two-digit hex escape.
sub obf { my $s = shift; $s =~ s/([^\/])/sprintf '%%%02x', ord $1/eg; return $s }
my $obf = 'http://' . obf('www.perl.com/pub/a/2001/08/27/bjornstad.html');

print "obf: $obf\n";
my $x = URI->new($obf);
print "normalized: ", $x, "\n";
print "canonical: ", $x->canonical, "\n";

Output (wrapped for readability):

obf: http://%77%77%77%2e%70%65%72%6c%2e%63%6f%6d/%70%75%62/%61/%32
  %30%30%31/%30%38/%32%37/%62%6a%6f%72%6e%73%74%61%64%2e%68%74%6d%6c
normalized: http://%77%77%77%2e%70%65%72%6c%2e%63%6f%6d/%70%75%62/%61/%32
  %30%30%31/%30%38/%32%37/%62%6a%6f%72%6e%73%74%61%64%2e%68%74%6d%6c
canonical: http://www.perl.com/pub/a/2%30%301/%308/27/bjornstad.html


That's with URI.pm 1.11.  Hm, odd that "%32%30%30%31" canonicalizes
as "2%30%301", not "2001".  Gisle?
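
If the goal is just to read what an obfuscated URL says, something like the following sketch might do. It uses uri_unescape() from URI::Escape (part of the URI distribution) rather than canonical(). The variable names are mine, and the sample is the same obfuscated URL as in the output above, joined back onto one line.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

# The obfuscated URL from the output above, joined back onto one line.
my $obf = 'http://%77%77%77%2e%70%65%72%6c%2e%63%6f%6d/%70%75%62/%61/'
        . '%32%30%30%31/%30%38/%32%37/'
        . '%62%6a%6f%72%6e%73%74%61%64%2e%68%74%6d%6c';

# uri_unescape() replaces every %XX escape with the byte it encodes.
my $plain = uri_unescape($obf);

print "unescaped: $plain\n";
```

Note that uri_unescape() decodes reserved characters too (an escaped "/" or "?" comes back as the literal character), so the result is for human inspection, not necessarily a URL equivalent to the original.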

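Levels 1, 2 and 4 above can be sketched along these lines. This is only a rough illustration: MIME::QuotedPrint is my own choice for undoing the quoted-printable layer, and the regex that grabs the href and the variable names are made up for this example.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use HTML::Entities qw(decode_entities);

# The quoted fragment from Rodney's message, minus the ">" quote markers.
my $raw = <<'END';
<a href=3D"http://www.signal.boost.net.co.fr&#20;&#2=
;&#20;&#5;&#20;|https.intba.bzah.com/cell/booster/order.html">
END

# Level 1: undo quoted-printable ("=3D" becomes "=", "=\n" soft breaks vanish).
my $html = decode_qp($raw);

# Pull out the href value. A regex is enough for this one tag; real mail
# would want a proper HTML parser.
my ($href) = $html =~ /href="([^"]*)"/
    or die "no href found\n";

# Level 2: resolve the entity references.
my $url = decode_entities($href);

# Level 4: list the control characters that the entities decoded to.
my @ctrl = map { ord($_) } grep { /[[:cntrl:]]/ } split //, $url;

print "href:    $href\n";
print "control: @ctrl\n";
```

The last line lists the code points 20, 2, 20, 5, 20. Decimal 20 is a control character (DC4), not a space; a space would be &#32; or &#x20;. Whether any browser gives those characters special treatment is not something this sketch answers.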

--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
