At 16:08 2001-08-31 -0600, Rodney Wines wrote:
>[...]The problem is, the spammer uses really convoluted URL's that
>Samspade can't make any sense of. Here is an example.
>
><a href=3D"http://www.signal.boost.net.co.fr=
>;|https.intba.bzah.com/cell/booster/order.html">
I think there are five levels of possible obfuscation here:
1) MIME encoding (quoted-printable). That's what turned the "=" in
"href=" into "=3D", and what broke "&2;" into "&2=[newline];".
Ideally this is all resolved by a MIME-aware mailreader.
2) use of entity references (&foo;, &#num;, &#xhexnum;). These
can be resolved with HTML::Entities.
3) use of Javascript convolution (none here). Infinite
potential for convolution here, and good luck getting around it.
I think the only thing you can bank on is the near-total laziness
and stupidity of spammers.
4) whatever this weirdness is with the 
-- I've no idea how these characters are supposed to have any
special meaning, but apparently they do. Or do they?
5) URL-encoding. There's no URL-encoding here, but I've seen
plenty of it before. It's pretty much resolvable with URI.pm:
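use strict;
use warnings;
use URI;
# Percent-encode every character except "/" (two hex digits, zero-padded).
sub obf { local $_ = $_[0]; s/([^\/])/sprintf '%%%02x', ord $1/eg; $_ }
my $obf = 'http://' . obf('www.perl.com/pub/a/2001/08/27/bjornstad.html');
print "obf: $obf\n";
my $x = URI->new($obf);
print "normalized: ", $x, "\n";            # the URI object, stringified
print "canonical: ", $x->canonical, "\n";  # after URI.pm's canonicalization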
Output (wrapped for readability):
obf: http://%77%77%77%2e%70%65%72%6c%2e%63%6f%6d/%70%75%62/%61/%32
%30%30%31/%30%38/%32%37/%62%6a%6f%72%6e%73%74%61%64%2e%68%74%6d%6c
normalized: http://%77%77%77%2e%70%65%72%6c%2e%63%6f%6d/%70%75%62/%61/%32
%30%30%31/%30%38/%32%37/%62%6a%6f%72%6e%73%74%61%64%2e%68%74%6d%6c
canonical: http://www.perl.com/pub/a/2%30%301/%308/27/bjornstad.html
That's with URI.pm 1.11. Hm, odd that "%32%30%30%31" canonizes as
"2%30%301", not "2001". Gisle?
--
Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/
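
P.S. A minimal sketch of peeling off layers 1, 2 and 5 from the list above in one pass, assuming MIME::QuotedPrint, HTML::Entities and URI::Escape are installed. The sub name deobfuscate_href and the sample link are made up for illustration; this is not the spammer's actual URL. It does nothing about Javascript, and since uri_unescape decodes every %XX (including reserved characters such as %2F), the result is meant for reading, not for re-fetching.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use HTML::Entities qw(decode_entities);
use URI::Escape qw(uri_unescape);

# Hypothetical helper: take a raw quoted-printable HTML fragment and
# return the first href value with the three encoding layers removed.
sub deobfuscate_href {
    my ($raw) = @_;

    # Layer 1: quoted-printable. "=3D" becomes "=", and a trailing "="
    # followed by a newline (a soft line break) is removed.
    my $html = decode_qp($raw);

    # Pull out the first double-quoted href value, if there is one.
    my ($href) = $html =~ /href="([^"]*)"/i;
    return unless defined $href;

    # Layer 2: entity references such as &#119; or &amp;.
    $href = decode_entities($href);

    # Layer 5: URL-encoding (%XX escapes).
    return uri_unescape($href);
}

# Made-up sample: a soft line break inside the host name, one numeric
# entity, and two percent-escapes in the path.
my $raw = qq{<a href=3D"http://&#119;ww.ex=\nample.com/%6f%72der.html">};
print deobfuscate_href($raw), "\n";   # prints http://www.example.com/order.html
```

The order matters: the quoted-printable layer has to come off first, because a soft line break can split an entity reference or a %XX escape in two.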