problems detecting URIs embedded in JIS encoding

Simon McCorkindale Sun, 07 Aug 2005 23:50:43 -0700

Platform: FreeBSD 5.4-RC3
Perl: 5.8.6
SpamAssassin: 3.0.4

I'm a volunteer for the www.rbl.jp project and I think I've come across
a bug in SA. I searched for any previous posts of this bug but couldn't
find anything. I know this isn't the right place to post bugs but I want
to discuss my attempts to fix it.


The problem is that when certain Japanese characters from the JIS
character set immediately follow a URI, the URI is not detected properly.

The URL I used for testing is listed in our url.rbl.jp black list and
numerous others. It is http://www.j-*sine.com but with the * removed
(just to make sure this mail gets through the mailing list :-)

If any JIS characters immediately follow the m at the end of
j-sine.com, then what is extracted is http://www.j-*sine.com plus a
chunk of the JIS characters.

Hence, when SpamAssassin queries url.rbl.jp to see if this URL is
registered it gets a not-registered reply.

I had a hunt through the Perl code and did many test simulations and
managed to track the source of the problem down to PerMsgStatus.pm.
The regular expressions for detecting URIs are defined between lines
1733 and 1745 of this file. I'm no wizard with regular expressions, so a
lot of it is over my head.

Using my old friend od I tracked the culprit JIS character down. It
seems to be the ESC (hex 1B) character. I don't know much about JIS but
I'm guessing this is used to define the start of a string of JIS
characters.
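
For anyone without od to hand, the same check can be sketched in Perl.
This is only an illustration: the sample bytes are made up for the
example (the final "m" of a hostname followed by a short ISO-2022-JP
run), and hex_dump is a hypothetical helper, not anything from
SpamAssassin.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs,
# roughly what `od -An -tx1` shows.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Illustrative sample (an assumption, not the real spam message): the last
# "m" of a hostname, then ESC $ B (switch to JIS X 0208), four payload
# bytes (two double-byte characters), then ESC ( B (back to ASCII).
my $sample = "m" . "\x1b\$B" . "F|K\\" . "\x1b(B";

print hex_dump($sample), "\n";
# prints: 6d 1b 24 42 46 7c 4b 5c 1b 28 42

# Byte offset of the first ESC, i.e. the byte right after the URI.
print index($sample, "\x1b"), "\n";
# prints: 1
```

In ISO-2022-JP the payload bytes themselves fall in the printable ASCII
range, so the 1b bytes are the only ones that stand out in a dump.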

On line 1735 of PerMsgStatus.pm there is the line:

my $unreserved = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1f";

so I modified it to:

my $unreserved = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1a\x1c-\x1f";

so that \x1b isn't included and this seems to have solved the problem.
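
The effect of the change can be reproduced in isolation with a rough
sketch like the one below. It is not the SpamAssassin code: the value
of $mark, the extract_uri helper, the example.com URL and the much
simplified pattern are all assumptions made for illustration. Only the
two character-class strings are taken from the lines quoted above.

```perl
use strict;
use warnings;

# Assumption: a stand-in for the "mark" characters. The real value is
# defined in PerMsgStatus.pm and may differ.
my $mark = q{-_.!~*'()};

# Character-class bodies as quoted above: original and modified.
my $with_esc    = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1f";
my $without_esc = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1a\x1c-\x1f";

# Hypothetical helper: a heavily simplified URI matcher that accepts only
# "unreserved" characters after the scheme.
sub extract_uri {
    my ($class, $text) = @_;
    return $text =~ m{(http://[$class]+)} ? $1 : undef;
}

# Make ESC bytes visible when printing.
sub visible {
    my ($s) = @_;
    $s =~ s/\x1b/<ESC>/g;
    return $s;
}

# Placeholder URL immediately followed by an ISO-2022-JP run.
my $text = "http://www.example.com" . "\x1b\$BF|K\\\x1b(B";

print visible(extract_uri($with_esc, $text)), "\n";
# prints: http://www.example.com<ESC>
print visible(extract_uri($without_esc, $text)), "\n";
# prints: http://www.example.com
```

In this simplified form only the ESC byte is swallowed, because the
sketch leaves out the reserved characters. The real pattern is more
permissive, which would presumably explain why a larger chunk of the
JIS sequence ends up attached to the URI.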

I think this is an ugly hack that probably breaks other things or goes
against certain rules, but I would like to hear anybody's ideas on this
dilemma.

Thanks in advance,
Simon.
