https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8272

--- Comment #6 from Sidney Markowitz <sid...@sidney.com> ---
There is a further problem revealed by this test case after I strip out the
http://user:pass@ prefix.

The text/html base64 section is declared to have a UTF-8 charset. When the
base64-decoded result is then charset-decoded in Node.pm _normalize(), the
decode fails because the content contains a byte that is not valid UTF-8.

 dbg: message: failed decoding as charset UTF-8, declared UTF-8 (UTF-8 "\\xB1"
does not map to Unicode)
 dbg: message: decoded as last-resort charset Windows-1252, declared UTF-8
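For anyone who wants to reproduce the failure outside of SpamAssassin, here is
a minimal sketch using the Encode module directly (the byte string is a made-up
example, not the actual test case payload); a strict decode with FB_CROAK
croaks on the first invalid byte, producing the same kind of message as the
dbg line above:

 #!/usr/bin/perl
 use strict;
 use warnings;
 use Encode qw(decode FB_CROAK LEAVE_SRC);

 # otherwise-valid UTF-8 with one stray \xB1 byte mixed in
 my $bytes = "caf\xC3\xA9 \xB1 http://example.com/";

 my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
 print defined $text
     ? "decoded OK\n"
     : "strict UTF-8 decode failed: $@";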

When the string is instead decoded as charset Windows-1252, every multi-byte
UTF-8 character in the URL turns into a sequence of unrelated one-byte
non-ASCII characters, destroying the URL parsing.
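To illustrate what that fallback does to a URL (an assumed example with a
Cyrillic hostname, not the actual message content): each two-byte UTF-8
sequence decodes as two separate Windows-1252 characters, so the text the URL
parser sees no longer resembles the original:

 #!/usr/bin/perl
 use strict;
 use warnings;
 use Encode qw(decode);

 binmode STDOUT, ':encoding(UTF-8)';

 # "пример" is 6 characters but 12 UTF-8 bytes on the wire
 my $bytes = "http://\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80.example/";

 print decode('UTF-8',        $bytes), "\n";  # http://пример.example/
 print decode('Windows-1252', $bytes), "\n";  # http://Ð¿Ñ€Ð¸Ð¼ÐµÑ€.example/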

The smallest change I can think of to fix this is in the code that now says

 elsif ($tried_utf8 && $chset eq 'UTF-8') {
   # was already tried initially, no point doing again
 }

I propose changing it to try decoding as UTF-8 again, but non-strictly, i.e.,
without the FB_CROAK flag, so the decode can succeed even when there are a few
non-UTF-8 bytes that will be mis-decoded.
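
As a sketch of what I have in mind (illustrative only, standing in for the
real _normalize() loop): if the strict decode croaks, retry without FB_CROAK,
so the one bad byte becomes U+FFFD while the valid multi-byte characters in
the URL come through intact:

 #!/usr/bin/perl
 use strict;
 use warnings;
 use Encode qw(decode FB_CROAK FB_DEFAULT LEAVE_SRC);

 binmode STDOUT, ':encoding(UTF-8)';

 my $bytes = "\xB1 http://\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80.example/";

 my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
 unless (defined $text) {
     # lenient retry: FB_DEFAULT substitutes U+FFFD for the bad byte
     $text = decode('UTF-8', $bytes, FB_DEFAULT);
 }
 print "$text\n";   # "\x{FFFD} http://пример.example/" - URL survives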

What do people think? Any alternative suggestions?
