[Bug 7126] New: Incorrect character set detections by normalize_charset

bugzilla-daemon Thu, 29 Jan 2015 07:03:02 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126


            Bug ID: 7126
           Summary: Incorrect character set detections by
                    normalize_charset
           Product: Spamassassin
           Version: 3.4.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Libraries
          Assignee: [email protected]
          Reporter: [email protected]

Several of our local mail messages are treated by
MS::Message::Node::_normalize() / normalize_charset as being written
in Far Eastern character sets and are decoded as such. This clearly
makes no sense: they are actually in UTF-8, Windows-1252 or ISO-8859-2.
To investigate, I wrote an alternative reference implementation of
_normalize(), compared the results of the two, and manually checked
the reported differences.

In our case, one day's statistics show that more than 8% of the
decisions taken by _normalize() were wrong. The most common
differences were:

- decoded as big5      (should be decoded as iso-8859-2)
- decoded as euc-kr    (should be decoded as utf-8)
- decoded as euc-jp    (should be decoded as utf-8)
- decoded as shift_jis (should be decoded as windows-1252)
- decoded as utf-8     (should be decoded as windows-1252)
- not decoded          (should be decoded as gb2312)
- not decoded          (should be decoded as gbk)
- not decoded          (should be decoded as utf-8)

In my opinion, the source of the problem is that the existing
_normalize() relies too heavily on Encode::Detect::Detector
and the underlying Mozilla "universal charset detector",
instead of trusting the character set declared in Content-Type
and falling back to guesswork only when the declared character
set seems inconsistent with the actual contents of a message part.
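
The "declared first, guess second" order can be sketched roughly as
follows. This is a Python illustration of the idea, not SpamAssassin
code and not the reference implementation mentioned above. The name
decode_part and the guess callback (a stand-in for a detector) are
hypothetical, and "inconsistent" is assumed here to mean simply that a
strict decode with the declared charset fails.

```python
def decode_part(raw, declared=None, guess=None):
    """Return (text, charset_used), or (None, None) if not decoded.

    raw      -- bytes of the message part
    declared -- charset name taken from Content-Type, if any
    guess    -- hypothetical callable(bytes) -> charset name or None,
                consulted only when the declared charset does not fit
    """
    if declared:
        try:
            return raw.decode(declared, errors="strict"), declared
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in it:
            # treat the declaration as unreliable and fall through.
            pass

    if guess is not None:
        guessed = guess(raw)
        if guessed:
            try:
                return raw.decode(guessed, errors="strict"), guessed
            except (LookupError, UnicodeDecodeError):
                pass

    return None, None


# The declaration fits, so the detector is never consulted.
latin2 = "Příliš žluťoučký kůň".encode("iso-8859-2")
print(decode_part(latin2, declared="iso-8859-2", guess=lambda b: "big5"))

# The declaration does not fit (these bytes are not valid UTF-8),
# so the guess is used instead.
cp1252 = "café".encode("windows-1252")
print(decode_part(cp1252, declared="utf-8", guess=lambda b: "windows-1252"))
```

A strict decode is only a weak consistency check for single-byte
character sets, where nearly every byte sequence decodes without error,
so a real implementation would likely need a stronger test there.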

Relying primarily on guesswork may have made good sense
ten years ago, and it probably still produces sensible results
in the Far East (as it errs on the side of Far Eastern character
sets). But now that UTF-8 is much more widespread, the logic
is, in my opinion, flawed.

