https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
Bug ID: 7126
Summary: Incorrect character set detections by normalize_charset
Product: Spamassassin
Version: 3.4.0
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: P2
Component: Libraries
Assignee: [email protected]
Reporter: [email protected]
Several of our local mail messages are treated by
MS::Message::Node::_normalize() / normalize_charset as being written
in Far Eastern character sets and are decoded as such, which clearly
does not make sense (they are actually in UTF-8, Windows-1252 or
ISO-8859-2). To investigate, I wrote an alternative reference
implementation of _normalize(), compared the results of the two, and
manually checked the reported differences.
In our case, one day of statistics shows that more than 8% of the
decisions taken by _normalize() were wrong. The most common
differences were:
- decoded as big5 (should be decoded as iso-8859-2)
- decoded as euc-kr (should be decoded as utf-8)
- decoded as euc-jp (should be decoded as utf-8)
- decoded as shift_jis (should be decoded as windows-1252)
- decoded as utf-8 (should be decoded as windows-1252)
- not decoded (should be decoded as gb2312)
- not decoded (should be decoded as gbk)
- not decoded (should be decoded as utf-8)
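
To illustrate why content-based guessing can produce mismatches like
those listed above, here is a small Python sketch (not SpamAssassin
code; the helper name valid_charsets is hypothetical). It shows that
the same short byte sequence can decode without error under several
character sets, so byte validity alone does not identify the charset.

```python
from typing import Iterable, List


def valid_charsets(raw: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate charsets under which raw decodes strictly."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name, errors="strict")
        except (LookupError, UnicodeDecodeError):
            continue  # unknown charset, or bytes invalid in this charset
        ok.append(name)
    return ok


# "é" encoded as UTF-8 is the two bytes C3 A9.
raw = "é".encode("utf-8")

# The same two bytes are also a valid EUC-KR double-byte character and
# two valid Windows-1252 characters; only plain ASCII rejects them.
print(valid_charsets(raw, ["utf-8", "euc_kr", "cp1252", "ascii"]))
print(repr(raw.decode("cp1252")))
```

A real detector weighs character frequencies rather than mere validity,
but on short or mostly-ASCII message parts it has very little evidence
to work with.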
In my opinion, the source of the problem is that the existing
_normalize() relies too heavily on Encode::Detect::Detector and the
underlying Mozilla universal charset detector. It should instead
trust the character set declared in the Content-Type header and fall
back to guesswork only when the declared character set seems
inconsistent with the actual contents of a message part.
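
The declared-first order can be sketched as follows. This is a minimal
Python illustration, not the reference implementation mentioned in
this report and not SpamAssassin's Perl code. The names decode_part
and guess are hypothetical, and "inconsistent" is approximated here by
a strict decoding failure, which is an assumption of this sketch.

```python
from typing import Callable, Optional, Tuple

# A guesser takes raw bytes and returns a charset name, or None.
Guesser = Callable[[bytes], Optional[str]]


def _try_decode(raw: bytes, charset: str) -> Optional[str]:
    """Strictly decode raw; return None if charset is unknown or bytes are invalid."""
    try:
        return raw.decode(charset, errors="strict")
    except (LookupError, UnicodeDecodeError):
        return None


def decode_part(raw: bytes, declared: Optional[str],
                guess: Optional[Guesser] = None) -> Tuple[str, Optional[str]]:
    """Return (text, charset_used); charset_used is None if nothing fit."""
    # 1. Trust the declared charset whenever the bytes are valid in it.
    if declared:
        text = _try_decode(raw, declared)
        if text is not None:
            return text, declared
    # 2. Only now fall back to guesswork (for example a charset detector).
    if guess is not None:
        guessed = guess(raw)
        if guessed:
            text = _try_decode(raw, guessed)
            if text is not None:
                return text, guessed
    # 3. Nothing fit: keep ASCII, mark every other byte as undecodable.
    return raw.decode("ascii", errors="replace"), None


# Declared charset is consistent with the bytes, so no guessing happens.
print(decode_part("žluťoučký".encode("iso-8859-2"), "iso-8859-2"))
# Declared charset is wrong, so the (stubbed) guesser is consulted.
print(decode_part("café".encode("utf-8"), "us-ascii", guess=lambda b: "utf-8"))
```

Note that a strict decode is a weak consistency test for single-byte
character sets such as ISO-8859-2, which accept almost any byte
sequence, so a real implementation would need a stronger check there.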
While relying primarily on guesswork may have made good sense
ten years ago, and probably still produces sensible results
in the Far East (as it errs on the side of Far Eastern character
sets), UTF-8 is much more widespread nowadays, and in my opinion
the logic is now flawed.