[Bug 7126] Incorrect character set detections by normalize_charset

bugzilla-daemon Tue, 03 Feb 2015 07:51:19 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126


--- Comment #6 from Mark Martinec <[email protected]> ---
Some interesting statistics, collected from 100,000 textual mail parts
as seen in two working days at our site. A single mail message can
be counted as more than one part (e.g. text/plain + text/html in case
of multipart/alternative), so the number of mail messages analyzed is
slightly less than half that number.

The debug messages were filtered with a grep for ': message: .*charset'
and sorted into the following groups:

  11.1%  true US-ASCII (kept unchanged)
  67.5%  valid UTF-8 as declared in Content-Type (kept unchanged)
   0.2%  valid UTF-8 as detected/guessed (kept unchanged)
  20.8%  decoded (non-UTF-8) as declared in Content-Type
   0.4%  decoded (non-UTF-8) as detected/guessed

The 'decoded' and 'as detected/guessed' groups only occur with the setting:
  normalize_charset 1
(otherwise these would just have been kept as unchanged octets / Mojibake).
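
To make the groups above concrete, here is a rough Python sketch of how
one textual part could be placed into such a bucket. This is only an
illustration, not SpamAssassin's actual code: the function name
classify_part and the bucket labels are hypothetical, and the real
normalize_charset logic may differ (for instance, this sketch treats any
ASCII-only body as US-ASCII regardless of the declared charset).

```python
from collections import Counter
from typing import Optional


def classify_part(body: bytes, declared: Optional[str] = None) -> str:
    """Place one textual part into a coarse bucket (hypothetical helper)."""
    # Pure 7-bit content needs no decoding at all.
    if body.isascii():
        return "US-ASCII"

    declared_is_utf8 = declared is not None and declared.lower() in ("utf-8", "utf8")

    # Octets that form valid UTF-8 can be kept unchanged.
    try:
        body.decode("utf-8")
        return "UTF-8 as declared" if declared_is_utf8 else "UTF-8 as guessed"
    except UnicodeDecodeError:
        pass

    # Otherwise try the charset named in Content-Type, if there is one.
    if declared and not declared_is_utf8:
        try:
            body.decode(declared)
            return "decoded as declared"
        except (LookupError, UnicodeDecodeError):
            pass

    # Nothing fits; a real implementation would have to guess the charset here.
    return "needs guessing"


# Tally a few made-up sample parts (assumed data, for illustration only).
samples = [
    (b"plain text", "us-ascii"),
    ("caf\u00e9".encode("utf-8"), "utf-8"),
    ("caf\u00e9".encode("latin-1"), "iso-8859-1"),
]
tally = Counter(classify_part(body, charset) for body, charset in samples)
print(tally)
```

Only the last two outcomes involve any change to the octets, which is
why they depend on normalize_charset being enabled.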

Summarizing the above further yields:

  11.1%  true US-ASCII (kept unchanged)
  67.6%  is UTF-8      (kept unchanged)
  21.3%  decoded into UTF-8 (when normalize_charset is enabled)

So, 67.6% of textual parts are natively UTF-8, and 88.9% end up
as UTF-8 if normalize_charset is enabled. The remaining 11.1%
of textual mail parts are just plain ASCII text.
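
The arithmetic can be re-checked from the five detailed groups with a
few lines of Python. This is a sketch that works only from the
one-decimal percentages posted above, not from the raw counts.
Recombining them gives 67.7% and 21.2%, each within 0.1 of the 67.6% and
21.3% quoted in the summary; presumably (an assumption) the summary
figures were rounded from the underlying counts instead. The 88.9% total
comes out the same either way.

```python
from decimal import Decimal

# Percentages as posted in the detailed breakdown (one decimal each).
detailed = {
    "true US-ASCII": Decimal("11.1"),
    "UTF-8 as declared": Decimal("67.5"),
    "UTF-8 as detected/guessed": Decimal("0.2"),
    "decoded as declared": Decimal("20.8"),
    "decoded as detected/guessed": Decimal("0.4"),
}

total = sum(detailed.values())
native_utf8 = detailed["UTF-8 as declared"] + detailed["UTF-8 as detected/guessed"]
decoded = detailed["decoded as declared"] + detailed["decoded as detected/guessed"]
ends_up_utf8 = native_utf8 + decoded

print("total:", total)
print("natively UTF-8:", native_utf8)
print("decoded into UTF-8:", decoded)
print("UTF-8 after normalization:", ends_up_utf8)
```

Decimal is used here so that the sums are exact and not subject to
binary floating-point rounding.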


Interestingly (although not directly comparable), our 88.9% UTF-8 figure
corresponds closely to the 82.5% in "Usage of character encodings
for websites", January 2015:
  "UTF-8 is used by 82.5% of all the websites whose character
   encoding we know."
http://w3techs.com/technologies/overview/character_encoding/all

