https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #6 from Mark Martinec <[email protected]> ---
Some interesting statistics, collected from 100,000 textual mail parts
as seen in two working days at our site. A single mail message can be
counted as more than one part (e.g. text/plain + text/html in the case of
multipart/alternative), so the number of mail messages analyzed is
slightly less than half that many.

The debug messages were grepped by a ': message: .*charset'
and grouped into the following groups:

  11.1%  true US-ASCII (kept unchanged)
  67.5%  valid UTF-8 as declared in Content-Type (kept unchanged)
   0.2%  valid UTF-8 as detected/guessed (kept unchanged)
  20.8%  decoded (non-UTF-8) as declared in Content-Type
   0.4%  decoded (non-UTF-8) as detected/guessed

The 'decoded' and 'as detected/guessed' groups only occur with the setting:

  normalize_charset 1

(otherwise these parts would just have been kept as unchanged
octets / Mojibake).

Summarizing the above further yields:

  11.1%  true US-ASCII (kept unchanged)
  67.6%  is UTF-8 (kept unchanged)
  21.3%  decoded into UTF-8 (when normalize_charset is enabled)

So, 67.6% is natively UTF-8, and 88.9% of textual parts end up
as UTF-8 if normalize_charset is enabled. The remaining 11.1%
of textual mail parts are just plain ASCII text.

Interestingly (while not directly comparable), our 88.9% UTF-8 figure
corresponds closely to the 82.5% in "Usage of character encodings
for websites", January 2015:

  "UTF-8 is used by 82.5% of all the websites whose character
  encoding we know."

  http://w3techs.com/technologies/overview/character_encoding/all
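
The arithmetic behind the summary can be re-checked with a short Python sketch. It uses only the percentages quoted in the comment; the dictionary keys and variable names are made up for illustration. Note that re-adding the rounded per-group figures gives 67.7% and 21.2% rather than the reported 67.6% and 21.3%. This 0.1-point difference is presumably because the summary was computed from the raw counts (which are not given here) and rounded afterwards; that explanation is an assumption.

```python
from decimal import Decimal

# Percentages exactly as quoted in the comment (rounded to one decimal).
# The key names are illustrative only.
groups = {
    "ascii": Decimal("11.1"),
    "utf8_declared": Decimal("67.5"),
    "utf8_guessed": Decimal("0.2"),
    "decoded_declared": Decimal("20.8"),
    "decoded_guessed": Decimal("0.4"),
}

# The three-way summary as quoted in the comment.
summary = {
    "ascii": Decimal("11.1"),
    "utf8_native": Decimal("67.6"),
    "decoded": Decimal("21.3"),
}

# Both breakdowns should cover all textual parts.
total_groups = sum(groups.values())
total_summary = sum(summary.values())

# Share of parts that end up as UTF-8 with normalize_charset enabled.
utf8_after_normalize = summary["utf8_native"] + summary["decoded"]

# Re-adding the rounded per-group figures, for comparison with the summary.
utf8_from_groups = groups["utf8_declared"] + groups["utf8_guessed"]
decoded_from_groups = groups["decoded_declared"] + groups["decoded_guessed"]

print("total (groups):       ", total_groups)
print("total (summary):      ", total_summary)
print("UTF-8 after normalize:", utf8_after_normalize)
print("UTF-8 from groups:    ", utf8_from_groups)
print("decoded from groups:  ", decoded_from_groups)
```

Decimal is used instead of float so that the one-decimal figures add up exactly, without binary floating-point noise.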
