[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Sat, 14 Feb 2015 05:48:07 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7133


--- Comment #15 from Mark Martinec <[email protected]> ---
> > normalize_charset ( 0 | 1) (default: 0)
> >   Whether to detect character sets and normalize message content to
> >   Unicode. Requires the Encode::Detect module, HTML::Parser version
> >   3.46 or later, and Perl 5.8.5 or later.
>
> I need to update that text a bit. The Encode::Detect is no longer a
> requirement (just optional bonus), and the result is in UTF-8 bytes,
> not Unicode characters.


Actually, I have already updated that text (r1655758, 2015-01-29,
Bug 7126); AXB was looking at an older version. The man page currently
states:

normalize_charset ( 0 | 1)        (default: 0)
  Whether to decode non-UTF-8 and non-ASCII textual parts and recode
  them to UTF-8 before the text is given over to rules processing.
  The character set used for attempted decoding is primarily based on
  a declared character set in a Content-Type header, but if the
  decoding attempt fails a module Encode::Detect::Detector is
  consulted (if available) to provide a guess based on the actual
  text, and decoding is re-attempted. Even if the option is enabled
  no unnecessary decoding and re-encoding work is done when possible
  (like with an all-ASCII text with a US-ASCII or extended ASCII
  character set declaration, e.g. UTF-8 or ISO-8859-nn or Windows-nnnn).

  Unicode support in old versions of perl or in a core module Encode
  is likely to be buggy in places, so if the normalize_charset
  function is enabled it is advised to stick to more recent versions
  of perl (preferably 5.12 or later). The module
  Encode::Detect::Detector is optional; when necessary, it will be
  used if it is available.
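
To illustrate the order of operations described in the man page text
above, here is a minimal sketch in Python. It is not the SpamAssassin
code (which is Perl), and every name in it is an assumption made up for
the illustration: normalize_to_utf8, _try_decode, the
ASCII_COMPATIBLE_PREFIXES list, and the optional detect callable that
stands in for Encode::Detect::Detector.

```python
from typing import Callable, Optional

# Hypothetical list of declarations treated as ASCII-compatible,
# loosely following the examples named in the man page text.
ASCII_COMPATIBLE_PREFIXES = ("us-ascii", "ascii", "utf-8", "iso-8859-", "windows-")


def _try_decode(data: bytes, charset: Optional[str]) -> Optional[str]:
    """Strictly decode data with charset; return None on any failure."""
    if not charset:
        return None
    try:
        return data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that are invalid in that charset.
        return None


def normalize_to_utf8(
    data: bytes,
    declared: Optional[str] = None,
    detect: Optional[Callable[[bytes], Optional[str]]] = None,
) -> bytes:
    """Return UTF-8 bytes for a textual part, or the input unchanged."""
    name = (declared or "us-ascii").lower()
    # Skip unnecessary work: all-ASCII text under an ASCII-compatible
    # declaration is already valid UTF-8.
    if data.isascii() and name.startswith(ASCII_COMPATIBLE_PREFIXES):
        return data
    # First attempt: the charset declared in the Content-Type header.
    text = _try_decode(data, declared)
    # Second attempt: ask the optional detector for a guess and retry.
    if text is None and detect is not None:
        text = _try_decode(data, detect(data))
    # If nothing worked, hand the original bytes over untouched.
    return data if text is None else text.encode("utf-8")
```

For example, normalize_to_utf8(b"caf\xe9", "utf-8", detect=lambda d:
"iso-8859-1") fails the declared UTF-8 decoding, falls back to the
detector's guess, and returns the text recoded as UTF-8 bytes. Note
that the result is bytes rather than a character string, matching the
point that the normalized output is UTF-8 bytes, not Unicode characters.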

-- 
You are receiving this mail because:
You are the assignee for the bug.
