Win32Transcoder uses "best-fit" algorithm causing data loss
-----------------------------------------------------------
Key: XERCESC-1813
URL: https://issues.apache.org/jira/browse/XERCESC-1813
Project: Xerces-C++
Issue Type: Bug
Components: Utilities
Affects Versions: 2.8.0
Environment: Windows
Reporter: Janusz Nykiel
Win32Transcoder implicitly relies on Windows' "best-fit" mapping when it calls
the WinAPI function WideCharToMultiByte. When the target character set lacks a
character, best-fit mapping may transliterate it according to rules built into
the Windows code pages, and those rules can be arbitrary: for example, ∞ (the
infinity symbol) is changed to 8 (the digit) (see
http://blogs.msdn.com/michkap/archive/2005/02/13/371895.aspx). The original
character data is then lost. The output XML can even be outright malformed:
when the output character set is ISO-8859-2 and the input XML contains U+00AB
(«) and U+00BB (»), the double angle quotation marks, they are transliterated
to < and >, respectively.
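The failure mode can be reproduced without Windows using a toy model. This is
only a sketch: kBestFit and toyTranscode are hypothetical names, the table holds
just the three mappings named in this report, and the target is assumed to be
plain 7-bit ASCII rather than a real code page.

```cpp
#include <map>
#include <string>

// Toy stand-in for a code page's best-fit table (assumption: it contains
// only the three mappings mentioned in this report).
static const std::map<char32_t, char> kBestFit = {
    {0x221E, '8'},  // infinity symbol -> digit eight
    {0x00AB, '<'},  // left double angle quotation mark -> less-than
    {0x00BB, '>'},  // right double angle quotation mark -> greater-than
};

// Hypothetical transcoder to a 7-bit target. Returns false as soon as a
// character cannot be converted; 'out' then holds only the prefix converted
// so far.
bool toyTranscode(const std::u32string& in, bool allowBestFit, std::string& out)
{
    out.clear();
    for (char32_t c : in) {
        if (c < 0x80) {
            // Representable as-is in the target.
            out.push_back(static_cast<char>(c));
            continue;
        }
        auto it = kBestFit.find(c);
        if (allowBestFit && it != kBestFit.end()) {
            // Silent substitution: this is where data is lost.
            out.push_back(it->second);
            continue;
        }
        return false;  // strict mode: report the problem to the caller
    }
    return true;
}
```

With allowBestFit set, the input <t>«x»</t> comes out as <t><x></t>, which is
no longer well-formed. In strict mode the call fails instead, so the caller can
choose another representation, such as a numeric character reference.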
WideCharToMultiByte has a flag that disables "best-fit" mapping,
WC_NO_BEST_FIT_CHARS, available starting with Windows 98/2000. Adding this flag
to the WideCharToMultiByte invocations in the ::transcodeTo and
::canTranscodeTo methods of Win32Transcoder fixes the problem.
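A minimal sketch of such a call follows. It is not the Xerces-C++ code:
strictTranscode is a hypothetical helper, and treating any use of the default
character as a failure is an assumption about the desired behaviour. Only
WideCharToMultiByte and WC_NO_BEST_FIT_CHARS come from the Windows API.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper, not part of Xerces-C++. Converts UTF-16 text to the
// given Windows code page with best-fit mapping disabled, and returns false
// if any character had to be replaced by the code page's default character.
bool strictTranscode(const std::wstring& src, unsigned codePage, std::string& out)
{
    out.clear();
#ifdef _WIN32
    if (src.empty())
        return true;

    const DWORD flags = WC_NO_BEST_FIT_CHARS;  // no look-alike substitutions
    const int srcLen = static_cast<int>(src.size());

    // First call: ask how many bytes the converted text needs.
    const int needed = WideCharToMultiByte(codePage, flags, src.data(), srcLen,
                                           nullptr, 0, nullptr, nullptr);
    if (needed <= 0)
        return false;

    // Second call: convert, and learn whether the default character was used.
    std::string buffer(static_cast<std::size_t>(needed), '\0');
    BOOL usedDefault = FALSE;
    const int written = WideCharToMultiByte(codePage, flags, src.data(), srcLen,
                                            &buffer[0], needed,
                                            nullptr, &usedDefault);
    if (written != needed || usedDefault)
        return false;  // at least one character is not representable

    out.swap(buffer);
    return true;
#else
    // WideCharToMultiByte exists only on Windows; this sketch reports
    // failure on every other platform.
    (void)src;
    (void)codePage;
    return false;
#endif
}
```

On Windows, calling it with code page 28592 (ISO-8859-2) and a string
containing U+00AB is expected to return false rather than produce a '<'.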
The documentation for WideCharToMultiByte states:
"
For strings that require validation, such as file, resource, and user names,
the application should always use the WC_NO_BEST_FIT_CHARS flag with
WideCharToMultiByte. This flag prevents the function from mapping characters to
characters that appear similar but have very different semantics. In some
cases, the semantic change can be extreme. For example, the symbol for "∞"
(infinity) maps to 8 (eight) in some code pages.
"
Example input XML:
<?xml version='1.0' encoding='windows-1250' ?>
<test>zażółć gęślą «jaźń»</test>
Expected output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą «jaźń»</test>
Actual output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą <jaźń></test>