Win32Transcoder uses "best-fit" algorithm causing data loss
-----------------------------------------------------------

                 Key: XERCESC-1813
                 URL: https://issues.apache.org/jira/browse/XERCESC-1813
             Project: Xerces-C++
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 2.8.0
         Environment: Windows
            Reporter: Janusz Nykiel


Win32Transcoder implicitly uses the Windows "best-fit" algorithm when it calls 
the WinAPI WideCharToMultiByte function. When the target character set lacks a 
character, best-fit may substitute a look-alike according to built-in Windows 
code page tables whose rules are arbitrary; for example, ∞ (the infinity 
symbol) is changed to 8 (the digit) (see 
http://blogs.msdn.com/michkap/archive/2005/02/13/371895.aspx). The original 
character data is thus silently lost. The output XML can even be outright 
malformed: when the output character set is ISO-8859-2 and the input XML 
contains U+00AB («) and U+00BB (»), the double angle quotation marks, they are 
transliterated to < and >, respectively.

WideCharToMultiByte has a flag that disables the "best-fit" algorithm, 
WC_NO_BEST_FIT_CHARS, available starting with Windows 98/2000. Adding this flag 
to the WideCharToMultiByte invocations in the ::transcodeTo and 
::canTranscodeTo methods of Win32Transcoder fixes the problem.
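
To illustrate the proposed change, here is a minimal sketch of a conversion 
call that passes WC_NO_BEST_FIT_CHARS and checks whether the default character 
was used. It is not the Xerces-C++ code: the helper name encodeExactly and its 
signature are made up for this example, and the whole block is guarded with 
_WIN32 because the API exists only in the Win32 SDK.

```cpp
#ifdef _WIN32   // WideCharToMultiByte is available only in the Win32 SDK
#include <windows.h>
#include <string>

// Hypothetical helper for illustration; not part of Xerces-C++.
// Converts `text` to `codePage` and returns true only if every character
// had an exact mapping. Also returns false if the API call itself fails.
bool encodeExactly(UINT codePage, const std::wstring& text, std::string& out)
{
    out.clear();
    if (text.empty())
        return true;

    const int srcLen = static_cast<int>(text.size());

    // First call: ask how many bytes the converted string needs.
    const int needed = WideCharToMultiByte(codePage, WC_NO_BEST_FIT_CHARS,
                                           text.data(), srcLen,
                                           nullptr, 0, nullptr, nullptr);
    if (needed <= 0)
        return false;

    out.resize(static_cast<size_t>(needed));
    BOOL usedDefault = FALSE;

    // Second call: convert. With WC_NO_BEST_FIT_CHARS, a character missing
    // from the code page is replaced by the default character and reported
    // through usedDefault instead of being mapped to a look-alike.
    const int written = WideCharToMultiByte(codePage, WC_NO_BEST_FIT_CHARS,
                                            text.data(), srcLen,
                                            &out[0], needed,
                                            nullptr, &usedDefault);
    return written > 0 && usedDefault == FALSE;
}
#endif
```

With code page 28592 (ISO-8859-2) a call such as 
encodeExactly(28592, L"\u00AB", out) should return false, so the caller can 
fall back to a character reference instead of emitting <.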

The documentation for WideCharToMultiByte states:
"
For strings that require validation, such as file, resource, and user names, 
the application should always use the WC_NO_BEST_FIT_CHARS flag with 
WideCharToMultiByte. This flag prevents the function from mapping characters to 
characters that appear similar but have very different semantics. In some 
cases, the semantic change can be extreme. For example, the symbol for "∞" 
(infinity) maps to 8 (eight) in some code pages.
"


Example input XML:
<?xml version='1.0' encoding='windows-1250' ?>
<test>zażółć gęślą «jaźń»</test>

Expected output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą &#xAB;jaźń&#xBB;</test>

Actual output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą <jaźń></test>
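
The expected output above depends on each character that the target encoding 
cannot represent being written as a numeric character reference. That can only 
happen if the "can this be transcoded" check reports such characters reliably. 
Below is a minimal portable sketch of that fallback. The names 
escapeUnencodable and canEncode are hypothetical, and canEncode merely stands 
in for a check such as a corrected canTranscodeTo. It is not the Xerces-C++ 
serializer.

```cpp
#include <functional>
#include <string>

// Hypothetical helper for illustration; not the Xerces-C++ serializer.
// Replaces every code point rejected by `canEncode` with an XML hexadecimal
// character reference (&#x...;) and leaves all other code points untouched.
std::u32string escapeUnencodable(const std::u32string& text,
                                 const std::function<bool(char32_t)>& canEncode)
{
    static const char32_t hex[] = U"0123456789ABCDEF";
    std::u32string out;

    for (char32_t ch : text) {
        if (canEncode(ch)) {
            out.push_back(ch);
            continue;
        }
        out += U"&#x";
        bool started = false;
        // Emit hex digits from the most significant nibble, skipping
        // leading zeros but always writing at least one digit.
        for (int shift = 28; shift >= 0; shift -= 4) {
            const unsigned digit = (ch >> shift) & 0xF;
            if (digit != 0 || started || shift == 0) {
                out.push_back(hex[digit]);
                started = true;
            }
        }
        out.push_back(U';');
    }
    return out;
}
```

If canEncode rejects U+00AB and U+00BB, the text «a» becomes &#xAB;a&#xBB;, 
which matches the form shown in the expected output.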

