Johannes Willnecker created XERCESC-2158:

             Summary: XMLUTF8Transcoder: One multibyte UTF8 character is 
swallowed from the srcData when the resulting surrogate pair does not fit in 
toFill at the end
                 Key: XERCESC-2158
             Project: Xerces-C++
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 3.2.2, 3.1.4
         Environment: OS independent: Linux (RedHat 7.5)/Windows 10

Compiler independent
            Reporter: Johannes Willnecker
         Attachments: UTF8.xml, xerces.patch

*Bug found in Xerces-C++ Version 3.1.4* (based on code review, newer 
versions are also affected)


*How to reproduce:* Call SAX2Print for the attached UTF8.xml file "SAX2Print 
One Chinese character is missing from the name attribute of the 
second-to-last Instance element.

*Fix:* The fix for this bug is included in the xerces.patch file.
In XMLUTF8Transcoder.cpp a check for this case was already included, but the 
assumption that the bytes read are only updated at the end of the loop was wrong.
The bytes read (bytesEaten) calculation is based on srcPtr, which has 
already been advanced by the time the check is made.
Therefore srcPtr needs to be repositioned (moved back to the start of the 
current character) when the surrogate pair does not fit into the toFill buffer.
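
The repositioning described above can be illustrated with a minimal, 
self-contained sketch. This is not the Xerces-C++ code and not the content of 
xerces.patch: the function utf8ToUtf16 and the variable charStart are 
hypothetical names chosen for illustration, and the sketch assumes well-formed 
UTF-8 input. It only shows the idea of rewinding srcPtr before bytesEaten is 
computed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for a transcodeFrom-style loop.
// Assumes well-formed UTF-8. Returns the number of UTF-16 units written and
// reports the number of source bytes consumed through bytesEaten.
std::size_t utf8ToUtf16(const unsigned char* src, std::size_t srcCount,
                        char16_t* toFill, std::size_t maxChars,
                        std::size_t& bytesEaten)
{
    const unsigned char* srcPtr = src;
    const unsigned char* srcEnd = src + srcCount;
    char16_t* outPtr = toFill;
    char16_t* outEnd = toFill + maxChars;

    while (srcPtr < srcEnd && outPtr < outEnd) {
        const unsigned char* charStart = srcPtr;  // start of this character
        const unsigned char first = *srcPtr;
        const std::size_t len =
            first < 0x80 ? 1 : first < 0xE0 ? 2 : first < 0xF0 ? 3 : 4;
        if (static_cast<std::size_t>(srcEnd - srcPtr) < len)
            break;                                // incomplete sequence

        // Decode the code point; srcPtr is advanced past the character here.
        std::uint32_t cp = first;
        if (len == 2) cp = first & 0x1F;
        if (len == 3) cp = first & 0x0F;
        if (len == 4) cp = first & 0x07;
        ++srcPtr;
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (*srcPtr++ & 0x3F);

        if (cp >= 0x10000) {
            // A surrogate pair needs two free slots in toFill.
            if (outPtr + 1 >= outEnd) {
                srcPtr = charStart;  // rewind so the character is not swallowed
                break;
            }
            cp -= 0x10000;
            *outPtr++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            *outPtr++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            *outPtr++ = static_cast<char16_t>(cp);
        }
    }

    // bytesEaten is derived from srcPtr, so srcPtr must not point past a
    // character that was never written to toFill.
    bytesEaten = static_cast<std::size_t>(srcPtr - src);
    return static_cast<std::size_t>(outPtr - toFill);
}
```

Without the rewind, bytesEaten would include the bytes of the character whose 
surrogate pair did not fit, and the caller would continue after it on the next 
call, which matches the missing character seen in the SAX2Print output.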


*Contributor related:*

Author Name of the code being contributed: Johannes Willnecker

Employer: Siemens AG

I have the right to grant the copyright licenses for the contribution.

My employer has rights to the code that I have written. My employer gave me 
permission to contribute this code on its behalf.

I am not aware of any third-party license or other restrictions.
