[ 
https://issues.apache.org/jira/browse/XERCESC-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322064#comment-16322064
 ] 

Andreas Krantz commented on XERCESC-1854:
-----------------------------------------

The method {{DOMLSSerializerImpl::ensureValidString}} was introduced as a fix in 
3.2.0, but it rests on a wrong assumption.
With the new implementation, the surrogate range xD800 - xDFFF is marked 
as invalid for XMLCh, which is UTF-16.
The problem is that the valid Unicode range x10000 - x10FFFF is encoded 
in UTF-16 using exactly those code units.

x10FFFF becomes xDBFF,xDFFF
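
The mapping above can be reproduced with the standard UTF-16 encoding arithmetic. The following is only a minimal sketch; {{toSurrogatePair}} is a hypothetical helper written for illustration and is not part of Xerces-C++.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper (not a Xerces-C++ function): splits a supplementary
// code point in the range U+10000..U+10FFFF into its UTF-16 surrogate pair.
// The caller is assumed to pass a code point inside that range.
inline std::pair<char16_t, char16_t> toSurrogatePair(std::uint32_t codePoint)
{
    const std::uint32_t offset = codePoint - 0x10000u;              // 20-bit value
    const char16_t leading  = static_cast<char16_t>(0xD800u + (offset >> 10));
    const char16_t trailing = static_cast<char16_t>(0xDC00u + (offset & 0x3FFu));
    return std::make_pair(leading, trailing);
}
```

Every code point above xFFFF therefore produces one code unit in xD800 - xDBFF followed by one in xDC00 - xDFFF, which is why rejecting that whole range rejects valid characters.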

The reader code handles surrogates correctly, but it is now no longer 
possible to serialize the DOM that was read, e.g.:
const std::u16string xmlString{ u"<?xml version=\"1.0\" encoding=\"UTF-16\" standalone=\"yes\" ?><root>\U0010FFFF</root>" };

This potentially breaks our format if we change to 3.2.0.

I am not sure whether it is possible to reopen this issue for a fix in 3.2.1?

A closer look at 
{{inline bool XMLChar1_0::isXMLChar(const XMLCh toCheck, const XMLCh toCheck2)}}

shows that it takes two parameters in order to handle surrogates. However, 
ensureValidString must detect the leading surrogate itself and act on it, 
i.e. check it together with the code unit that follows.
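
To illustrate the idea, here is a rough, self-contained sketch of a surrogate-aware check. It does not use the Xerces-C++ sources: {{firstInvalidXmlChar}} and the small helpers are hypothetical names, {{char16_t}} stands in for XMLCh, and the character ranges are taken from the XML 1.0 Char production. An actual fix would presumably call the two-parameter {{isXMLChar}} instead.

```cpp
#include <cstddef>
#include <string>

// All names below are hypothetical; this is not the Xerces-C++ implementation.
inline bool isLeadingSurrogate(char16_t c)  { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isTrailingSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// XML 1.0 Char production restricted to single (non-surrogate) code units.
inline bool isBmpXmlChar(char16_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD);
}

// Returns the index of the first invalid code unit,
// or std::u16string::npos if the whole string is valid XML 1.0 content.
inline std::size_t firstInvalidXmlChar(const std::u16string& text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t c = text[i];
        if (isLeadingSurrogate(c)) {
            // A leading surrogate is valid only when a trailing one follows;
            // together they encode a code point in x10000 - x10FFFF.
            if (i + 1 < text.size() && isTrailingSurrogate(text[i + 1])) {
                ++i; // consume the trailing surrogate as well
                continue;
            }
            return i; // unpaired leading surrogate
        }
        if (!isBmpXmlChar(c)) {
            return i; // control character, xFFFE/xFFFF, or lone trailing surrogate
        }
    }
    return std::u16string::npos;
}
```

With a check of this shape, the {{<root>\U0010FFFF</root>}} example passes, while the 0x04 character from the original report and unpaired surrogates are still rejected.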

> Serialization does not detect invalid XML characters
> ----------------------------------------------------
>
>                 Key: XERCESC-1854
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1854
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: DOM
>    Affects Versions: 3.0.1
>            Reporter: Boris Kolpackov
>            Assignee: Alberto Massari
>             Fix For: 3.2.0
>
>         Attachments: test.cxx
>
>
> The attached test case serializes an invalid XML 1.0 document that contains a 
> character with value 0x04. See http://www.w3.org/TR/REC-xml/#NT-Char for the 
> list of valid characters in an XML 1.0 document.
> I've done some digging and it seems that XMLFormatter should check for this. 
> In fact, there is already code for XML 1.1 that checks for these control 
> characters since they need to be escaped in 1.1. It looks like we need to 
> check for invalid characters when in the 1.0 mode. There is the 
> XMLChar1_0::isXMLChar() function which can presumably be used.


