[ 
https://issues.apache.org/jira/browse/UIMA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644477#comment-13644477
 ] 

Marshall Schor commented on UIMA-2849:
--------------------------------------

http://www.w3.org/International/questions/qa-controls describes which 
character codes are legal in XML; the legal set depends on which version of 
XML (1.0 or 1.1) is being used.  There are issues with moving to 1.1.  See 
https://issues.apache.org/jira/browse/UIMA-387 for a long discussion of how the 
current design evolved.
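
For reference, the difference between the two versions can be written down 
as two range checks.  This is only a sketch based on the Char productions in 
the XML 1.0 and XML 1.1 specifications; the class and method names are made 
up for illustration and are not part of UIMA.

```java
// Hypothetical helper, not UIMA code: legal character ranges per XML version.
final class XmlCharRanges {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the code point matches the Char production of XML 1.1. */
    static boolean isLegalXml11(int cp) {
        // XML 1.1 admits the C0 controls 0x1-0x1F, but never 0x0.
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

The 0x1c character from the reported exception fails the first check and 
passes the second.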

Silently removing illegal characters (replacing them with blanks, or deleting 
them) was considered previously, but it was felt better to alert the user to 
the problem, because silent changes of this kind could cause errors downstream 
in users' code.
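
Users who do want the replacement behavior can apply it explicitly in their 
own code before the strings are stored in the CAS.  A minimal sketch, assuming 
XML 1.0 output and that replacing each bad character with a blank is 
acceptable for the application; XmlSanitizer and its methods are hypothetical 
names, not a UIMA API.

```java
// Hypothetical user-side helper, not UIMA code.
final class XmlSanitizer {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of s with every illegal XML 1.0 character replaced by a blank. */
    static String replaceIllegalXml10(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // an unpaired surrogate comes back as itself
            out.appendCodePoint(isLegalXml10(cp) ? cp : ' ');
            i += Character.charCount(cp);       // 2 for a valid surrogate pair, else 1
        }
        return out.toString();
    }
}
```

Replacing rather than deleting keeps the string length the same as long as 
the bad characters are single UTF-16 units, so annotation offsets into the 
text stay valid in that case.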

Escaping these characters when XML 1.1 is in use is not a complete fix, since 
the 0x00 character cannot be escaped.
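
To illustrate the gap: under XML 1.1 the restricted control characters can be 
written as numeric character references, but 0x00 has no legal 
representation at all.  The following is a sketch based on the RestrictedChar 
production of the XML 1.1 specification; Xml11Escaper is a hypothetical name, 
and the usual escaping of '&' and '<' is left out to keep it short.

```java
// Hypothetical helper, not UIMA code. Handles only the control-character case.
final class Xml11Escaper {

    /** RestrictedChar in XML 1.1: legal only when written as a character reference. */
    static boolean isRestrictedXml11(int cp) {
        return (cp >= 0x1 && cp <= 0x8) || cp == 0xB || cp == 0xC
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    /** Escapes restricted characters as &#x..; and rejects 0x0. */
    static String escapeControlsXml11(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == 0x0) {
                // No escape exists for this character, even in XML 1.1.
                throw new IllegalArgumentException(
                    "U+0000 at offset " + i + " cannot be represented in XML 1.1");
            } else if (isRestrictedXml11(cp)) {
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

So even with 1.1, a string containing 0x00 still has to be rejected or 
altered.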

Checking all string values for invalid XML 1.x character encodings, at creation 
time, seems expensive.

One possible improvement would be to issue a better error message when the bad 
character is found, to help users locate the source of the bad character.
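
As a rough idea of what such a message could contain, the sketch below 
reports the character code, its offset, and a little of the preceding text. 
It is not the existing XMLSerializer code; XmlCharLocator, its methods, and 
the "source" label (for example a feature name) are assumptions made for 
illustration.

```java
// Hypothetical helper, not UIMA code.
final class XmlCharLocator {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Offset (in UTF-16 units) of the first illegal XML 1.0 character, or -1 if none. */
    static int indexOfIllegalXml10(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isLegalXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Builds a diagnostic message, or returns null if the string is clean. */
    static String describe(String source, String s) {
        int i = indexOfIllegalXml10(s);
        if (i < 0) {
            return null;
        }
        int from = Math.max(0, i - 10);   // show up to 10 preceding UTF-16 units as context
        return String.format(
            "Illegal XML 1.0 character 0x%x at offset %d of %s (preceding text: \"%s\")",
            s.codePointAt(i), i, source, s.substring(from, i));
    }
}
```

A message of this form tells the user which string to look at and where, 
rather than only the character code.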
                
> XMLSerializer is not robust to ascii control characters 
> --------------------------------------------------------
>
>                 Key: UIMA-2849
>                 URL: https://issues.apache.org/jira/browse/UIMA-2849
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.4.0SDK
>            Reporter: Matthew Hatem
>
> If any strings in the CAS contain an ascii control character the 
> XMLSerializer fails with exception below.  XMLSerializer appears to be 
> escaping other invalid XML characters like '&' and '<'.  Perhaps it would be 
> appropriate to remove control characters (or escape these characters as well 
> in the case of XML 1.1).
> Workaround is to ensure all strings stored in the CAS do not contain ascii 
> control characters.  
> org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: , 
> 0x1c
>       at 
> org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:254)
>       at 
> org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:174)
>       at 
> org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.startElement(XmiCasSerializer.java:1003)
>       at 
> org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:755)
>       at 
> org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
>       at 
> org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
>       at 
> org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
>       at 
> org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1516)
>       at 
> org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1496)
>       at bugs.UimaXMIBug.writeXmi(UimaXMIBug.java:68)
>       at bugs.UimaXMIBug.main(UimaXMIBug.java:38)
