[ https://issues.apache.org/jira/browse/LANG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benedikt Ritter updated LANG-955:
---------------------------------
Summary: Add methods for removing all invalid characters according to XML
1.0 and XML 1.1 in an input string to StringEscapeUtils (was:
StringEscapeUtils.escapeXml doesn't remove invalid characters)
> Add methods for removing all invalid characters according to XML 1.0 and XML
> 1.1 in an input string to StringEscapeUtils
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: LANG-955
> URL: https://issues.apache.org/jira/browse/LANG-955
> Project: Commons Lang
> Issue Type: Improvement
> Components: lang.*
> Affects Versions: 3.1
> Environment: Ubuntu 13.10
> Reporter: Adam Hooper
> Assignee: Benedikt Ritter
> Labels: xml
> Fix For: 3.3, Review Patch
>
>
> escapeXml lets non-text characters pass through into XML files:
> {code}
> scala> org.apache.commons.lang3.StringEscapeUtils.escapeXml("\u0004").codePointAt(0)
> res4: Int = 4
> {code}
> I would expect the result to be an exception -- either from StringEscapeUtils
> (refusing to encode it) or, preferably, from String.codePointAt, complaining
> that the string is empty. \u0004 is not a valid character in XML 1.0, and
> there is no way to represent it in an XML document -- not even by escaping it.
> Wikipedia summarizes the characters that are not allowed in XML -- even after
> escaping: http://en.wikipedia.org/wiki/Valid_characters_in_XML. The reason
> for disallowing them: XML is a text interchange format, and control
> characters are not text.
> If StringEscapeUtils.escapeXml allows invalid XML characters through --
> whether escaped or not -- it generates invalid XML. Valid XML parsers will
> refuse to read such files.
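
For illustration, here is a minimal sketch of the kind of filtering the new summary asks for, restricted to XML 1.0. It is not the Commons Lang API or the attached patch: the class name XmlCharFilter and the methods isValidXml10 and stripInvalidXml10 are hypothetical names chosen for this example. The sketch assumes that invalid characters should be dropped silently rather than rejected with an exception, and it checks each code point against the Char production of the XML 1.0 specification.

```java
// Hypothetical helper, not part of Commons Lang: drops every code point
// that the XML 1.0 "Char" production does not allow.
final class XmlCharFilter {

    private XmlCharFilter() {
    }

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF]
    //               | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int codePoint) {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns a copy of the input with all invalid code points removed.
    // Iterates by code point so that valid surrogate pairs are kept,
    // while unpaired surrogates (which are not valid Chars) are dropped.
    static String stripInvalidXml10(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (isValidXml10(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0004 is not a valid XML 1.0 character, so it is removed.
        System.out.println(stripInvalidXml10("a\u0004b")); // prints "ab"
    }
}
```

With a filter like this applied before escaping, the input from the report ("\u0004") would become an empty string, so the control character would never reach the XML output. An XML 1.1 variant would need a different range check, since XML 1.1 permits most control characters when written as character references.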