[
https://issues.apache.org/jira/browse/LANG-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164093#comment-14164093
]
Robert Sussland commented on LANG-1042:
---------------------------------------
I was not expecting this data to be safe for arbitrary injection into HTML.
String escaping has a well-defined meaning -- the output of this function
should not be able to break out of a string data context, because all
characters that the HTML parser could interpret as closing the string data
context are escaped.
This is exactly the string escaping behavior of other methods in this package,
and it is what is commonly known as string escaping.
cf.
https://tomcat.apache.org/taglibs/standard/apidocs/org/apache/taglibs/standard/util/EscapeXML.html
In terms of which characters need to be escaped, HTML as well as XML allow
string data in only two places: attribute values and text nodes. The control
characters that denote the start/end of attribute values and text nodes are
well-defined and finite: single/double quotes for attribute values, and the
angle brackets < and > for text nodes. This assumes that the template as a
whole is valid HTML.
Additionally, the escaping symbol itself (&) should also be escaped, so that
the mapping is invertible and an unescaping method is possible.
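
To make the character set above concrete, here is a minimal sketch of such an
escaper and its inverse. It is not the Commons Lang implementation; the class
and method names (MinimalHtmlEscaper, escape, unescape) are hypothetical, and
the choice of &#39; for the single quote is an assumption taken from the
numeric character reference in the HTML 4 text quoted in the issue.

```java
// Hypothetical illustration only; not part of Commons Lang.
final class MinimalHtmlEscaper {
    private MinimalHtmlEscaper() {}

    // Escapes the escape symbol (&), both quote characters, and the angle brackets.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // assumed encoding for '
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Inverse of escape() for strings produced by escape().
    // &amp; is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
    static String unescape(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String original = "Tom & Jerry's <b>\"show\"</b>";
        String escaped = escape(original);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(original));
    }
}
```

Because & is escaped along with the other four characters, every & in the
output begins one of the five known references, which is what makes the
round trip back to the original string possible.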
Finally, there is little value in a method that performs only HTML entity
encoding -- unless you are building an HTML entity encoding demonstration
method. The list of HTML entities was selected as a convenience so that HTML
developers would not need to memorize ASCII/Unicode values for commonly used
symbols such as e-accent, the less-than sign, or the euro sign. The list of
HTML entities is not the list of HTML control characters and is not relevant
for an HTML string escaping method.
> StringEscapeUtils.escapeHtml() does not escape single quote
> -----------------------------------------------------------
>
> Key: LANG-1042
> URL: https://issues.apache.org/jira/browse/LANG-1042
> Project: Commons Lang
> Issue Type: Bug
> Reporter: Robert Sussland
> Priority: Critical
>
> The String Escape Utils should ensure that encoded data cannot escape from a
> string. However, in HTML (starting with 1.0 and until the present), attribute
> values may be delimited by either single or double quotes. Therefore single
> quotes need to be escaped just as much as double quotes.
> From the standard: http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
> {quote}
> By default, SGML requires that all attribute values be delimited using either
> double quotation marks (ASCII decimal 34) or single quotation marks (ASCII
> decimal 39). Single quote marks can be included within the attribute value
> when the value is delimited by double quote marks, and vice versa. Authors
> may also use numeric character references to represent double quotes
> (&#34;) and single quotes (&#39;). For double quotes authors can
> also use the character entity reference &quot;.
> {quote}
> Note that there have been several bugs in the wild in which string encoders
> use this library under the hood, and as a result fail to properly escape html
> attributes in which user input is stored:
> <div title='<%=user_data%>'>Howdy</div>
> if user_data = ' onclick='payload' '
> then an attacker can inject their code into the page even if the developer is
> using the string escape utils to escape the user string.
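
The injection described above can be reproduced with a small sketch. The
names here (AttributeInjectionDemo, escapeWithoutApostrophe,
escapeWithApostrophe, render) are hypothetical; escapeWithoutApostrophe is
only a stand-in for an escaper that leaves the single quote alone, not the
library code itself, and the payload is a slightly trimmed variant of the one
in the issue so that the resulting markup is easy to read.

```java
// Hypothetical demonstration; none of these methods belong to Commons Lang.
final class AttributeInjectionDemo {
    private AttributeInjectionDemo() {}

    // Stand-in for an escaper that handles &, <, > and " but not '.
    static String escapeWithoutApostrophe(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Same as above, plus the single quote as a numeric character reference.
    static String escapeWithApostrophe(String s) {
        return escapeWithoutApostrophe(s).replace("'", "&#39;");
    }

    // Mirrors the template from the issue: <div title='<%=user_data%>'>Howdy</div>
    static String render(String escapedUserData) {
        return "<div title='" + escapedUserData + "'>Howdy</div>";
    }

    public static void main(String[] args) {
        String userData = "' onclick='payload"; // simplified attacker input

        // The single quote survives, closes the title attribute, and the
        // rest of the input is parsed as a new onclick attribute.
        System.out.println(render(escapeWithoutApostrophe(userData)));

        // With the single quote escaped, the input stays inside title.
        System.out.println(render(escapeWithApostrophe(userData)));
    }
}
```

The first line printed contains an attacker-controlled onclick attribute; the
second keeps the whole input inside the title attribute value.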
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)