[jira] [Commented] (LUCENE-5191) SimpleHTMLEncoder in Highlighter module breaks Unicode outside BMP

Uwe Schindler (JIRA) Thu, 29 Aug 2013 02:27:48 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753455#comment-13753455 ]


Uwe Schindler commented on LUCENE-5191:
---------------------------------------

We have a variant of this code, recently added by Robert Muir into 
PostingsHighlighter's DefaultPassageFormatter.

This code escapes a few more characters, with a reference to OWASP: 
[https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.231_-_HTML_Escape_Before_Inserting_Untrusted_Data_into_HTML_Element_Content]
 and 
[https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.232_-_Attribute_Escape_Before_Inserting_Untrusted_Data_into_HTML_Common_Attributes]

The code used here escapes any chars >127 and <255 according to the second 
rule, which is not needed here, because the escaped data is not inserted into 
HTML attributes that may be "unquoted". So only the first rule applies here, 
for which it is enough to escape the 4 well-known characters plus the 
forward slash and the single quote ('). The latter two do not need to be 
escaped when used in text, but for safety we could include them.

In any case I would like to unify the different approaches of HTML escaping. As 
we are not working in unquoted attributes (we just encode floating HTML text), 
I would use Robert's code without the extra numeric escapes.
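
A minimal sketch of what escaping under the first rule alone could look like. The class and method names are hypothetical; this is neither the attached patch nor the code in DefaultPassageFormatter. It assumes the output goes into HTML element content only, and that the page declares a correct charset so everything above 127 can pass through untouched:

```java
// Hypothetical sketch, not Lucene's SimpleHTMLEncoder or DefaultPassageFormatter.
final class MinimalHtmlTextEscaper {
    private MinimalHtmlTextEscaper() {}

    /** Escapes text destined for HTML element content (OWASP rule #1 only). */
    static String escape(String plain) {
        StringBuilder out = new StringBuilder(plain.length() + 16);
        for (int i = 0; i < plain.length(); i++) {
            char ch = plain.charAt(i);
            switch (ch) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                // Not strictly required in text content, included for safety.
                case '\'': out.append("&#x27;"); break;
                case '/':  out.append("&#x2F;"); break;
                // Everything else, including both halves of a surrogate pair,
                // is copied unchanged, so supplementary characters stay intact.
                default:   out.append(ch);
            }
        }
        return out.toString();
    }
}
```

With this sketch, {{MinimalHtmlTextEscaper.escape("a < b & c")}} yields {{a &lt; b &amp; c}}, and a string containing U+10400 comes back unchanged.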

The official HTML4 spec says (I used HTML4, but the passage is the same in 
other HTML versions, see [http://www.w3.org/TR/REC-html40/charset.html#h-5.3.2]): 

{quote}
Four character entity references deserve special mention since they are 
frequently used to escape special characters:

"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.
Authors wishing to put the "<" character in text should use "&lt;" (ASCII 
decimal 60) to avoid possible confusion with the beginning of a tag (start tag 
open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in 
text instead of ">" to avoid problems with older user agents that incorrectly 
perceive this as the end of a tag (tag close delimiter) when it appears in 
quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion 
with the beginning of a character reference (entity reference open delimiter). 
Authors should also use "&amp;" in attribute values since character references 
are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of 
the double quote mark (") since that character may be used to delimit attribute 
values.
{quote}

Any comments?
                
> SimpleHTMLEncoder in Highlighter module breaks Unicode outside BMP
> ------------------------------------------------------------------
>
>                 Key: LUCENE-5191
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5191
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 5.0, 4.5
>
>         Attachments: LUCENE-5191.patch
>
>
> The highlighter provides a function to escape HTML, which does too much. To 
> create valid HTML only ", <, >, & must be escaped, everything else can be 
> kept unescaped. Unfortunately the escaper additionally escapes every char 
> above 127, which is unneeded if your web site has the correct encoding. It 
> also produces huge amounts of HTML entities if used with eastern languages.
> This would not be a bug if the escaping were correct, but it isn't; it 
> escapes like this:
> {{result.append("\&#").append((int)ch).append(";");}}
> So it escapes not (as HTML needs) the unicode codepoint, instead it escapes 
> the UTF-16 char, which is incorrect, e.g. for our all-time favourite Deseret:
> U+10400 (deseret capital letter long i) would be escaped as 
> {{&\#55297;&\#56320;}} and not as {{&\#66560;}}.
> So we should remove the stupid encoding of chars > 127 which is simply 
> useless :-)
> See also: https://github.com/elasticsearch/elasticsearch/issues/3587
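
To make the surrogate problem from the issue description concrete, here is a small sketch that contrasts escaping per UTF-16 char with escaping per code point. The class and method names are hypothetical and only serve as illustration; the first method mimics the {{(int)ch}} pattern quoted above, the second uses {{String.codePointAt}}:

```java
// Hypothetical demo class, not part of Lucene.
final class NumericEscapeDemo {
    private NumericEscapeDemo() {}

    /** Broken variant: emits one numeric reference per UTF-16 code unit. */
    static String escapePerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch > 127) {
                sb.append("&#").append((int) ch).append(';');
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    /** Correct variant: emits one numeric reference per Unicode code point. */
    static String escapePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 127) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            // Advance by 2 chars for supplementary code points, else by 1.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For the input {{"\uD801\uDC00"}} (U+10400), the per-char variant produces two references to the surrogate halves, while the per-code-point variant produces the single reference for 66560.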

