Hi everyone,

We're having a debate (in the comment section of this PR
<https://github.com/apache/commons-text/pull/310>) on the legitimacy of
unescaping semicolon-less numeric character entities in Commons-Text.

The ability to unescape such entities has long been part of the
library, via the semiColonOptional
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
option in the NumericEntityUnescaper class.

While testing this option, I discovered a small bug that allows some input
to bypass the unescaper.
A string like this: *<iframe src="&#106avascript:alert(1)">* is ignored by
the unescaper: even though the entity is a decimal one, the algorithm
scans for hexadecimal characters in all cases, so it includes the "a"
after the "6", and the resulting "106a" is not a valid decimal number.
This prompted me to fix it in this commit
<https://github.com/apache/commons-text/pull/310/commits/05280c2d474fce08bfb19cc2178949e5d384c999>
and open the PR.
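
To make the failure mode concrete, here is a minimal, self-contained sketch
of the scanning logic. It is a simplified model, not the actual
NumericEntityUnescaper code: the class EntityScanSketch, its methods, and
the hexLettersAlwaysAccepted flag are hypothetical names used only for
illustration, and -1 is an arbitrary "not decodable" marker.

```java
final class EntityScanSketch {

    /** Returns the index just past the run of accepted digits starting at 'from'. */
    static int scanDigits(String s, int from, boolean acceptHexLetters) {
        int end = from;
        while (end < s.length()) {
            char c = s.charAt(end);
            boolean decimal = c >= '0' && c <= '9';
            boolean hexLetter = (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
            if (decimal || (acceptHexLetters && hexLetter)) {
                end++;
            } else {
                break;
            }
        }
        return end;
    }

    /**
     * Decodes a semicolon-less numeric entity at the start of 's'.
     * Returns the code point, or -1 if the digits cannot be parsed.
     * hexLettersAlwaysAccepted = true models the behaviour described above
     * (a-f accepted even for decimal entities); false models the fix.
     */
    static int decode(String s, boolean hexLettersAlwaysAccepted) {
        if (!s.startsWith("&#") || s.length() < 3) {
            return -1;
        }
        int start = 2;
        boolean isHex = s.charAt(start) == 'x' || s.charAt(start) == 'X';
        if (isHex) {
            start++;
        }
        int end = scanDigits(s, start, isHex || hexLettersAlwaysAccepted);
        try {
            // "106a" in radix 10 throws, so the entity is left untouched.
            return Integer.parseInt(s.substring(start, end), isHex ? 16 : 10);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String input = "&#106avascript:alert(1)";
        System.out.println(decode(input, true));   // prints -1
        System.out.println(decode(input, false));  // prints 106, i.e. 'j'
    }
}
```

With the hex letters accepted unconditionally, the scan swallows "106a" and
the decimal parse fails; restricting a-f to hex entities yields 106 ('j'),
which is what a browser would produce for the same input.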

However, as mentioned earlier, there is a debate on the legitimacy of
unescaping semicolon-less entities in the first place.

garydgregory's point is that such entities are not part of the HTML
specification, and as such Commons-Text should not consider them.

My point, and kinow's, however, is that these semicolon-less entities are
unescaped by virtually every modern browser (tested with Chrome, Firefox,
Edge and Safari) and that users of Commons-Text could reasonably expect the
library to support them.

Also, I pointed out that my fix only makes the unescaping work correctly
with decimal entities, so in my opinion the PR shouldn't be blocked by the
debate.

What's your opinion on this?

Thanks!
Richard

https://github.com/apache/commons-text/pull/310
