[ https://issues.apache.org/jira/browse/XALANJ-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805301#comment-17805301 ]

Joe Kesselman edited comment on XALANJ-2560 at 1/10/24 9:54 PM:
----------------------------------------------------------------

CAVEAT: This is getting into spec lawyer territory for XML and HTML, and it's 
been quite a while since I've dug down to that level. The time was when I could 
answer it offhand, but I'm still swapping all of this back into wetware.

Checking a few Unicode representation converters: Unicode character 128187 is, 
in hex, 1F4BB (U+1F4BB).

In UTF-16, that should indeed be encoded as the surrogate pair D83D DCBB. In 
UTF-8 it's the bytes F0 9F 92 BB.
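
For reference, a quick JDK-only sanity check of those numbers (nothing 
Xalan-specific here; the class name is just for illustration):

{code:java}
// Verifies the UTF-16 and UTF-8 forms of U+1F4BB using only the JDK.
public class CodePointDemo {
    public static void main(String[] args) throws Exception {
        int cp = 0x1F4BB;                      // decimal 128187
        char[] utf16 = Character.toChars(cp);
        System.out.printf("UTF-16: %04X %04X%n", (int) utf16[0], (int) utf16[1]); // D83D DCBB

        byte[] utf8 = new String(utf16).getBytes("UTF-8");
        StringBuilder hex = new StringBuilder("UTF-8:  ");
        for (byte b : utf8) hex.append(String.format("%02X ", b));
        System.out.println(hex);               // F0 9F 92 BB

        String s = new String(utf16);
        System.out.println(s.length());                       // 2 Java chars (code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
    }
}
{code}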

Of course, if you're outputting XML or HTML, the numeric character reference 
(NCR) &#x1F4BB; should indeed be equivalent to the literal character. The 
question is how that character should be flowing through the system, right?

Internally, Xalan uses Java characters, which are UTF-16. So its representation 
of this glyph is indeed the surrogate pair. On input, that conversion gets done 
while the stream is being read, based on which encoding that stream has been 
told to expect.

I believe Java input streams normally assume UTF-8 unless told otherwise, which 
as noted above requires a different sequence of bytes than UTF-16 would. And 
they handle this conversion to the internal characters for us before Xalan 
itself ever sees the data. So it's clear what the behavior should be for raw 
bytes if you know the encoding. If you don't know the encoding but know that 
it's UTF-something, I think Unicode was *supposed* to be designed 
such that the first byte indicated how to read the following bytes; I'd have to 
check that but it would nail down the bytestream either way.
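
For illustration, here's how that decode step can be pinned down explicitly in 
plain Java (this isn't Xalan's actual input path; the explicit charset argument 
is the point):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadUtf8 {
    public static void main(String[] args) throws Exception {
        // U+1F4BB as UTF-8 bytes, decoded with an explicitly named encoding.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x92, (byte) 0xBB};
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(utf8),
                                              StandardCharsets.UTF_8)) {
            // The Reader delivers UTF-16 code units: the surrogate pair.
            System.out.printf("%04X %04X%n", r.read(), r.read()); // D83D DCBB
        }
    }
}
{code}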

The question becomes one of whether NCRs expressing a surrogate sequence are 
expected to be converted by an XML or HTML parser. That would have to happen 
*after* the stream was read and the references were expanded.

I _suspect_ the intended answer is no, and that we should instead either be 
outputting raw bytes as appropriate for the encoding, assuming the encoding is 
UTF-* and can handle that character this way, or – for XML and HTML 
*only*, since they're the only ones who have defined NCRs – a single NCR 
which expresses the final value and leaves the question of appropriate internal 
encoding/representation for the receiving application to figure out.
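
The raw-bytes path is essentially what a correctly configured Writer already 
does for us. For example (plain JDK, purely illustrative):

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class RawBytesOut {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            // Write the surrogate pair as ordinary Java chars.
            w.write(new String(Character.toChars(0x1F4BB)));
        }
        // The encoder recombines the pair and emits the 4-byte UTF-8 sequence.
        for (byte b : bytes.toByteArray()) System.out.printf("%02X ", b); // F0 9F 92 BB
        System.out.println();
    }
}
{code}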

In other words, if we're going to write it out as a numeric character reference 
(so it survives passing through non-Unicode layers), I think you're right that 
sending it as &#x1F4BB; is certainly safer than sending it as NCRs for each 
of the code units. A code unit is not a character.
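
To make the difference concrete, a small sketch (the class name is made up; 
this is not what any Xalan serializer class actually emits):

{code:java}
// Builds both candidate escapings of U+1F4BB so the difference is visible.
public class NcrForms {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F4BB));

        // Per UTF-16 code unit: two references that each name a surrogate,
        // which is not a legal XML character.
        StringBuilder perUnit = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            perUnit.append(String.format("&#x%X;", (int) s.charAt(i)));
        }
        System.out.println(perUnit);   // &#xD83D;&#xDCBB;

        // Per code point: one reference naming the actual character.
        StringBuilder perPoint = new StringBuilder();
        s.codePoints().forEach(cp -> perPoint.append(String.format("&#x%X;", cp)));
        System.out.println(perPoint);  // &#x1F4BB;
    }
}
{code}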

This would add some cost to serialization, since now someone has to recognize 
that a sequence of Java chars may contain things which even UTF-16 has to 
represent with a surrogate pair, and convert those pairs from UTF-16 back to a 
single character number. That means checking every Java char for whether it 
introduces a surrogate pair, and sending it and the low surrogate that follows 
through an alternate serialization path.
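
Roughly, that check could look like this (a minimal sketch under the 
assumptions above; the class and method names are made up, and it ignores the 
serializers' other escaping rules):

{code:java}
import java.io.IOException;
import java.io.Writer;

public class SupplementaryEscaper {
    // Emits a single NCR for each surrogate pair; other chars pass through.
    static void writeEscaped(String chars, Writer out) throws IOException {
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < chars.length()
                    && Character.isLowSurrogate(chars.charAt(i + 1))) {
                // Recombine the pair into one code point and emit one reference.
                int cp = Character.toCodePoint(c, chars.charAt(i + 1));
                out.write(String.format("&#x%X;", cp));
                i++; // the low surrogate has been consumed
            } else {
                out.write(c); // normal path; existing NCR/escaping rules apply here
            }
        }
    }
}
{code}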

Which may not be all that bad. The HTML and XML serializers already need to 
check if characters have to be output as NCRs. I believe that would already 
recognize the first surrogate, and the additional work would apply only after 
that's determined to be true. So 99.44% of the time there should be no new cost.

Conclusion: The current behavior probably _is_ a bug, and the suggested 
replacement behavior appears to be appropriate.


> ToXMLStream does not support unicode supplementary characters
> -------------------------------------------------------------
>
>                 Key: XALANJ-2560
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2560
>             Project: XalanJ2
>          Issue Type: Bug
>      Security Level: No security risk; visible to anyone(Ordinary problems in 
> Xalan projects.  Anybody can view the issue.) 
>          Components: Serialization
>    Affects Versions: 2.7.1
>         Environment: Xalan 2.7.1 serializer.
> Tested on Ubuntu 12.04 with Oracle JDK 1.7.0_05.
>            Reporter: Damien Guillaume
>            Assignee: Joe Kesselman
>            Priority: Major
>              Labels: serialization, unicode
>
> org.apache.xml.serializer.ToXMLStream (which extends ToStream) does not 
> support serialization of unicode supplementary characters such as U+1D49C. It 
> creates invalid character entities like "&#55349;&#56476;" instead of 
> "&#119964;" (or F0 9D 92 9C with UTF-8). ToXMLStream is used by LSSerializer 
> when Xalan's serializer is on the classpath.
> org.apache.xml.serialize.DOMSerializerImpl (included in Xerces) does not have 
> this problem, but it is deprecated since Xerces 2.9.0, so this is a 
> regression.
> See 
> http://stackoverflow.com/questions/11952289/serializing-supplementary-unicode-characters-into-xml-documents-with-java
>  for more details.


