I checked a few Unicode representation converters for Unicode character 128187, hex 1F4BB.
In UTF-16, that should indeed be encoded as the surrogate pair D83D DCBB. In UTF-8 it's the bytes F0 9F 92 BB.
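Both of those encodings are easy to confirm from Java itself. A minimal sketch (the class name `SurrogateDemo` is just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) {
        int cp = 0x1F4BB; // U+1F4BB, decimal 128187
        // Character.toChars yields the UTF-16 code units -- the surrogate pair
        char[] units = Character.toChars(cp);
        System.out.printf("UTF-16 units: %04X %04X%n", (int) units[0], (int) units[1]);
        // Encoding the same character as UTF-8 yields four bytes
        byte[] utf8 = new String(units).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder("UTF-8 bytes:");
        for (byte b : utf8) sb.append(String.format(" %02X", b & 0xFF));
        System.out.println(sb); // UTF-8 bytes: F0 9F 92 BB
    }
}
```

This prints `UTF-16 units: D83D DCBB`, matching the pair above.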
Of course if you're outputting XML or HTML, the numeric character reference (NCR) &#x1F4BB; should indeed be equivalent to the literal character. The question is how that character should flow through the system, right?
Internally, Xalan uses Java characters, which are UTF-16. So its representation of this character is indeed the surrogate pair. On input, that conversion gets done while reading the stream, based on which encoding the stream has been told to expect.
I believe Java readers assume a default encoding unless told otherwise (and an XML parser assumes UTF-8 when there's no declaration), which as noted above is a different sequence of bytes than UTF-16 would be. They handle this conversion to the internal characters for us before Xalan itself ever sees the data. So it's clear what the behavior should be for raw bytes if you know the encoding. If you don't know the encoding but know that it's UTF-something, the design does help you out: a UTF-8 lead byte indicates how many bytes follow it, and a parser can sniff the first few bytes of the stream (a BOM, or the byte pattern of the "<?xml" declaration) to distinguish UTF-8 from UTF-16. Either way that nails down the bytestream.
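A minimal sketch of that decoding step, the part that happens before Xalan sees anything (class name `DecodeDemo` is just for illustration; a real parser would wrap the actual input stream the same way):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) throws IOException {
        // The raw UTF-8 bytes for U+1F4BB
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x92, (byte) 0xBB};
        // The Reader performs the byte-to-char conversion, given the declared encoding
        Reader r = new InputStreamReader(new ByteArrayInputStream(utf8),
                                         StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        // The same logical character arrives internally as the surrogate pair
        System.out.printf("%04X %04X%n", (int) sb.charAt(0), (int) sb.charAt(1));
    }
}
```

Four bytes in, two UTF-16 code units (D83D DCBB) out; the application code downstream never touches the encoding.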
The question becomes whether a pair of NCRs expressing a surrogate sequence (e.g. &#xD83D;&#xDCBB;) is expected to be combined by an XML or HTML parser. That would have to happen *after* the stream was read and the references were expanded. (Worth noting: XML's Char production excludes the surrogate range, so an NCR like &#xD83D; isn't even well-formed XML.)
I _suspect_ the intended answer is no, and that we should instead be outputting one of two things: raw bytes as appropriate for the encoding, assuming the encoding is UTF-* and can handle that character this way, or -- for XML and HTML *only*, since they're the only formats that define NCRs -- a single NCR which expresses the final value and leaves the question of appropriate internal encoding/representation for the receiving application to figure out.
In other words, if we're going to write it out as a numeric character reference (so it survives passing through non-Unicode layers), I think you're right that sending it as &#x1F4BB; is certainly safer than sending one NCR per code unit. A code unit is not a character.
This would add some cost to serialization, since now someone has to recognize that a sequence of Java chars may contain things which even UTF-16 requires a surrogate pair for, and convert those pairs from UTF-16 to a single code point. That means checking every Java char for whether it introduces a surrogate pair, and sending those chars and their trailing counterparts through an alternate serialization path.
Which may not be all that bad. The HTML and XML serializers already need to check whether each character has to be output as an NCR. I believe that check would already recognize the leading surrogate, and the additional work would apply only after that's determined to be true. So 99.44% of the time there should be no new cost.
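The serializer-side logic I have in mind would look something like the sketch below. This is not Xalan's actual code; `NcrDemo` and `toNcrs` are hypothetical names, and a real serializer would only take the NCR path for characters the output encoding can't represent directly:

```java
public class NcrDemo {
    // Hypothetical helper: emit one NCR per *code point*, combining
    // surrogate pairs rather than emitting an NCR per code unit.
    static String toNcrs(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (Character.isHighSurrogate(ch) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Combine the pair back into a single code point
                int cp = Character.toCodePoint(ch, s.charAt(++i));
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(ch).toUpperCase())
                   .append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "A" followed by U+1F4BB as a surrogate pair
        System.out.println(toNcrs("A\uD83D\uDCBB")); // &#x41;&#x1F4BB;
    }
}
```

The extra branch only fires when `isHighSurrogate` is true, which is the cheap-in-the-common-case behavior described above.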
Conclusion: The current behavior probably _is_ a bug, and the suggested replacement behavior appears to be appropriate.