Re: supplementary characters emojis, etc turned ino surrogate pairs

Eric J. Schwarzenbach Tue, 09 Jan 2024 09:39:17 -0800

Apologies for the mistaken terminology. I do usually know the difference between valid and well-formed and am usually careful about the distinction; however, the idea that a numeric character entity could break either was new to me, and doesn't really fit my notion of either. I repeated the characterization of the problem that I had read without checking into it. Sorry for that, and thanks for looking into it.


Cheers,


Eric

On 1/8/24 18:58, Joseph Kesselman wrote:

Please be careful to distinguish "Well-Formed" from "Valid". If an XML tool complains that a document is not valid, that means it doesn't match the DTD or schema that describes its expected structure, not that it isn't correct XML. It's better to avoid using the term valid unless you mean Valid in the sense XML does.
A high-numeric-value surrogate pair shouldn't be a well-formedness issue, barring a bug.
I haven't looked at this in any detail in a few decades, but I'll try to at least sanity-check now that I'm emerging from the build caverns again.
For convenience of others who might want it: Legal characters in XML 1.0 are defined at https://www.w3.org/TR/xml/#charsets. I believe XML 1.1 added the null character, originally not accepted. There are some Unicode ranges excluded, but those were supposed to be permanently reserved blocks.
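
To make the character ranges concrete, here is a minimal Java sketch of the XML 1.0 Char production from the section linked above. The class and method names (XmlCharCheck, isXml10Char) are made up for illustration and are not part of Xalan.

```java
public class XmlCharCheck {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // (hypothetical helper for illustration, not Xalan code)
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // The supplementary character from the bug report.
        System.out.println(isXml10Char(0x1F4BB));
        // The two halves of its UTF-16 surrogate pair, taken as code points.
        System.out.println(isXml10Char(55357));
        System.out.println(isXml10Char(56507));
    }
}
```

Under that production, 0x1F4BB falls inside the last range, while 55357 (0xD83D) and 56507 (0xDCBB) sit in the excluded surrogate block between 0xD7FF and 0xE000.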
Xalan did originally have some issues with characters above the UTF-16 range, mostly having to do with counts and offsets since the first draft just used Java characters. But I thought we had addressed those. Obviously, if the fork shipping in javax has solved it, a solution is possible and probably already in our backlog...
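
As a rough illustration of the "Java characters" problem, the sketch below contrasts escaping per Java char with escaping per code point. It is not Xalan's serializer code; the class and method names are hypothetical, and it only deals with non-ASCII characters (markup characters such as < and & are left alone).

```java
public class SupplementaryEscape {

    // Naive version: walks Java chars, so a supplementary character
    // is seen as two separate surrogate values.
    static String escapePerChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    // Code-point version: a surrogate pair is combined into one code point
    // before it is written as a numeric character reference.
    static String escapePerCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by 1 for BMP characters, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String laptop = "\uD83D\uDCBB"; // U+1F4BB as a Java string
        System.out.println(escapePerChar(laptop));      // &#55357;&#56507;
        System.out.println(escapePerCodePoint(laptop)); // &#128187;
    }
}
```

The first output matches what the original report describes; the second is a single reference to the code point 0x1F4BB (decimal 128187).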
--
   /_  Joe Kesselman (he/him/his)
-/ _) My Alexa skill for New Music/New Sounds fans:
   / https://www.amazon.com/dp/B09WJ3H657/
Caveat: Opinionated old geezer with overcompensated writer's block. May be redundant, verbose, prolix, sesquipedalian, didactic, officious, or redundant.
------------------------------------------------------------------------
*From:* Stanimir Stamenkov via j-users <j-users@xalan.apache.org>
*Sent:* Monday, January 8, 2024 2:51:57 PM
*To:* j-users@xalan.apache.org <j-users@xalan.apache.org>
*Subject:* Re: supplementary characters emojis, etc turned ino surrogate pairs
Mon, 8 Jan 2024 16:33:39 +0100, /Martin Honnen/:
> On 08/01/2024 16:28, Eric J. Schwarzenbach wrote:
>
>> Does anybody have a patch for
>>
>> https://issues.apache.org/jira/browse/XALANJ-2560
>>
>> That Xalan produces invalid XML with some utf-8 characters seems
>> rather serious. I find putting &#x1F4BB; or the literal character it
>> represents into an XML document and running it through any XML-to-XML
>> transform results in it being replaced with &#55357;&#56507; in the
>> output which evidently makes the XML invalid. I tried a change to
>> ToStream.java from https://issues.apache.org/jira/browse/XALANJ-2419
>> with the source of Xalan 2.7.3 but it did not help.
>
> Use Saxon, perhaps, or see whether
> https://stackoverflow.com/a/74245232/252228 helps for patching Xalan.

One may also use just the JDK-supplied provider (a Xalan fork):

* https://lists.apache.org/thread/3hzpj1gt1ql38d17dcfxrgss872v50l6 "XML
Entities"

Related to the patch referenced in the Stack Overflow answer, one may
compare with the JDK sources as well:

*
https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java
*
https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToHTMLStream.java
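
For anyone wanting to try the JDK-supplied provider even with Xalan on the classpath, a minimal sketch follows. It assumes Java 9 or later, where TransformerFactory.newDefaultInstance() returns the built-in implementation; the class name JdkTransformerDemo and the sample document are made up for illustration. The output encoding is set to US-ASCII only to force the serializer to write the character as a reference.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class JdkTransformerDemo {

    // Runs an identity transform using the JDK's built-in provider.
    static String identity(String xml) {
        try {
            // Java 9+: bypasses the usual provider lookup and returns
            // the system-default (JDK-internal) implementation.
            TransformerFactory factory = TransformerFactory.newDefaultInstance();
            Transformer transformer = factory.newTransformer();
            // Force non-ASCII characters to be written as character references.
            transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("identity transform failed", e);
        }
    }

    public static void main(String[] args) {
        // U+1F4BB given as a hexadecimal character reference in the input.
        String input = "<doc>&#x1F4BB;</doc>";
        // Inspect the output to see how the supplementary character is written.
        System.out.println(identity(input));
    }
}
```

Running the same input through TransformerFactory.newInstance() with Xalan on the classpath should make the difference between the two serializers easy to compare.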

--
Stanimir
