Apologies for the mistaken terminology. I do usually know the difference
between valid and well-formed and am usually careful about the
distinction; however, the idea that a numeric character entity could
break either was new to me, and doesn't really fit my notion of
either. I repeated the characterization of the problem that I had read
without checking into it. Sorry for that, and thanks for looking into it.
Cheers,
Eric
On 1/8/24 18:58, Joseph Kesselman wrote:
Please be careful to distinguish "Well-Formed" from "Valid". If an
XML tool complains that a document is not valid, that means it doesn't
match the DTD or schema that describes its expected structure, not
that it isn't correct XML. It's better to avoid using the term valid
unless you mean Valid in the sense XML does.
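
To make the distinction concrete, here is a minimal sketch using the JDK's
standard DOM parser with DTD validation switched on. The class name, the
method name and the toy DTD in the usage note are made up for illustration;
only the javax.xml.parsers and org.xml.sax calls are real API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class WellFormedVsValid {
    /** Returns "not well-formed", "well-formed but not valid", or "valid". */
    static String classify(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // ask for DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> validityErrors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            // Validity problems are recoverable errors: collect them, keep parsing.
            public void error(SAXParseException e) { validityErrors.add(e.getMessage()); }
            // Well-formedness problems are fatal: parsing cannot continue.
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            public void warning(SAXParseException e) { }
        });

        try {
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            return "not well-formed";
        }
        return validityErrors.isEmpty() ? "valid" : "well-formed but not valid";
    }
}
```

With a DTD saying that a must contain exactly one b, `<a><b/></a>` should be
classified as valid, `<a/>` as well-formed but not valid, and `<a><b></a>`
as not well-formed.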
A high-numeric-value surrogate pair shouldn't be a well-formedness
issue, barring a bug.
I haven't looked at this in any detail in a few decades, but I'll try
to at least sanity-check now that I'm emerging from the build caverns
again.
For convenience of others who might want it: Legal characters in XML
1.0 are defined at https://www.w3.org/TR/xml/#charsets. I believe XML
1.1 relaxed this for most control characters (as character references),
though the null character is still not accepted. There are some
Unicode ranges excluded, but those were supposed to be permanently
reserved blocks.
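
For anyone who wants to check a code point by hand, the Char production at
that URL can be transcribed roughly as follows. This is only a sketch of the
XML 1.0 ranges; the class and method names are invented for the example.

```java
public class XmlChars {
    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD   // tab, LF, CR
            || (cp >= 0x20 && cp <= 0xD7FF)          // up to just below the surrogates
            || (cp >= 0xE000 && cp <= 0xFFFD)        // skips surrogates, U+FFFE, U+FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);    // supplementary planes
    }
}
```

Under this check U+1F4BB (decimal 128187) is a legal character, while the
surrogate code points U+D83D and U+DCBB (decimal 55357 and 56507) fall in
the excluded D800-DFFF block when taken individually.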
Xalan did originally have some issues with characters above the 16-bit
range (outside the Basic Multilingual Plane), mostly having to do with
counts and offsets since the first draft just used Java characters. But
I thought we had addressed those.
Obviously, if the fork shipping in javax has solved it, a solution is
possible and probably already in our backlog...
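
To illustrate the counts-and-offsets point: a supplementary character takes
two Java chars, so code that walks a string char by char sees two surrogates
instead of one character. The sketch below is not Xalan's serializer code;
the class and method are hypothetical and only show the code-point-aware
loop, using java.lang.String and java.lang.Character methods.

```java
public class SupplementaryChars {
    /**
     * Writes every non-ASCII code point as ONE decimal character reference.
     * Illustration only: markup characters such as '<' and '&' are not escaped.
     */
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // joins a valid surrogate pair into one code point
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);     // advances by 2 for supplementary characters
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String laptop = "\uD83D\uDCBB";       // U+1F4BB as a Java string
        System.out.println(laptop.length());                             // char count
        System.out.println(laptop.codePointCount(0, laptop.length()));   // character count
        System.out.println(escapeNonAscii("a" + laptop));
    }
}
```

Stepping by Character.charCount rather than by 1 is what keeps the two
halves of the pair together.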
--
/_ Joe Kesselman (he/him/his)
-/ _) My Alexa skill for New Music/New Sounds fans:
/ https://www.amazon.com/dp/B09WJ3H657/
Caveat: Opinionated old geezer with overcompensated writer's block.
May be redundant, verbose, prolix, sesquipedalian, didactic,
officious, or redundant.
------------------------------------------------------------------------
*From:* Stanimir Stamenkov via j-users <j-users@xalan.apache.org>
*Sent:* Monday, January 8, 2024 2:51:57 PM
*To:* j-users@xalan.apache.org <j-users@xalan.apache.org>
*Subject:* Re: supplementary characters emojis, etc turned into
surrogate pairs
Mon, 8 Jan 2024 16:33:39 +0100, /Martin Honnen/:
> On 08/01/2024 16:28, Eric J. Schwarzenbach wrote:
>
>> Does anybody have a patch for
>>
>> https://issues.apache.org/jira/browse/XALANJ-2560
>>
>> That Xalan produces invalid XML with some utf-8 characters seems
>> rather serious. I find putting &#128187; or the literal character it
>> represents into an XML document and running it through any XML-to-XML
>> transform results in it being replaced with &#55357;&#56507; in the
>> output which evidently makes the XML invalid. I tried a change to
>> ToStream.java from https://issues.apache.org/jira/browse/XALANJ-2419
>> with the source of Xalan 2.7.3 but it did not help.
>
> Use Saxon, perhaps, or see whether
> https://stackoverflow.com/a/74245232/252228 helps for patching Xalan.
One may also use just the JDK-supplied provider (a Xalan fork):
* https://lists.apache.org/thread/3hzpj1gt1ql38d17dcfxrgss872v50l6 "XML
Entities"
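
A minimal sketch of selecting the JDK-supplied provider explicitly, assuming
Java 9 or later (where TransformerFactory.newDefaultInstance() is available).
The class name is made up, and the identity transform is just the simplest
XML-to-XML case to try; what the output looks like depends on the JDK
version in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class JdkIdentityTransform {
    static String identity(String xml) throws Exception {
        // newDefaultInstance() returns the JDK's built-in implementation,
        // even if another provider (e.g. Xalan-J) is on the classpath.
        TransformerFactory factory = TransformerFactory.newDefaultInstance();
        Transformer transformer = factory.newTransformer(); // no stylesheet: identity transform
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Inspect the output for a single reference or literal character
        // versus a pair of surrogate references.
        System.out.println(identity("<a>&#128187;</a>"));
    }
}
```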
Related to the patch referenced in the Stack Overflow answer, one may
compare with the JDK sources as well:
*
https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java
*
https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToHTMLStream.java
--
Stanimir