Apologies for the mistaken terminology. I do usually know the difference between valid and well-formed and am usually careful about the distinction; however, the idea that a numeric character entity could break either was new to me, and doesn't really fit my notion of either. I repeated the characterization of the problem that I had read without checking into it. Sorry for that, and thanks for looking into it.

Cheers,

Eric

On 1/8/24 18:58, Joseph Kesselman wrote:
Please be careful to distinguish "Well-Formed" from "Valid". If an XML tool complains that a document is not valid, that means it doesn't match the DTD or schema that describes its expected structure, not that it isn't correct XML. It's better to avoid using the term valid unless you mean Valid in the sense XML does.

A high-numeric-value surrogate pair shouldn't be a well-formedness issue, barring a bug.

I haven't looked at this in any detail in a few decades, but I'll try to at least sanity-check now that I'm emerging from the build caverns again.


For the convenience of others who might want it: legal characters in XML 1.0 are defined at https://www.w3.org/TR/xml/#charsets. I believe XML 1.1 added most of the control characters, originally not accepted (though still not the null character). There are some Unicode ranges excluded, but those were supposed to be permanently reserved blocks.
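
As a rough illustration of that definition, here is a minimal sketch of the XML 1.0 Char production as a Java predicate. The class and method names are made up for this example and are not part of Xalan or the JDK. Note that the surrogate block (U+D800 to U+DFFF) is one of the excluded ranges, which would explain why a parser rejects a reference to a lone surrogate half while accepting the supplementary code point itself.

```java
final class Xml10Chars {
    /**
     * True if the code point matches the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     * (hypothetical helper, for illustration only).
     */
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

With this check, 0x1F4BB (the character from the original report) is accepted, while 0xD83D and 0xDCBB (decimal 55357 and 56507, its two UTF-16 halves) are not.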

Xalan did originally have some issues with characters beyond the 16-bit range of a single UTF-16 code unit, mostly having to do with counts and offsets, since the first draft just used Java characters. But I thought we had addressed those. Obviously, if the fork shipping in javax has solved it, a solution is possible and probably already in our backlog...
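
To make the counts-and-offsets point concrete, here is a small sketch contrasting per-char escaping with per-code-point escaping. It is not Xalan's serializer code; the class and method names are invented for illustration, and it only handles non-ASCII escaping (no handling of &, < and so on).

```java
final class NumericCharRefs {
    /** Escapes each non-ASCII code point as one decimal character reference. */
    static String escapeByCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins a valid high/low surrogate pair into one value;
            // an unpaired surrogate would come back as-is and still needs
            // separate handling in real code.
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    /** Escapes each Java char on its own, which splits surrogate pairs. */
    static String escapeByChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }
}
```

For the single character U+1F4BB, a Java String has length 2. The per-char version yields &#55357;&#56507; (the output described in the report), while the per-code-point version yields &#128187;.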



--
   /_  Joe Kesselman (he/him/his)
-/ _) My Alexa skill for New Music/New Sounds fans:
   / https://www.amazon.com/dp/B09WJ3H657/

Caveat: Opinionated old geezer with overcompensated writer's block. May be redundant, verbose, prolix, sesquipedalian, didactic, officious, or redundant.
------------------------------------------------------------------------
*From:* Stanimir Stamenkov via j-users <j-users@xalan.apache.org>
*Sent:* Monday, January 8, 2024 2:51:57 PM
*To:* j-users@xalan.apache.org <j-users@xalan.apache.org>
*Subject:* Re: supplementary characters emojis, etc turned ino surrogate pairs
Mon, 8 Jan 2024 16:33:39 +0100, /Martin Honnen/:
> On 08/01/2024 16:28, Eric J. Schwarzenbach wrote:
>
>> Does anybody have a patch for
>>
>> https://issues.apache.org/jira/browse/XALANJ-2560
>>
>> That Xalan produces invalid XML with some utf-8 characters seems
>> rather serious. I find putting &#x1F4BB; or the literal character it
>> represents into an XML document and running it through any XML-to-XML
>> transform results in it being replaced with &#55357;&#56507; in the
>> output which evidently makes the XML invalid. I tried a change to
>> ToStream.java from https://issues.apache.org/jira/browse/XALANJ-2419
>> with the source of Xalan 2.7.3 but it did not help.
>
> Use Saxon, perhaps, or see whether
> https://stackoverflow.com/a/74245232/252228 helps for patching Xalan.

One may also use just the JDK-supplied provider (a Xalan fork):

* https://lists.apache.org/thread/3hzpj1gt1ql38d17dcfxrgss872v50l6 "XML
Entities"
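
A minimal sketch of selecting the JDK-supplied provider explicitly, assuming Java 9 or later (where TransformerFactory.newDefaultInstance() is available) and using an identity transform as the test case. The class and method names are hypothetical; the exact form of the output (literal character or a single numeric reference) may vary by JDK version, so it is best checked locally.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class JdkIdentityTransform {
    /** Copies the XML through the JDK's built-in transformer unchanged. */
    static String copy(String xml) {
        // newDefaultInstance() bypasses the usual provider lookup, so a
        // Xalan jar on the classpath is not picked up here.
        TransformerFactory factory = TransformerFactory.newDefaultInstance();
        try {
            Transformer identity = factory.newTransformer(); // no stylesheet: identity copy
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)),
                               new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("identity transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Inspect how the supplementary character is serialized.
        System.out.println(copy("<a>&#x1F4BB;</a>"));
    }
}
```

If the output contains &#55357;&#56507;, the provider in use still splits the pair; otherwise the character survived as one unit.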

Related to the patch referenced in the Stack Overflow answer, one may
compare with the JDK sources as well:

* https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java
* https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToHTMLStream.java

--
Stanimir
