Thanks. Interesting.

I still need to sanity check whether both forms should be accepted, to make 
sure the problem really is a bug in Xalan rather than in whatever is reading 
our output. In the latter case we might consider changing anyway, but that 
would raise the question of what to do if some other application doesn't like 
the large numbers in decimal form... which might mean we should make this 
behavior change configurable.

Do you know which code is producing the error message? If the hex form *is* 
correct and supposed to be fully equivalent to the decimal form, we'd want to 
report a bug against that.
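
One quick way to sanity-check the equivalence is to feed both forms to a conforming parser and compare what comes back. The sketch below uses the JDK's built-in DOM parser; the class name CharRefCheck and the helper textOf are made up for illustration and are not part of Xalan.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class CharRefCheck {
    // Parse a tiny document and return the text content of its root element.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String hex = textOf("<a>&#x1F4BB;</a>");   // hexadecimal character reference
        String dec = textOf("<a>&#128187;</a>");   // decimal character reference
        System.out.println(hex.equals(dec));                         // true
        System.out.println(Integer.toHexString(hex.codePointAt(0))); // 1f4bb
    }
}
```

If a downstream reader accepts one form and rejects the other, that would point at the reader rather than at the serializer.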

--
   /_  Joe Kesselman (he/him/his)
-/ _) My Alexa skill for New Music/New Sounds fans:
   /   https://www.amazon.com/dp/B09WJ3H657/

Caveat: Opinionated old geezer with overcompensated writer's block. May be 
redundant, verbose, prolix, sesquipedalian, didactic, officious, or redundant.
________________________________
From: Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
Sent: Tuesday, January 9, 2024 5:36:19 PM
To: Joseph Kesselman <kesh...@alum.mit.edu>; j-users@xalan.apache.org 
<j-users@xalan.apache.org>
Subject: Re: supplementary characters emojis, etc turned ino surrogate pairs


I've managed to make Xalan do something more correct for my test case by 
merging some bits from the JDK 21 version of ToStream into Xalan 2.7.3's 
version.

Note that the JDK code replaces either &#x1F4BB; or the literal character it 
represents with the equivalent decimal character reference, &#128187;
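
For readers following along, here is a minimal sketch of the general idea: escape per Unicode code point rather than per UTF-16 code unit, so a surrogate pair becomes one reference instead of two. This is not the JDK's or Xalan's actual ToStream code; the class and method names are hypothetical, and a real serializer would also have to escape markup characters such as & and <.

```java
final class CharRefEscaper {
    // Replace every non-ASCII code point with a single decimal character reference.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // combines a high/low surrogate pair
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);     // advance by 1 or 2 UTF-16 units
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F4BB is stored in a Java String as the pair \uD83D \uDCBB.
        System.out.println(escapeNonAscii("x\uD83D\uDCBBy")); // x&#128187;y
    }
}
```

Escaping each char separately would instead produce &#55357;&#56507;, which is the output described in the original report.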

Joe, I'm sending you an email directly about this since I think it's beyond the 
scope of xalan-user.


On 1/9/24 13:34, Joseph Kesselman wrote:
No problem. I was around when we still had to teach people the distinction, and 
error messages still sometimes get it wrong.

I'll try to look into it this week, unless someone beats me to it.

________________________________
From: Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
Sent: Tuesday, January 9, 2024 12:39:07 PM
To: j-users@xalan.apache.org
Subject: Re: supplementary characters emojis, etc turned ino surrogate pairs


Apologies for the mistaken terminology. I do usually know the difference 
between valid and well-formed and am usually careful about the distinction; 
however, the idea that a numeric character entity could break either was new 
to me, and doesn't really fit my notion of either. I repeated the 
characterization of the problem that I had read without checking into it. 
Sorry for that, and thanks for looking into it.

Cheers,

Eric

On 1/8/24 18:58, Joseph Kesselman wrote:
Please be careful to distinguish "Well-Formed" from "Valid". If an XML tool 
complains that a document is not valid, that means it doesn't match the DTD or 
schema that describes its expected structure, not that it isn't correct XML. 
It's better to avoid using the term valid unless you mean Valid in the sense 
XML does.

A high-numeric-value surrogate pair shouldn't be a well-formedness issue, 
barring a bug.

I haven't looked at this in any detail in a few decades, but I'll try to at 
least sanity-check now that I'm emerging from the build caverns again.


For the convenience of others who might want it: legal characters in XML 1.0 
are defined at https://www.w3.org/TR/xml/#charsets. I believe XML 1.1 added 
most of the control characters that were originally not accepted (as character 
references only); the null character is still excluded. Some Unicode ranges 
are excluded, but those were supposed to be permanently reserved blocks.
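
As a rough illustration, the XML 1.0 Char production from that section can be written as a small predicate. The class and method names here are invented for the example.

```java
final class Xml10Chars {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(128187)); // true:  U+1F4BB
        System.out.println(isXml10Char(55357));  // false: 0xD83D, a high surrogate
        System.out.println(isXml10Char(56507));  // false: 0xDCBB, a low surrogate
    }
}
```

The surrogate block D800-DFFF falls in the gap between the second and third ranges, so a character reference to a lone surrogate value such as &#55357; does not name a legal character, while &#128187; does.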

Xalan did originally have some issues with characters above the 16-bit range 
(outside the Basic Multilingual Plane), mostly having to do with counts and 
offsets, since the first draft just used Java characters. But I thought we had 
addressed those. Obviously, if the fork shipping in javax has solved it, a 
solution is possible and probably already in our backlog...
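
To make the counts-and-offsets point concrete, here is a small sketch using only standard java.lang.String methods (the class name is made up). A Java char is one UTF-16 code unit, so a supplementary character occupies two of them.

```java
final class CodePointCounts {
    // Number of UTF-16 code units (what String.length() reports).
    static int units(String s) {
        return s.length();
    }

    // Number of Unicode code points (what a user would call "characters").
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDCBBb";              // 'a', U+1F4BB, 'b'
        System.out.println(units(s));             // 4
        System.out.println(codePoints(s));        // 3
        // Char index of the third code point ('b'): 3, not 2.
        System.out.println(s.offsetByCodePoints(0, 2));
    }
}
```

Any code that indexes or counts by char will be off by one for each supplementary character it passes over.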



________________________________
From: Stanimir Stamenkov via j-users <j-users@xalan.apache.org>
Sent: Monday, January 8, 2024 2:51:57 PM
To: j-users@xalan.apache.org
Subject: Re: supplementary characters emojis, etc turned ino surrogate pairs

Mon, 8 Jan 2024 16:33:39 +0100, /Martin Honnen/:
> On 08/01/2024 16:28, Eric J. Schwarzenbach wrote:
>
>> Does anybody have a patch for
>>
>> https://issues.apache.org/jira/browse/XALANJ-2560
>>
>> That Xalan produces invalid XML with some utf-8 characters seems
>> rather serious. I find putting &#x1F4BB; or the literal character it
>> represents into an XML document and running it through any XML-to-XML
>> transform results in it being replaced with &#55357;&#56507; in the
>> output which evidently makes the XML invalid. I tried a change to
>> ToStream.java from https://issues.apache.org/jira/browse/XALANJ-2419
>> with the source of Xalan 2.7.3 but it did not help.
>
> Use Saxon, perhaps, or see whether
> https://stackoverflow.com/a/74245232/252228 helps for patching Xalan.

One may also use just the JDK-supplied provider (a Xalan fork):

* https://lists.apache.org/thread/3hzpj1gt1ql38d17dcfxrgss872v50l6 "XML
Entities"
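
A minimal sketch of selecting the JDK's built-in implementation explicitly, assuming Java 9 or later, where TransformerFactory.newDefaultInstance() is available; it returns the system-default factory even when another provider is on the classpath. The class name and the identity helper are hypothetical, and the exact serialized form of the character may vary by JDK version, so no output is shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class JdkIdentityTransform {
    // Run an identity transform using the JDK's own TransformerFactory.
    static String identity(String xml) {
        try {
            // newTransformer() with no stylesheet yields an identity transform.
            Transformer t = TransformerFactory.newDefaultInstance().newTransformer();
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Inspect how the supplementary character is written out.
        System.out.println(identity("<a>&#x1F4BB;</a>"));
    }
}
```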

Related to the patch referenced in the Stack Overflow answer, one may
compare with the JDK sources as well:

*
https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java
*
https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToHTMLStream.java

--
Stanimir
