Thanks. Interesting. I still need to sanity check whether both forms should be accepted, to make sure the problem really is a bug in Xalan rather than in whatever is reading our output. In the latter case we might consider changing anyway, but that would raise the question of what to do if some other application doesn't like the large numbers in decimal form... which might mean we should make this behavior change configurable.
Do you know which code is producing the error message? If the hex form *is* correct and supposed to be fully equivalent to the decimal form, we'd want to report a bug against that.

--
   /_  Joe Kesselman (he/him/his)
-/ _)  My Alexa skill for New Music/New Sounds fans:
  /    https://www.amazon.com/dp/B09WJ3H657/

Caveat: Opinionated old geezer with overcompensated writer's block. May be redundant, verbose, prolix, sesquipedalian, didactic, officious, or redundant.

________________________________
From: Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
Sent: Tuesday, January 9, 2024 5:36:19 PM
To: Joseph Kesselman <kesh...@alum.mit.edu>; j-users@xalan.apache.org <j-users@xalan.apache.org>
Subject: Re: supplementary characters emojis, etc turned ino surrogate pairs

I've managed to make Xalan do something more correct for my test case by merging some bits from the JDK 21 version of ToStream into Xalan 2.7.3's version. Note that the JDK code replaces either &#x1F4BB; or the literal character it represents with the equivalent decimal entity, &#128187;.

Joe, I'm sending you an email directly about this since I think it's beyond the scope of xalan-user.

On 1/9/24 13:34, Joseph Kesselman wrote:

No problem. I was around when we still had to teach people the distinction, and error messages still sometimes get it wrong.

I'll try to look into it this week, unless someone beats me to it.

________________________________
From: Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
Sent: Tuesday, January 9, 2024 12:39:07 PM
To: j-users@xalan.apache.org <j-users@xalan.apache.org>
Subject: Re: supplementary characters emojis, etc turned ino surrogate pairs

Apologies for the mistaken terminology. I do usually know the difference between valid and well-formed and am usually careful about the distinction; however, the idea that a numeric character entity could break either was new to me, and doesn't really fit my notion of either. I repeated the characterization of the problem that I had read without checking into it. Sorry for that, and thanks for looking into it.

Cheers,

Eric

On 1/8/24 18:58, Joseph Kesselman wrote:

Please be careful to distinguish "Well-Formed" from "Valid". If an XML tool complains that a document is not valid, that means it doesn't match the DTD or schema that describes its expected structure, not that it isn't correct XML. It's better to avoid using the term valid unless you mean Valid in the sense XML does.

A high-numeric-value surrogate pair shouldn't be a well-formedness issue, barring a bug. I haven't looked at this in any detail in a few decades, but I'll try to at least sanity-check now that I'm emerging from the build caverns again.

For convenience of others who might want it: Legal characters in XML 1.0 are defined at https://www.w3.org/TR/xml/#charsets. I believe XML 1.1 relaxed this to admit most of the control characters that were originally not accepted, though still not the null character. There are some Unicode ranges excluded, but those were supposed to be permanently reserved blocks.

Xalan did originally have some issues with characters beyond the 16-bit range, mostly having to do with counts and offsets, since the first draft just used Java characters. But I thought we had addressed those. Obviously, if the fork shipping in javax has solved it, a solution is possible and probably already in our backlog...
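For anyone who wants to check a numeric character reference against the Char production at that link, here is a minimal Java sketch. It is only an illustration: the names XmlCharCheck and isXml10Char are made up for this example and are not part of Xalan or the JDK. It transcribes the XML 1.0 ranges and can be used to see that a lone surrogate value such as 55357 falls outside them, while the full code point 128187 (hex 1F4BB) falls inside.

```java
// Illustrative sketch only; XmlCharCheck and isXml10Char are hypothetical names,
// not Xalan or JDK API.
final class XmlCharCheck {

    // Char production from XML 1.0, section 2.2:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // The two surrogate halves, then the same character as one code point
        // written in decimal and in hex.
        int[] samples = {55357, 56507, 128187, 0x1F4BB};
        for (int cp : samples) {
            System.out.printf("&#%d; -> %s%n", cp,
                    isXml10Char(cp) ? "legal Char" : "not a legal Char");
        }
    }
}
```

The surrogate block D800-DFFF is deliberately absent from the ranges, which is why a reference to either half of a pair is rejected by a conforming parser even though the pair as a whole names a legal character.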
________________________________
From: Stanimir Stamenkov via j-users <j-users@xalan.apache.org>
Sent: Monday, January 8, 2024 2:51:57 PM
To: j-users@xalan.apache.org <j-users@xalan.apache.org>
Subject: Re: supplementary characters emojis, etc turned ino surrogate pairs

Mon, 8 Jan 2024 16:33:39 +0100, /Martin Honnen/:

> On 08/01/2024 16:28, Eric J. Schwarzenbach wrote:
>
>> Does anybody have a patch for
>>
>> https://issues.apache.org/jira/browse/XALANJ-2560
>>
>> That Xalan produces invalid XML with some utf-8 characters seems
>> rather serious. I find putting &#x1F4BB; or the literal character it
>> represents into an XML document and running it through any XML-to-XML
>> transform results in it being replaced with &#55357;&#56507; in the
>> output which evidently makes the XML invalid. I tried a change to
>> ToStream.java from https://issues.apache.org/jira/browse/XALANJ-2419
>> with the source of Xalan 2.7.3 but it did not help.
>
> Use Saxon, perhaps, or see whether
> https://stackoverflow.com/a/74245232/252228 helps for patching Xalan.

One may also use just the JDK-supplied provider (a Xalan fork):

* https://lists.apache.org/thread/3hzpj1gt1ql38d17dcfxrgss872v50l6
  "XML Entities"

Related to the patch referenced in the Stack Overflow answer, one may compare with the JDK sources as well:

* https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java
* https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToHTMLStream.java

--
Stanimir
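
A rough Java sketch of the difference discussed in this thread, assuming the only job is to turn non-ASCII text into decimal numeric character references. SupplementaryEscaper and both method names are hypothetical; this is not the ToStream code from Xalan or the JDK, and it skips markup escaping (<, &, and so on) and unpaired surrogates. Escaping one Java char at a time yields a separate reference for each half of a surrogate pair, while walking the string by code point yields a single reference for the whole character.

```java
// Illustrative sketch only; SupplementaryEscaper and its methods are hypothetical
// names and do not reproduce Xalan's or the JDK's serializer logic.
final class SupplementaryEscaper {

    // Naive approach: treats every 16-bit Java char as a character, so a
    // supplementary character comes out as two surrogate references.
    static String escapePerChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    // Code-point approach: String.codePointAt combines a valid surrogate pair
    // into one code point, so one reference is written per character.
    static String escapeByCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // advance 2 chars for a surrogate pair, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String laptop = "\uD83D\uDCBB"; // U+1F4BB as a Java surrogate pair
        System.out.println(escapePerChar(laptop));
        System.out.println(escapeByCodePoint(laptop));
    }
}
```

Whether the reference is written in decimal or hex should not matter to a conforming parser; decimal is used here only because that is what the JDK fork was reported to emit.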