I'll try to summarize the key points of my more long-winded email to Joe:
The ToStream.java in Xalan cannot be wholesale replaced with the
ToStream in the JDK without dragging other classes along with it (even
accounting for the obvious package name changes). Or at least that
appeared to be the case on the surface; I did not try it to see how far
it would go, but instead copied only those changes that appeared
relevant to the issue at hand (which I'll just call "the encoding
issue").
The JDK version of this class has changes from the Xalan version that
include coding style updates, a few renamings and small refactorings,
changes relating to output formatting (which includes an added buffering
step during the character processing), and some differences in the
initialization of byte arrays that the comments suggest are performance
optimizations, and of course changes relating to the encoding issue. It
is possible those optimizations were made against the older code that
was forked, and that Xalan has since optimized the same things in a
different way.
It's not clear to me whether Xalan would want some of these other
changes. Perhaps someone on this list knows something about the state of
Xalan wrt formatting and whether these changes might be of interest. I
also didn't address ToHTMLStream.java.
On 1/9/24 21:47, Gary Gregory wrote:
I am not concerned so much as to which list the conversation is
recorded on as long as it's on a project mailing list. You all can do
what you want of course but keep in mind that excluding a mailing list
removes the opportunity for people to observe and learn. It's just
simpler IMO to keep it on a list. You never know who might chime in
with an interesting tidbit or solution that day, or a week or a month later.
My 2c,
Gary
On Tue, Jan 9, 2024, 9:16 PM Joseph Kesselman <kesh...@alum.mit.edu>
wrote:
Gary, at some level of detail it ought to at least cross over into
xalan-dev.
The user list shouldn't usually get into the implementation weeds,
though discussion of the correct and/or desired behavior is still
appropriate here.
Personally I would say some private brainstorming is harmless at
worst, as long as conclusions get reported back to the team. It's
impossible to prevent, and it _can_ be a good thing in letting
people explore something before making a more formal proposal.
Both formal and informal have their uses. I mean, come on, does
your team thrash out every line of code in meetings, or do you go
off and work with others to come up with a proposal/prototype first?
Let water-cooler chats happen, and just count on folks reporting
back. It works.
--
/_ Joe Kesselman (he/him/his)
-/ _) My Alexa skill for New Music/New Sounds fans:
/ https://www.amazon.com/dp/B09WJ3H657/
Caveat: Opinionated old geezer with overcompensated writer's
block. May be redundant, verbose, prolix, sesquipedalian,
didactic, officious, or redundant.
------------------------------------------------------------------------
*From:* Gary Gregory <garydgreg...@gmail.com>
*Sent:* Tuesday, January 9, 2024 8:14:28 PM
*To:* Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
*Cc:* Joseph Kesselman <kesh...@alum.mit.edu>;
j-users@xalan.apache.org <j-users@xalan.apache.org>
*Subject:* Re: supplementary characters emojis, etc turned into
surrogate pairs
There is no need for private communications IMO unless it's a
security issue in which case, we have a private security mailing
list.
Gary
On Tue, Jan 9, 2024, 5:36 PM Eric J. Schwarzenbach
<eric.schwarzenb...@wrycan.com> wrote:
I've managed to make Xalan do something more correct for my
test case by merging some bits from the JDK 21 version of
ToStream into Xalan 2.7.3's version.
Note that with the JDK code, what it does is replace either
&#x1F4BB; or the literal character it represents (💻) with the
equivalent decimal character reference, &#128187;
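As a rough illustration of the difference between the two behaviors described in this thread, here is a minimal sketch. It is not the ToStream code from either Xalan or the JDK; the class and method names are made up for the example, and it only handles non-ASCII characters (markup characters such as & and < are left alone).

```java
// Hypothetical helper for illustration only; not taken from Xalan or the JDK.
final class NcrSketch {
    /** Escapes each non-ASCII code point as one decimal character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a valid high/low surrogate pair into one code point.
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by one or two UTF-16 code units, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** The problematic variant: every UTF-16 code unit is escaped on its own. */
    static String escapePerChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String laptop = new String(Character.toChars(0x1F4BB));
        System.out.println(escapeNonAscii("a" + laptop)); // prints a&#128187;
        System.out.println(escapePerChar("a" + laptop));  // prints a&#55357;&#56507;
    }
}
```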
Joe, I'm sending you an email directly about this since I
think it's beyond the scope of xalan-user.
On 1/9/24 13:34, Joseph Kesselman wrote:
No problem. I was around when we still had to teach people
the distinction, and error messages still sometimes get it wrong.
I'll try to look into it this week, unless someone beats me
to it.
------------------------------------------------------------------------
*From:* Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
*Sent:* Tuesday, January 9, 2024 12:39:07 PM
*To:* j-users@xalan.apache.org <j-users@xalan.apache.org>
*Subject:* Re: supplementary characters emojis, etc turned
into surrogate pairs
Apologies for the mistaken terminology. I do usually know the
difference between valid and well-formed and am usually
careful about the distinction; however, the idea that a
numeric character reference could break either was new to me,
and doesn't really fit my notion of either. I repeated
the characterization of the problem that I had read without
checking into it. Sorry for that, and thanks for looking into it.
Cheers,
Eric
On 1/8/24 18:58, Joseph Kesselman wrote:
Please be careful to distinguish "Well-Formed" from
"Valid". If an XML tool complains that a document is not
valid, that means it doesn't match the DTD or schema that
describes its expected structure, not that it isn't correct
XML. It's better to avoid using the term valid unless you
mean Valid in the sense XML does.
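To make the distinction concrete, here is a minimal sketch using the standard JAXP DOM parser. The class name, method names and sample documents are made up for the example. It treats a parse failure as "not well-formed" and any error reported by a validating parse against the document's own DTD as "not valid".

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class WellFormedVsValid {
    /** True if the text can be parsed at all, i.e. it is well-formed. */
    static boolean isWellFormed(String xml) {
        return countValidityErrors(xml, false) >= 0;
    }

    /** True if the text is well-formed and also matches its DTD. */
    static boolean isValid(String xml) {
        return countValidityErrors(xml, true) == 0;
    }

    // Returns -1 if the text is not well-formed, else the number of validity errors.
    private static int countValidityErrors(String xml, boolean validating) {
        AtomicInteger errors = new AtomicInteger();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity problems arrive via error(); well-formedness problems via
            // fatalError(), which DefaultHandler rethrows as an exception.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.incrementAndGet();
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException notWellFormed) {
            return -1;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return errors.get();
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";
        String doc = dtd + "<a><c/></a>"; // parses fine, but <c/> is not allowed by the DTD
        System.out.println("well-formed: " + isWellFormed(doc));
        System.out.println("valid:       " + isValid(doc));
    }
}
```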
A high-numeric-value surrogate pair shouldn't be a
well-formedness issue, barring a bug.
I haven't looked at this in any detail in a few decades, but
I'll try to at least sanity-check now that I'm emerging from
the build caverns again.
For convenience of others who might want it: Legal
characters in XML 1.0 are defined at
https://www.w3.org/TR/xml/#charsets. I believe XML 1.1 relaxed
this to admit most control characters (as character references),
though still not the null character. There are some
Unicode ranges excluded, but those were supposed to be
permanently reserved blocks.
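The Char production at that link can be written out as a small predicate. This is a sketch with a made-up class name, transcribing the XML 1.0 ranges. Because a character reference has to name a character matching Char, a reference to the whole code point (&#128187;) is acceptable, while a reference to a lone surrogate code point (such as &#55357;) is not.

```java
// Hypothetical helper for illustration only.
final class Xml10Chars {
    /**
     * XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x1F4BB)); // true: the laptop emoji as one code point
        System.out.println(isXml10Char(0xD83D));  // false: a surrogate code point
        System.out.println(isXml10Char(0x0));     // false: the null character
    }
}
```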
Xalan did originally have some issues with characters above
the 16-bit range (outside the Basic Multilingual Plane), mostly
having to do with counts and offsets since the first draft just
used Java characters. But I thought we had addressed those.
Obviously, if the fork shipping in javax has solved it, a
solution is possible and probably already in our backlog...
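The kind of count and offset mismatch mentioned here is easy to reproduce with plain Java strings. The following is a small sketch with made-up names, unrelated to any actual Xalan code, showing how the char-based and code-point-based views of the same string disagree once a supplementary character is present.

```java
// Hypothetical helper for illustration only.
final class SupplementaryCounts {
    /** Length in UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Length in Unicode code points, counting a surrogate pair once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "x" + new String(Character.toChars(0x1F4BB)) + "y";
        System.out.println(codeUnits(s));                // 4
        System.out.println(codePoints(s));               // 3
        // The third character ('y') sits at char index 3, not 2.
        System.out.println(s.offsetByCodePoints(0, 2));  // 3
    }
}
```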
------------------------------------------------------------------------
*From:* Stanimir Stamenkov via j-users
<j-users@xalan.apache.org>
*Sent:* Monday, January 8, 2024 2:51:57 PM
*To:* j-users@xalan.apache.org <j-users@xalan.apache.org>
*Subject:* Re: supplementary characters emojis, etc turned
into surrogate pairs
Mon, 8 Jan 2024 16:33:39 +0100, /Martin Honnen/:
> On 08/01/2024 16:28, Eric J. Schwarzenbach wrote:
>
>> Does anybody have a patch for
>>
>> https://issues.apache.org/jira/browse/XALANJ-2560
>>
>> That Xalan produces invalid XML with some UTF-8 characters seems
>> rather serious. I find putting &#x1F4BB; or the literal character it
>> represents (💻) into an XML document and running it through any
>> XML-to-XML transform results in it being replaced with
>> &#55357;&#56507; in the output which evidently makes the XML
>> invalid. I tried a change to ToStream.java from
>> https://issues.apache.org/jira/browse/XALANJ-2419
>> with the source of Xalan 2.7.3 but it did not help.
>
> Use Saxon, perhaps, or see whether
> https://stackoverflow.com/a/74245232/252228 helps for patching Xalan.
One may also use just the JDK-supplied provider (a Xalan fork):
* https://lists.apache.org/thread/3hzpj1gt1ql38d17dcfxrgss872v50l6
  "XML Entities"
Related to the patch referenced in the Stack Overflow answer, one may
compare with the JDK sources as well:
* https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java
* https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToHTMLStream.java
--
Stanimir