I'll try to summarize the key points of my more long-winded email to Joe:

The ToStream.java in Xalan cannot be wholesale replaced with the ToStream in the JDK without dragging other classes along with it (even accounting for the obvious package name changes). Or at least that appeared to be the case on the surface; I did not try it to see how far it would go. Instead I copied only those changes that appeared relevant to the issue at hand (which I'll just call "the encoding issue").

The JDK version of this class differs from the Xalan version in several ways: coding style updates, a few renamings and small refactorings, changes relating to output formatting (including an added buffering step during character processing), some differences in the initialization of byte arrays that the comments suggest are performance optimizations, and of course changes relating to the encoding issue. It is possible those optimizations were made against the older code that was forked, and Xalan may have already optimized the same things in a different way.

It's not clear to me whether Xalan would want some of these other changes. Perhaps someone on this list knows something about the state of Xalan wrt formatting and whether these changes might be of interest. I also didn't address ToHTMLStream.java.


On 1/9/24 21:47, Gary Gregory wrote:
I am not concerned so much as to which list the conversation is recorded on as long as it's on a project mailing list. You all can do what you want of course but keep in mind that excluding a mailing list removes the opportunity for people to observe and learn. It's just simpler IMO to keep it on a list. You never know who might chime in with an interesting tidbit or solution that day, or a week or a month later.

My 2c,
Gary

On Tue, Jan 9, 2024, 9:16 PM Joseph Kesselman <kesh...@alum.mit.edu> wrote:

    Gary, at some level of detail it ought to at least cross over into
    xalan-dev.

    The user list shouldn't usually get into the implementation weeds,
    though discussion of the correct and/or desired behavior is still
    appropriate here.

    Personally I would say some private brainstorming is harmless at
    worst, as long as conclusions get reported back to the team. It's
    impossible to prevent, and it _can_ be a good thing in letting
    people explore something before making a more formal proposal.
    Both formal and informal have their uses. I mean, come on, does
    your team thrash out every line of code in meetings, or do you go
    off and work with others to come up with a proposal/prototype first?

    Let water-cooler chats happen, and just count on folks reporting
    back. It works.

    --
       /_  Joe Kesselman (he/him/his)
    -/ _) My Alexa skill for New Music/New Sounds fans:
       / https://www.amazon.com/dp/B09WJ3H657/

    Caveat: Opinionated old geezer with overcompensated writer's
    block. May be redundant, verbose, prolix, sesquipedalian,
    didactic, officious, or redundant.
    ------------------------------------------------------------------------
    *From:* Gary Gregory <garydgreg...@gmail.com>
    *Sent:* Tuesday, January 9, 2024 8:14:28 PM
    *To:* Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
    *Cc:* Joseph Kesselman <kesh...@alum.mit.edu>;
    j-users@xalan.apache.org <j-users@xalan.apache.org>
    *Subject:* Re: supplementary characters emojis, etc turned ino
    surrogate pairs
    There is no need for private communications IMO unless it's a
    security issue in which case, we have a private security mailing
    list.

    Gary

    On Tue, Jan 9, 2024, 5:36 PM Eric J. Schwarzenbach
    <eric.schwarzenb...@wrycan.com> wrote:

        I've managed to make Xalan do something more correct for my
        test case by merging some bits from the JDK 21 version of
        ToStream into Xalan 2.7.3's version.

        Note that what the JDK code does is replace either
        &#x1F4BB; or the literal character it represents with the
        equivalent decimal entity, &#128187;
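To make the two behaviors concrete, here is a minimal Java sketch. The class and method names are hypothetical and this is not the Xalan or JDK serializer code; it only contrasts escaping each UTF-16 code unit separately (which gives the output reported in XALANJ-2560) with escaping each whole code point (which gives the output the JDK produces). It ignores markup escaping of `<` and `&` entirely.

```java
// Hypothetical illustration only; not taken from Xalan or the JDK.
class CharRefSketch {

    // Escapes every non-ASCII code point as ONE decimal character reference.
    static String escapePerCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // joins a valid surrogate pair
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);   // advances 1 or 2 chars
        }
        return out.toString();
    }

    // Escapes every non-ASCII UTF-16 code unit on its own, so a
    // supplementary character becomes two surrogate references.
    static String escapePerChar(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String laptop = new String(Character.toChars(0x1F4BB));
        System.out.println(escapePerChar(laptop));      // &#55357;&#56507;
        System.out.println(escapePerCodePoint(laptop)); // &#128187;
    }
}
```

The key difference is `codePointAt` plus `Character.charCount`, which treat a high/low surrogate pair as a single character.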

        Joe, I'm sending you an email directly about this since I
        think it's beyond the scope of xalan-user.


        On 1/9/24 13:34, Joseph Kesselman wrote:
        No problem. I was around when we still had to teach people
        the distinction, and error messages still sometimes get it wrong.

        I'll try to look into it this week, unless someone beats me
        to it.

        ------------------------------------------------------------------------
        *From:* Eric J. Schwarzenbach <eric.schwarzenb...@wrycan.com>
        *Sent:* Tuesday, January 9, 2024 12:39:07 PM
        *To:* j-users@xalan.apache.org <j-users@xalan.apache.org>
        *Subject:* Re: supplementary characters emojis, etc turned
        ino surrogate pairs

        Apologies for the mistaken terminology. I do usually know the
        difference between valid and well-formed and am usually
        careful about the distinction; however, the idea that a
        numeric character entity could break either was new to me,
        and doesn't really fit my notion of either. I repeated
        the characterization of the problem that I had read without
        checking into it. Sorry for that, and thanks for looking into it.

        Cheers,

        Eric

        On 1/8/24 18:58, Joseph Kesselman wrote:
        Please be careful to distinguish "Well-Formed" from
        "Valid". If an XML tool complains that a document is not
        valid, that means it doesn't match the DTD or schema that
        describes its expected structure, not that it isn't correct
        XML. It's better to avoid using the term valid unless you
        mean Valid in the sense XML does.

        A high-numeric-value surrogate pair shouldn't be a
        well-formedness issue, barring a bug.

        I haven't looked at this in any detail in a few decades, but
        I'll try to at least sanity-check now that I'm emerging from
        the build caverns again.


        For convenience of others who might want it: Legal
        characters in XML 1.0 are defined at
        https://www.w3.org/TR/xml/#charsets. I believe XML 1.1 relaxed
        this to admit most control characters (as character
        references), though still not the null character. There are
        some Unicode ranges excluded, but those were supposed to be
        permanently reserved blocks.
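For anyone who wants to check a particular value, the Char production at that link can be written as a small predicate. This is a sketch with a hypothetical class name, transcribing the XML 1.0 ranges directly. As I read the spec's "Legal Character" constraint, a character reference has to match the same production, which is why the individual surrogate values fail while the full code point passes.

```java
// Hypothetical helper; the ranges are the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
class XmlCharSketch {

    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x1F4BB)); // true  (&#128187;)
        System.out.println(isXml10Char(55357));   // false (high surrogate 0xD83D)
        System.out.println(isXml10Char(56507));   // false (low surrogate 0xDCBB)
        System.out.println(isXml10Char(0));       // false (null character)
    }
}
```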

        Xalan did originally have some issues with characters above
        the 16-bit range (those UTF-16 encodes as surrogate pairs),
        mostly having to do with counts and offsets, since the first
        draft just used Java characters. But I thought we had
        addressed those. Obviously, if the fork shipping in the JDK
        has solved it, a solution is possible and probably already in
        our backlog...
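A small sketch of the kind of count/offset mismatch meant here, using only standard `String` and `Character` methods. The class and method names are made up for illustration and are not Xalan code.

```java
// Hypothetical illustration of char counts vs. code point counts.
class CodePointCountSketch {

    // "a", U+1F4BB (one character, two Java chars), "b"
    static String sample() {
        return "a" + new String(Character.toChars(0x1F4BB)) + "b";
    }

    // Number of Unicode characters, as opposed to UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // char index at which the n-th code point (0-based) starts.
    static int indexOfNthCodePoint(String s, int n) {
        return s.offsetByCodePoints(0, n);
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                 // 4 Java chars
        System.out.println(codePoints(s));              // 3 characters
        System.out.println(indexOfNthCodePoint(s, 2));  // 3, where "b" starts
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

Any code that indexes by `char` and assumes one `char` per character will be off by one for every supplementary character it passes.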



        ------------------------------------------------------------------------
        *From:* Stanimir Stamenkov via j-users
        <j-users@xalan.apache.org>
        *Sent:* Monday, January 8, 2024 2:51:57 PM
        *To:* j-users@xalan.apache.org <j-users@xalan.apache.org>
        *Subject:* Re: supplementary characters emojis, etc turned
        ino surrogate pairs
        Mon, 8 Jan 2024 16:33:39 +0100, /Martin Honnen/:
        > On 08/01/2024 16:28, Eric J. Schwarzenbach wrote:
        >
        >> Does anybody have a patch for
        >>
        >> https://issues.apache.org/jira/browse/XALANJ-2560
        >>
        >> That Xalan produces invalid XML with some utf-8 characters
        >> seems rather serious. I find putting &#x1F4BB; or the literal
        >> character it represents into an XML document and running it
        >> through any XML-to-XML transform results in it being replaced
        >> with &#55357;&#56507; in the output which evidently makes the
        >> XML invalid. I tried a change to ToStream.java from
        >> https://issues.apache.org/jira/browse/XALANJ-2419
        >> with the source of Xalan 2.7.3 but it did not help.
        >
        > Use Saxon, perhaps, or see whether
        > https://stackoverflow.com/a/74245232/252228 helps for
        > patching Xalan.

        One may also use just the JDK-supplied provider (a Xalan fork):

        * https://lists.apache.org/thread/3hzpj1gt1ql38d17dcfxrgss872v50l6
          "XML Entities"
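A minimal sketch of selecting the JDK-supplied provider explicitly, assuming Java 9 or later. `TransformerFactory.newDefaultInstance()` returns the JDK's built-in implementation even when a Xalan jar is on the classpath. The class name, the sample document and the choice of US-ASCII output encoding are assumptions made for illustration; the exact serialized output depends on the JDK version, so none is claimed here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical test harness; not part of Xalan or the JDK.
class JdkIdentitySketch {

    // Runs an identity transform using the JDK's built-in provider.
    static String identity(String xml, String encoding) {
        try {
            // Java 9+: bypasses the usual provider lookup.
            TransformerFactory factory = TransformerFactory.newDefaultInstance();
            Transformer t = factory.newTransformer(); // no stylesheet: identity
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An encoding that cannot represent U+1F4BB forces the serializer
        // to escape it, which is where the two implementations differ.
        System.out.println(identity("<doc>&#x1F4BB;</doc>", "US-ASCII"));
    }
}
```

Running the same input through `TransformerFactory.newInstance()` with Xalan 2.7.3 on the classpath should make the difference visible side by side.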

        Related to the patch referenced in the Stack Overflow
        answer, one may compare with the JDK sources as well:

        * https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java
        * https://github.com/openjdk/jdk/blob/jdk-21-ga/src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToHTMLStream.java

-- Stanimir
