[
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089348#comment-18089348
]
Ilya Basin edited comment on XALANJ-2419 at 6/16/26 10:31 AM:
--------------------------------------------------------------
Everything worked until an *{{<xsl:copy>}}* in one of our transforms split an
astral character at a chunk boundary, which again produced two invalid XML
character references (one per surrogate). So we were forced to apply more
patches from upstream.
We also tried to clone and run [xalan-test
xalan-j_2_7_3-rc10|https://github.com/apache/xalan-test/tree/xalan-j_2_7_3-rc10],
but *{{ant check}}* failed with a missing target "{*}api{*}". After I reverted
the commit that renamed that target, it failed to create the
*{{results-api/Pass-XXX.xml}}* files, so we gave up.
{code:java}
commit dfb727767ccbebdb989049de89904521ee981610
Author: kubycsolutions <[email protected]>
Date: Wed Feb 21 19:55:04 2024 -0500

    Document the characters()other()characters() issue if first char buffer
    ended in a high surrogate.

commit ec7f0e25d85192443a9fef2534e7625176fbfa4c
Author: kubycsolutions <[email protected]>
Date: Wed Feb 21 14:51:48 2024 -0500

    This one's working for the test added in 2725. May not be cleanest
    solution, and I'm not sure whether any of the other surrogate handling needs
    similar fixes -- I don't know whether they ever run into the buffer break
    problem.

commit 856e896e42bc409e730ed5de0c1e5cd416b8bbc7
Author: kubycsolutions <[email protected]>
Date: Mon Feb 19 17:03:53 2024 -0500

    refactoring

commit 162e1f0b4c71669e3c8da8c6d1b7b4ddcdda5789
Author: kubycsolutions <[email protected]>
Date: Fri Feb 2 14:02:15 2024 -0500

    just documentation/parameter names

# This one is special: only 4 related files were patched:
# **/SerializerBase.java
# **/ToHTMLStream.java
# **/ToStream.java
# **/ToXMLStream.java
commit beb73389025828731a776d3e10de6cecd6bab1fd
Author: kubycsolutions <[email protected]>
Date: Sun Oct 22 18:58:13 2023 -0400

    Deletions, additions, and modifications to complete Maven cut-over.
{code}
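For readers unfamiliar with the chunk-boundary problem mentioned in the comment, here is a minimal sketch of one way to handle it. It is not the upstream patch: the class {{ChunkBoundarySketch}} and its methods are hypothetical names invented for illustration. The idea is to hold back a high surrogate that ends one {{characters()}} chunk and combine it with the low surrogate that starts the next one. To force the escaping path, the sketch writes every non-ASCII code point as a numeric character reference, and it ignores markup escaping and the case where some other event arrives between the two chunks.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical illustration only; not the code from the upstream Xalan commits.
class ChunkBoundarySketch {
    private final Writer out;
    // High surrogate left over from the previous chunk; 0 means "none pending".
    private char pendingHigh = 0;

    ChunkBoundarySketch(Writer out) {
        this.out = out;
    }

    // Mirrors the shape of a SAX characters() callback.
    void characters(char[] chars, int start, int length) throws IOException {
        int end = start + length;
        int i = start;
        if (pendingHigh != 0 && i < end) {
            // Complete the pair that the previous chunk boundary split.
            if (!Character.isLowSurrogate(chars[i])) {
                throw new IOException("High surrogate not followed by a low surrogate");
            }
            writeCodePoint(Character.toCodePoint(pendingHigh, chars[i]));
            pendingHigh = 0;
            i++;
        }
        for (; i < end; i++) {
            char ch = chars[i];
            if (Character.isHighSurrogate(ch)) {
                if (i + 1 == end) {
                    pendingHigh = ch; // hold it back until the next chunk arrives
                } else if (Character.isLowSurrogate(chars[i + 1])) {
                    writeCodePoint(Character.toCodePoint(ch, chars[++i]));
                } else {
                    throw new IOException("Unpaired high surrogate");
                }
            } else if (Character.isLowSurrogate(ch)) {
                throw new IOException("Unpaired low surrogate");
            } else {
                writeCodePoint(ch);
            }
        }
    }

    // Call at end of output: a still-pending high surrogate can never be completed.
    void finish() throws IOException {
        if (pendingHigh != 0) {
            throw new IOException("Input ended in a high surrogate");
        }
        out.flush();
    }

    // ASCII is written as-is; everything else becomes one NCR per code point.
    private void writeCodePoint(int cp) throws IOException {
        if (cp < 0x80) {
            out.write(cp);
        } else {
            out.write("&#" + cp + ";");
        }
    }

    // Convenience driver: feeds each string as a separate characters() chunk.
    static String serialize(String... chunks) {
        try {
            StringWriter sw = new StringWriter();
            ChunkBoundarySketch s = new ChunkBoundarySketch(sw);
            for (String chunk : chunks) {
                char[] c = chunk.toCharArray();
                s.characters(c, 0, c.length);
            }
            s.finish();
            return sw.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, {{ChunkBoundarySketch.serialize("a\uD83D", "\uDE00b")}} gives the same result as passing the unsplit string {{"a\uD83D\uDE00b"}}: a single reference for U+1F600 instead of one per surrogate.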
> Astral characters written as a pair of NCRs with the surrogate scalar values
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
> Issue Type: Bug
> Components: Serialization
> Affects Versions: 2.7.1
> Reporter: Henri Sivonen
> Assignee: Joe Kesselman
> Priority: Major
> Fix For: The Latest Development Code
>
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>         writer.write("&#");
>         writer.write(Integer.toString(ch));
>         writer.write(';');
>         lastDirtyCharProcessed = i;
>     }
> This leads to the wrong (latter) if branch running for surrogates, because
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters
> in it ends up in an ill-formed serialization and does not parse back using an
> XML parser.
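To make the reported effect concrete, the following sketch contrasts escaping per UTF-16 code unit (what the quoted fallback branch effectively does when it is reached for a surrogate) with escaping per code point. The class and method names are hypothetical and none of this is Xalan code. For simplicity it escapes every non-ASCII character, which the real serializer does not do, and it does not reject unpaired surrogates.

```java
// Hypothetical illustration; class and method names are not part of Xalan.
class SurrogateNcrSketch {

    // One NCR per UTF-16 code unit: a surrogate pair becomes two references,
    // each naming a surrogate value, which XML does not allow.
    static String escapePerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                sb.append(ch);
            } else {
                sb.append("&#").append(Integer.toString(ch)).append(';');
            }
        }
        return sb.toString();
    }

    // One NCR per code point: a surrogate pair is combined first, so an
    // astral character yields a single reference to its scalar value.
    static String escapePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String astral = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(escapePerChar(astral));      // &#55357;&#56832;
        System.out.println(escapePerCodePoint(astral)); // &#128512;
    }
}
```

The first output is the ill-formed form described in the report; the second is a reference an XML parser accepts.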