[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089348#comment-18089348
 ] 

Ilya Basin edited comment on XALANJ-2419 at 6/16/26 10:31 AM:
--------------------------------------------------------------

Everything was fine until an *{{<xsl:copy>}}* in one of our transforms split an astral 
character at a chunk boundary, which again produced two invalid XML entities. So 
we were forced to apply more patches from upstream.
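
To illustrate the chunk-boundary problem (this is only a sketch, not the upstream patch): if a {{characters()}} chunk ends in a high surrogate, that char can be held back and re-attached to the start of the next chunk, so the pair is never processed as two separate units. The class name {{SurrogateCarry}} and the method {{feed}} are made up for this example.

```java
// Hypothetical helper, not part of Xalan: holds back a trailing high
// surrogate so a surrogate pair is never split across two chunks.
final class SurrogateCarry {
    // High surrogate held back from the end of the previous chunk; 0 if none.
    private char pendingHigh = 0;

    /** Returns the text of this chunk that is safe to serialize now. */
    String feed(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder(length + 1);
        if (pendingHigh != 0) {
            out.append(pendingHigh); // re-attach the held-back high surrogate
            pendingHigh = 0;
        }
        out.append(chars, start, length);
        int n = out.length();
        if (n > 0 && Character.isHighSurrogate(out.charAt(n - 1))) {
            pendingHigh = out.charAt(n - 1); // wait for the low surrogate
            out.setLength(n - 1);
        }
        return out.toString();
    }
}
```

Feeding {{"a\uD83D\uDE00b"}} as two chunks of two chars each yields {{"a"}} and then the complete pair followed by {{"b"}}. A real serializer would also have to decide what to do when another event arrives between the two chunks, or when the document ends with a high surrogate still pending; the sketch ignores both cases.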

We also tried to clone and run [xalan-test 
xalan-j_2_7_3-rc10|https://github.com/apache/xalan-test/tree/xalan-j_2_7_3-rc10],
 but *{{ant check}}* failed with a missing target "{*}api{*}". After I 
reverted the commit that renamed that target, it failed to create the 
*{{results-api/Pass-XXX.xml}}* files, so we gave up.

{code:java}
commit dfb727767ccbebdb989049de89904521ee981610
Author: kubycsolutions <[email protected]>
Date:   Wed Feb 21 19:55:04 2024 -0500

    Document the characters()other()characters() issue if first char buffer ended in a high surrogate.

commit ec7f0e25d85192443a9fef2534e7625176fbfa4c
Author: kubycsolutions <[email protected]>
Date:   Wed Feb 21 14:51:48 2024 -0500

    This one's working for the test added in 2725. May not be cleanest solution, and I'm not sure whether any of the other surrogate handling needs similar fixes -- I don't know whether they ever run into the buffer break problem.

commit 856e896e42bc409e730ed5de0c1e5cd416b8bbc7
Author: kubycsolutions <[email protected]>
Date:   Mon Feb 19 17:03:53 2024 -0500

    refactoring

commit 162e1f0b4c71669e3c8da8c6d1b7b4ddcdda5789
Author: kubycsolutions <[email protected]>
Date:   Fri Feb 2 14:02:15 2024 -0500

    just documentation/parameter names

# This one is special: only 4 related files were patched:
# **/SerializerBase.java
# **/ToHTMLStream.java
# **/ToStream.java
# **/ToXMLStream.java
commit beb73389025828731a776d3e10de6cecd6bab1fd
Author: kubycsolutions <[email protected]>
Date:   Sun Oct 22 18:58:13 2023 -0400

    Deletions, additions, and modifications to complete Maven cut-over.
{code}
 




> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Assignee: Joe Kesselman
>            Priority: Major
>             Fix For: The Latest Development Code
>
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
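
A small sketch of the difference described above, for illustration only. It is not the Xalan code and not the attached fix; the class and method names are invented. To make the effect visible it escapes every non-ASCII character, although for UTF-8 output an astral character could simply be written as is. It also does not escape markup characters such as {{<}} or {{&}}.

```java
// Hypothetical illustration, not Xalan code.
final class NcrSketch {
    // Mirrors the problematic fallback: each UTF-16 unit is escaped on its
    // own, so an astral character becomes two NCRs for surrogate values.
    static String escapePerChar(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch < 0x80) {
                sb.append(ch);
            } else {
                sb.append("&#").append((int) ch).append(';');
            }
        }
        return sb.toString();
    }

    // Combines surrogate pairs first, so one character gives one NCR.
    static String escapePerCodePoint(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD; // unpaired surrogate: replacing it is a choice of this sketch
            }
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        }
        return sb.toString();
    }
}
```

For U+1F600 ({{"\uD83D\uDE00"}}), {{escapePerChar}} produces {{&#55357;&#56832;}}, which an XML parser rejects, while {{escapePerCodePoint}} produces the single reference {{&#128512;}}.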



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
