[jira] [Reopened] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

Joe Kesselman (Jira) Wed, 24 Jan 2024 08:24:04 -0800


     [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joe Kesselman reopened XALANJ-2419:
-----------------------------------

This is *PROBABLY* solved, but with the now-checked-in code I'm seeing some odd 
behavior: invoking the new ToXMLStreamTest from the command line fails, while 
running it under "build.sh apitest" succeeds. (The issues are in the 
ISO-8859-1 output.)

This may be an environment issue – my command line isn't explicitly setting the 
parser, for example, and it probably should. Or the apitest target may be 
failing to set the processor factory to point to Apache Xalan, and is running 
the JRE's shadowed version instead. Or I may have stepped on an input or gold 
file such that it isn't encoded as expected.
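
One way to tell these cases apart is to print which JAXP implementations each 
environment actually loads. The following is only a minimal sketch using the 
standard JAXP factory lookups; the class name FactoryCheck and its helper 
methods are hypothetical and are not part of the Xalan test harness.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Xalan or its test suite.
final class FactoryCheck {

    // Class name of the XSLT processor factory that JAXP resolves.
    static String transformerFactoryName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    // Class name of the SAX parser factory that JAXP resolves.
    static String saxParserFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Where a class was loaded from; JDK-internal classes usually have no code source.
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null) ? "(no code source, probably JDK-internal)"
                             : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        TransformerFactory tf = TransformerFactory.newInstance();
        SAXParserFactory spf = SAXParserFactory.newInstance();
        System.out.println("TransformerFactory: " + tf.getClass().getName()
                + " from " + locationOf(tf.getClass()));
        System.out.println("SAXParserFactory:   " + spf.getClass().getName()
                + " from " + locationOf(spf.getClass()));
    }
}
```

Running this once from the plain command line and once with the classpath and 
system properties the apitest target uses should show whether the two runs 
resolve the same processor and parser. The lookups honor the 
javax.xml.transform.TransformerFactory and javax.xml.parsers.SAXParserFactory 
system properties, so a factory name beginning with com.sun.org.apache would 
suggest the JRE's bundled copy rather than Apache Xalan.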

Investigating.

[Mentally insert grumpy-cat (Tardar Sauce) picture here]

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Assignee: Joe Kesselman
>            Priority: Major
>             Fix For: The Latest Development Code
>
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
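
For illustration, the behavior the report asks for can be sketched as below: a 
high/low surrogate pair is combined into one code point before any numeric 
character reference is written, so the output never contains an NCR for a 
surrogate value. This is a standalone sketch, not the attached patch and not 
the ToStream code; the class SurrogateNcrSketch and the method writeEscaped 
are hypothetical names, and treating an unpaired surrogate as an error is an 
assumption made here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical standalone sketch; not the code in org.apache.xml.serializer.ToStream.
final class SurrogateNcrSketch {

    // Copies chars to the writer, emitting each astral character as ONE NCR
    // built from the combined code point rather than one NCR per surrogate.
    static void writeEscaped(char[] chars, int start, int length, Writer writer)
            throws IOException {
        int end = start + length;
        for (int i = start; i < end; i++) {
            char ch = chars[i];
            if (Character.isHighSurrogate(ch)
                    && i + 1 < end
                    && Character.isLowSurrogate(chars[i + 1])) {
                int codePoint = Character.toCodePoint(ch, chars[i + 1]);
                writer.write("&#");
                writer.write(Integer.toString(codePoint));
                writer.write(';');
                i++; // the low surrogate has been consumed as part of the pair
            } else if (Character.isSurrogate(ch)) {
                // Assumption: an unpaired surrogate is reported, never escaped.
                throw new IOException("Unpaired surrogate at index " + i);
            } else {
                writer.write(ch);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        char[] in = "a\uD83D\uDE00b".toCharArray(); // U+1F600 between 'a' and 'b'
        StringWriter out = new StringWriter();
        writeEscaped(in, 0, in.length, out);
        System.out.println(out); // prints a&#128512;b
    }
}
```

A real serializer would also decide per encoding whether to write the 
character directly (for example in UTF-8) or as an NCR (for example in 
ISO-8859-1); the sketch always writes the NCR to keep the pairing logic 
visible.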



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org
