[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774695#comment-16774695 ]
Jesper Steen Møller commented on XALANJ-2419:
---------------------------------------------

Yeah, the problem is that org.apache.xml.serializer.EncodingInfo.isInEncoding(char, String) checks encodability using the "<Java encoding name>" found in Encodings.properties, and those entries use legacy names which have been purged from Java 11. So it gets an exception and concludes that ISO-8859-1 characters beyond 127 should be escaped.

Interestingly, I named the test case "ISO-8859-1 characters should come out as entities" wrong; it's the other way around: they should come out as chars, and that's what's being tested. As for the astral characters, they do come out as astral characters, but the test also had a \u00a4 character in its expected output, and due to the same problem, it came out as an entity.

Changing the charset name for the test is only a stop-gap measure, since the way the Java charset name is found from Encodings.properties is actually wrong: line 372 of org.apache.xml.serializer.Encodings.loadEncodingInfo() overwrites the MIME mapping for the proper encoding name (ISO-8859-1 in this case) with the last associated Java charset name seen in the file, which ends up being the worst one (i.e. not supported by the JRE). When you use the alternative name, it finds a non-mangled version, since it then has to look up by Java encoding name instead of MIME name. This should really be fixed.
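For what it's worth, a more robust way to answer "is this char representable in this encoding" is to ask the JDK's CharsetEncoder directly rather than going through name mappings in a properties file. This is only a sketch of the idea, not the Xalan code; the class and method names here are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingCheck {
    // Hypothetical helper: instead of looking up a (possibly legacy)
    // Java charset name from Encodings.properties, resolve the charset
    // via the JDK and ask its encoder whether the char is representable.
    static boolean isInEncoding(char ch, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(ch);
    }

    public static void main(String[] args) {
        // U+00A4 (CURRENCY SIGN) is representable in ISO-8859-1 (byte 0xA4),
        // so it should come out as a char, not as an entity.
        System.out.println(isInEncoding('\u00a4', "ISO-8859-1")); // true
        // U+20AC (EURO SIGN) is not in ISO-8859-1 (only in ISO-8859-15).
        System.out.println(isInEncoding('\u20ac', "ISO-8859-1")); // false
    }
}
```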
> Astral characters written as a pair of NCRs with the surrogate scalar values
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Priority: Major
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>         writer.write("&#");
>         writer.write(Integer.toString(ch));
>         writer.write(';');
>         lastDirtyCharProcessed = i;
>     }
>
> This leads to the wrong (latter) if branch running for surrogates, because
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters
> in it ends up in an ill-formed serialization and does not parse back using an
> XML parser.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
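To illustrate the fallback the quoted code should be taking: a surrogate pair must be joined into one code point and written as a single NCR, never as two NCRs carrying the raw surrogate scalar values. A minimal sketch, not the actual Xalan fix (the class and method names are hypothetical):

```java
public class AstralNcr {
    // Sketch of a correct NCR fallback: String.codePointAt joins a
    // high/low surrogate pair into a single supplementary code point,
    // so astral characters are emitted as one well-formed reference
    // rather than two references with surrogate scalar values.
    static String toNcr(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);        // joins surrogate pairs
            out.append("&#").append(cp).append(';');
            i += Character.charCount(cp);     // advances 2 for astral chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10000 encoded in Java as the surrogate pair D800 DC00:
        // one NCR for the code point, not &#55296;&#56320;.
        System.out.println(toNcr("\uD800\uDC00")); // &#65536;
    }
}
```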