[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775205#comment-16775205 ]

Jesper Steen Møller commented on XALANJ-2419:
---------------------------------------------

Ok, now I get what's wrong with the encoding. It's bad.

The file `Encodings.properties` contains mappings between several "Java names" 
(which probably made sense in the last millennium) and MIME names (which are 
what should appear in the XML inputs). There's some logic to register only the 
first one, but since that iterates over a Properties object, the order will NOT 
match the order in the file. In other words, it's unpredictable. That's why it 
suddenly worked when you specified ISO8859_1. I don't know what changed for 
Java 11, it could be the hashtable ordering or the accepted charset names, but 
the crux is that "8859-1" is NOT an acceptable Java name:
{code:java}
jshell> "\u00e8".getBytes("ISO-8859-1")
$1 ==> byte[1] { -24 } 
jshell> "\u00e8".getBytes("8859_1")
$2 ==> byte[1] { -24 }
 
jshell> "\u00e8".getBytes("8859-1")
|  Exception java.io.UnsupportedEncodingException: 8859-1
|        at StringCoding.encode (StringCoding.java:427)
|        at String.getBytes (String.java:941)
|        at (#3:1)
jshell> "\u00e8".getBytes("ISO8859-1")
$4 ==> byte[1] { -24 }
jshell> "\u00e8".getBytes("ISO8859_1")
$5 ==> byte[1] { -24 }
jshell>
{code}
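To see why the registration order is unpredictable: Properties extends 
Hashtable, so enumerating it follows hash buckets rather than insertion (or 
file) order. A minimal sketch with made-up entries mirroring the encodings 
table (not the real file contents):
{code:java}
import java.util.Enumeration;
import java.util.Properties;

public class PropertiesOrderDemo {
    public static void main(String[] args) {
        // Hypothetical entries: several "Java names" for the same MIME name,
        // in the order they might appear in the file.
        Properties props = new Properties();
        props.setProperty("8859-1", "ISO-8859-1");
        props.setProperty("8859_1", "ISO-8859-1");
        props.setProperty("ISO8859_1", "ISO-8859-1");

        // Properties is a Hashtable, so this enumeration follows hash buckets,
        // not the order the entries were added. Whichever name comes out first
        // "wins" the registration, and that is not defined by the file.
        for (Enumeration<?> names = props.propertyNames(); names.hasMoreElements(); ) {
            System.out.println(names.nextElement());
        }
    }
}
{code}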
 

Possible fix: Remove the line "8859-1     ISO-8859-1     0x00FF" and similar 
patterns from `Encodings.properties`?
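
Or, as an alternative sketch (not the attached patch): whatever loads 
`Encodings.properties` could simply skip any "Java name" the runtime itself 
rejects, for example:
{code:java}
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class JavaNameCheck {
    // Returns true only if the runtime accepts the name as a charset, so
    // entries like "8859-1" would be skipped instead of shadowing a valid one.
    static boolean isUsableJavaName(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // syntactically illegal names
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsableJavaName("8859-1"));     // false (see jshell output above)
        System.out.println(isUsableJavaName("8859_1"));     // true, alias of ISO-8859-1
        System.out.println(isUsableJavaName("ISO-8859-1")); // true
    }
}
{code}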

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Priority: Major
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
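
The ill-formed output described above is easy to reproduce outside the 
serializer. A minimal sketch (not the attached fix) of the difference between 
writing the two surrogate scalar values and writing one NCR for the combined 
code point:
{code:java}
public class SurrogateNcrDemo {
    public static void main(String[] args) {
        // U+1D49C MATHEMATICAL SCRIPT CAPITAL A, an astral character.
        String astral = new String(Character.toChars(0x1D49C));
        char high = astral.charAt(0);
        char low = astral.charAt(1);

        // What the buggy branch effectively produces: two NCRs carrying the
        // surrogate scalar values, which no XML parser will accept.
        System.out.println("&#" + (int) high + ";&#" + (int) low + ";"); // &#55349;&#56476;

        // Well-formed alternative: one NCR for the combined code point.
        System.out.println("&#" + Character.toCodePoint(high, low) + ";"); // &#119964;
    }
}
{code}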


