[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

Jason Harrop (JIRA) Thu, 21 Feb 2019 17:56:19 -0800


    [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774677#comment-16774677
 ]


Jason Harrop commented on XALANJ-2419:
--------------------------------------

It works under Java 11 if I change makeStream("ISO-8859-1") to 
makeStream("ISO8859_1").

With makeStream("ISO-8859-1"), s.getBytes(encoding) throws 
UnsupportedEncodingException for encoding 8859-1 at

{code:java}
        EncodingInfo.inEncoding(char, String) line: 438 
        EncodingInfo$EncodingImpl.isInEncoding(char) line: 226  
        EncodingInfo$EncodingImpl.isInEncoding(char) line: 215  
        EncodingInfo.isInEncoding(char) line: 113       
        ToXMLStream(ToStream).characters(char[], int, int) line: 1597   
        ToXMLStream(ToStream).characters(String) line: 1774     
        ToXMLStreamTest(ToStreamTest).outputCharacters(ToStream, String) line: 88
        ToXMLStreamTest.testCase2() line: 114   
        NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
        NativeMethodAccessorImpl.invoke(Object, Object[]) line: 62      
        DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 43  
        Method.invoke(Object, Object...) line: 566      
        Reporter.executeTests(Test, int, Object) line: 787      
        ToXMLStreamTest(FileBasedTest).runTestCases(Properties) line: 339       
        ToXMLStreamTest(TestImpl).runTest(Properties) line: 205 
        ToXMLStreamTest(FileBasedTest).doMain(String[]) line: 833       
        ToXMLStreamTest.main(String[]) line: 196        
{code}
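
For anyone wanting to check the encoding names outside of Xalan, here is a minimal sketch that asks the running JDK whether String.getBytes accepts each name. The class name EncodingNameCheck and the helper canEncode are made up for illustration and are not part of the test suite. "ISO-8859-1" and "ISO8859_1" are both names the JDK accepts; whether "8859-1" resolves depends on the JDK's alias table, so the sketch only reports it.

```java
import java.io.UnsupportedEncodingException;

public class EncodingNameCheck {

    // Hypothetical helper: true if String.getBytes accepts the given name.
    static boolean canEncode(String name) {
        try {
            "\u00e9".getBytes(name);
            return true;
        } catch (UnsupportedEncodingException e) {
            // Same exception type as reported in the stack trace above.
            return false;
        }
    }

    public static void main(String[] args) {
        // The last name is the one quoted in the exception message above.
        for (String name : new String[] {"ISO-8859-1", "ISO8859_1", "8859-1"}) {
            System.out.println(name + " -> " + canEncode(name));
        }
    }
}
```

If the first two names report true while the serializer still fails, that would suggest the name is being altered before it reaches getBytes, but that is a guess and not something this sketch can confirm.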

Not related to 2419, but FYI one other test fails because of the date 
formatting changes introduced by JEP 252 (CLDR locale data used by default 
since Java 9): http://openjdk.java.net/jeps/252

I've put the test code on GitHub; for Java 11 I am using 
https://github.com/plutext/xalan-test/tree/Plutext_Java11_xalan-j_2_7_x


> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Priority: Major
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
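
To make the quoted description concrete, here is a minimal sketch contrasting per-char escaping (what the fallback branch quoted above effectively does for a surrogate) with per-code-point escaping. It is not the ToStream code and not the attached patch; the class and method names are hypothetical, and treating every non-ASCII character as needing an NCR is a simplification for the demo.

```java
public class SurrogateNcrSketch {

    // Escapes each UTF-16 char on its own, so an astral character
    // comes out as two NCRs carrying the surrogate values.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                out.append(ch);
            } else {
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    // Walks the string by code point, so a surrogate pair is combined
    // into one scalar value before it is written as an NCR.
    // Note: a lone (unpaired) surrogate is not handled specially here.
    static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String astral = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(escapePerChar(astral));      // &#55357;&#56832;
        System.out.println(escapePerCodePoint(astral)); // &#128512;
    }
}
```

The first form is what an XML parser rejects, since 55357 and 56832 are surrogate values and not legal XML characters; the second is a single reference to the actual character.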



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
