[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

Uwe Schindler (JIRA) Sun, 15 Apr 2018 15:49:45 -0700

    [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438848#comment-16438848
 ]


Uwe Schindler edited comment on XALANJ-2419 at 4/15/18 10:48 PM:
-----------------------------------------------------------------

Unfortunately, the patch does not fix the problem for attributes. Attribute 
output improves with it, but the serializer now writes the correct character 
and then one surrogate of the pair again as a decimal escape.

With the patch, the Policeman emoji is serialized correctly when it is part of 
a text node.

But inside an attribute the policeman emoji comes out like:

{code:xml}
<img alt="Uwe 👮&#55357; Schindler" 
{code}

This seems to happen in ToHTMLStream.
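
For reference, here is a minimal Java sketch of how the policeman emoji (U+1F46E) splits into UTF-16 code units. It shows that the stray escape above, 55357, is 0xD83D, the high (first) surrogate of the pair. The class and method names are illustrative only and are not part of Xalan.

```java
class SurrogateHalves {
    /** Returns the UTF-16 code units of a code point as plain int values. */
    public static int[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] out = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            out[i] = units[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int policeman = 0x1F46E; // the emoji from the attribute example
        int[] units = utf16Units(policeman);
        System.out.println("high surrogate: " + units[0]); // 55357 (0xD83D)
        System.out.println("low surrogate:  " + units[1]); // 56430 (0xDC6E)
        System.out.println("scalar value:   " + policeman); // 128110
    }
}
```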


> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Priority: Major
>         Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> As a result, surrogates take the wrong (latter, else) branch, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
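
To illustrate the behavior the description asks for, here is a hedged Java sketch, not the attached patch and not Xalan code. It combines each surrogate pair into its scalar value first, then either writes the character as-is or writes a single NCR for the whole code point. The `encodingCoversAstral` flag is a hypothetical stand-in for an encoding check such as `isInEncoding()`, and the sketch deliberately ignores the escaping of `<`, `&` and quotes.

```java
class NcrSketch {
    /**
     * Copies text to the output, handling astral characters per code point.
     * encodingCoversAstral is a hypothetical flag standing in for a real
     * "can the output encoding represent this?" check.
     */
    public static String escapeAstral(String text, boolean encodingCoversAstral) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);       // joins a valid surrogate pair
            int width = Character.charCount(cp); // 2 for astral, otherwise 1
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                // A lone surrogate has no scalar value and must never become an NCR.
                throw new IllegalArgumentException("unpaired surrogate at index " + i);
            }
            if (width == 2 && !encodingCoversAstral) {
                out.append("&#").append(cp).append(';'); // one NCR for the pair
            } else {
                out.appendCodePoint(cp);
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "Uwe \uD83D\uDC6E Schindler";
        System.out.println(escapeAstral(input, false)); // Uwe &#128110; Schindler
    }
}
```

With a UTF-8 output the flag would be true and the emoji would pass through unchanged. Rejecting unpaired surrogates with an exception is one possible policy chosen for this sketch; a real serializer might report the error differently.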



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
