[ 
https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811608#comment-17811608
 ] 

Joseph Kesselman commented on XALANJ-2725:
------------------------------------------

Encodings aren't the only place where we use lookahead scanning. CDataSection 
termination also looks at `ch[i+1]` and `ch[i+2]` when recognizing the `]]>` 
delimiter sequence.  `writeUTF16Surrogate` is designed around lookahead.
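
To make the boundary concern concrete, here is a minimal sketch (not Xalan code; the class and field names are hypothetical) of a lookahead scan that carries a trailing high surrogate over to the next `characters()` call instead of reading past the end of the buffer. It assumes the low surrogate, if any, arrives as the first char of the next chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Xalan serializer.
final class ChunkedCodePointReader {
    // High surrogate left over from the previous chunk; 0 means none pending.
    private char pendingHigh = 0;
    final List<Integer> codePoints = new ArrayList<>();

    void characters(char[] ch, int start, int length) {
        int end = start + length;
        int i = start;
        // First resolve a surrogate carried over from the previous chunk.
        if (pendingHigh != 0 && i < end) {
            if (Character.isLowSurrogate(ch[i])) {
                codePoints.add(Character.toCodePoint(pendingHigh, ch[i]));
                i++;
            } else {
                codePoints.add((int) pendingHigh); // unpaired; passed through as-is
            }
            pendingHigh = 0;
        }
        while (i < end) {
            char c = ch[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 == end) {
                    pendingHigh = c; // lookahead would cross the buffer boundary
                    return;
                }
                if (Character.isLowSurrogate(ch[i + 1])) {
                    codePoints.add(Character.toCodePoint(c, ch[i + 1]));
                    i += 2;
                    continue;
                }
            }
            codePoints.add((int) c);
            i++;
        }
    }
}
```

The same carry-over idea would apply to a `]]>` scan, where up to two chars of lookahead may be missing at the end of a chunk.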

Double-checked the SAX API description of `characters()`, and it says, inter alia:



Individual characters may consist of more than one Java `char` value. There 
are two important cases where this happens, because characters can't be 
represented in just sixteen bits. In one case, characters are represented in a 
_Surrogate Pair_, using two special Unicode values. Such characters are in 
the so-called "Astral Planes", with a code point above U+FFFF. A second case 
involves composite characters, such as a base character combining with one or 
more accent characters.

Your code should not assume that algorithms using `char`-at-a-time idioms 
will be working in character units; in some cases they will split characters. 
This is relevant wherever XML permits arbitrary characters, such as attribute 
values, processing instruction data, and comments as well as in data reported 
from this method. It's also generally relevant whenever Java code manipulates 
internationalized text; the issue isn't unique to XML.
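
As a small illustration of the `char`-at-a-time point, the sketch below (hypothetical class name, standard `java.lang.Character` calls only) walks a buffer by code point rather than by `char`. Note that if the range ends in the middle of a surrogate pair, the lone high surrogate is still reported as its own code point, which is exactly the split the text warns about.

```java
// Hypothetical illustration; not Xalan code.
final class CodePointWalk {
    // Counts code points in ch[start .. start+length), stepping one or two
    // chars at a time depending on whether a surrogate pair is present.
    static int countCodePoints(char[] ch, int start, int length) {
        int count = 0;
        int end = start + length;
        for (int i = start; i < end; ) {
            int cp = Character.codePointAt(ch, i, end); // never reads past end
            i += Character.charCount(cp);
            count++;
        }
        return count;
    }
}
```

For the four-`char` buffer `"a\uD83D\uDE00b"`, this counts three code points over the whole buffer, but two over just the first two chars, because the pair has been cut in half.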

Hadn't even thought about the combining characters. If I remember correctly, 
those appear after the base character they're being applied to, but I don't 
remember at all how it's decided whether they are in any given encoding. Can we 
continue to just treat them as Unicode characters and assume the right things 
will happen?
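
For reference, a quick sketch using only the JDK (`java.text.Normalizer` and `CharsetEncoder.canEncode`; the class name is hypothetical, and this says nothing about how the Xalan serializer itself decides) of how one could check the ordering and ask whether a given encoding can represent a combining mark on its own:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical illustration; not Xalan code.
final class CombiningDemo {
    // Canonical decomposition: a precomposed letter becomes base + combining mark(s).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // True for the Unicode general categories Mn, Mc and Me.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Asks the JDK's ISO-8859-1 encoder whether it can represent this char.
    static boolean latin1CanEncode(char c) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
    }
}
```

Decomposing U+00E9 yields `e` followed by U+0301, so the mark does come after its base. ISO-8859-1 can encode the precomposed U+00E9 but not U+0301 by itself, so each combining mark is judged as an ordinary character in its own right.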

> Possible buffer-boundary issue when serializing surrogate pairs
> ---------------------------------------------------------------
>
>                 Key: XALANJ-2725
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2725
>             Project: XalanJ2
>          Issue Type: Improvement
>      Security Level: No security risk; visible to anyone(Ordinary problems in 
> Xalan projects.  Anybody can view the issue.) 
>          Components: Serialization
>            Reporter: Joe Kesselman
>            Assignee: Joe Kesselman
>            Priority: Major
>              Labels: Surrogates, escaping, unicode, utf
>         Attachments: astral-chars-split-buffer.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a 
> surrogate pair (two UTF-16 units), were not being serialized correctly. We 
> have a proposed fix for that.
> There is reported to still be an edge case when a surrogate pair which 
> crosses buffer boundaries might not be handled correctly. [~maxfortun] 
> offered what looks like a reasonable proposed fix 
> (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607),
>  but in my testing this was not serializing the surrogate pairs correctly, 
> causing regression on the tests XALANJ-2419 introduced. I don't know whether 
> that's because we're taking multiple paths through
> But the edge case does appear to be real, and if so we will need some such 
> solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org
