[jira] [Commented] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs

Joseph Kessselman (Jira) Sat, 27 Jan 2024 13:12:06 -0800


    [ 
https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811593#comment-17811593
 ]


Joseph Kessselman commented on XALANJ-2725:
-------------------------------------------

Can you share those edge cases?

I've been looking at this, and for that one case it addresses, it does work. 
Problem is, I'm looking at the implied additional complications...

Ideally, if we're going to the trouble of outputting isolated low surrogates as 
NCRs, we should also output isolated high surrogates. That means if we get a 
high surrogate NOT followed by a low, we should probably dump it as an NCR 
before proceeding. It will still be wrong, of course, but it will at least be 
diagnostically useful, and the test (!=0) is relatively cheap. That may require 
moving the high-surrogate handling to the front of the loop. ON THE OTHER HAND, 
we do in fact have an error message for those cases, as seen in 
{*}writeUTF16Surrogate{*}: 

'''
            throw new IOException(
                Utils.messages.createMessage(
                    MsgKey.ER_INVALID_UTF16_SURROGATE,
                    new Object[] \{ Integer.toHexString((int) c)}));
        }
'''


There's still a question about how many places we need to make this change. 
isHighUTF16Surrogate is tested five different places in *ToStream* in the 
current master code. There are fourteen separate places where we write out NCEs 
(in
{*}accumDefaultEscape(), characters(), writeAttrString(), 
writeNormalizedChars(), writeUTF16Surrogate(){*}). Some of them are also doing 
the lookahead, some (as noted above) simply fail if the UTF16 is not 
well-formed.

It's starting to feel like we should put possibly rationalizing this whole 
class on our backlog.

Admittedly some of the duplications may have been performance optimizations. 
But gods only know whether they still are, if so.

> Possible buffer-boundry issue when serializing surrogate pairs
> --------------------------------------------------------------
>
>                 Key: XALANJ-2725
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2725
>             Project: XalanJ2
>          Issue Type: Improvement
>      Security Level: No security risk; visible to anyone(Ordinary problems in 
> Xalan projects.  Anybody can view the issue.) 
>          Components: Serialization
>            Reporter: Joe Kesselman
>            Assignee: Joe Kesselman
>            Priority: Major
>              Labels: Surrogates, escaping, unicode, utf
>         Attachments: astral-chars-split-buffer.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a 
> surrogate pair (two UTF-16 units), were not being serialized correctly. We 
> have a proposed fix for that.
> There is reported to still be an edge case when a surrogate pair which 
> crosses buffer boundaries might not be handled correctly. [~maxfortun] 
> offered what looks like a reasonable proposed fix 
> (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607),
>  but in my testing this was not serializing the surrogate pairs correctly, 
> causing regression on the tests XALANJ-2419 introduced. I don't know whether 
> that's because we're taking multiple paths through
> But the edge case does appear to be real, and if so we will need some such 
> solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org

[jira] [Commented] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs

Reply via email to