[jira] [Commented] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs

Joe Kesselman (Jira) Tue, 23 Jan 2024 18:43:03 -0800


    [ 
https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810175#comment-17810175
 ]


Joe Kesselman commented on XALANJ-2725:
---------------------------------------

Unfortunately, no, the patch doesn't seem to be working for me. For both 
ToXMLStreamTest and ToHTMLStreamTest, the UTF-8 pass reports:


{code:java}
<checkresult result="Fail" desc="Astral characters should come out 
unscathed"/>{code}
{{{}
{}}}I should admit that I don't, at first glance, see +why+ it's failing; might 
be time to fire up the debugger and watch it fail.{{{}
{}}}

 
----
 

Note that we have some annoyingly parallel solutions in this class – 
isHighSurrogate() is tested five different places under multiple serializing 
loops, One of them, in the *characters* method, actually does admit that 
there's a buffer bounds risk in its look-ahead; that's the one at line 1598 of 
my copy:
{{      }}
{code:java}
else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && 
Encodings.isLowUTF16Surrogate(chars[i+1])) {{code}

though it doesn't do more than dump the surrogates as Numeric Character 
Entities when the boundary is crossed.  The others (two in 
*writeNormalizedCharacters,* one in {*}accumDefaultEscape{*}, one in 
{*}writeAttrString{*}) appear to blithely assume that if buffer division is 
possible, that's been done on Unicode Character, rather than UTF16 unit, 
boundaries. Which is what I would expect to happen in most of our code. But 
mistakes get made, and the serializer APIs can be invoked from code other than 
Xalan, so guarding against this isn't an unreasonable idea.

In fact, that would probably be the right way to test this – write unit-test 
code that exercises ToStream directly, rather than trying to do so from the 
functional-test level.


----


Some off-the-cuff code review on your patch while I was glancing at it:

I notice that you clear m_highUTF16Surrogate after a surrogate pair, and flush 
it out before the "fallback plan". The former makes sense. The latter ... Well, 
ill-formed UTF16 isn't supposed to occur, so that combination really shouldn't 
happen. If it does, I'm not sure writing high out as a numeric character 
reference makes sense as anything but an error indication, in which case I'd be 
tempted to write it as &ISOLATED_HIGH_SURROGATE_nnnn; to make clear that this 
is what's going on.

If we're going to assume that isolated surrogates are possible at all, your 
code risks combining them into a single character that never actually existed, 
since the high-surrogate cache isn't cleared until it's used. That could be 
hard to diagnose. ("Spooky action at a distance"?) I dislike spending the 
cycles, but we might want to make sure the high surrogate doesn't outlast the 
next UTF16 unit even if it isn't a low surrogate.

Or we can assert that providing correct UTF16 input is the responsibility of 
the users, and  sweep the whole issue of isolated surrogates under the carpet.


One more thought: Do we really need to construct a Character to cache a 
surrogate? Couldn't we just stash the numeric value (unsigned short?), with 0 
acting as the "none" case rather than null?  Object churn is generally a Bad 
thing in inner loops.

> Possible buffer-boundry issue when serializing surrogate pairs
> --------------------------------------------------------------
>
>                 Key: XALANJ-2725
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2725
>             Project: XalanJ2
>          Issue Type: Improvement
>      Security Level: No security risk; visible to anyone(Ordinary problems in 
> Xalan projects.  Anybody can view the issue.) 
>          Components: Serialization
>            Reporter: Joe Kesselman
>            Assignee: Joe Kesselman
>            Priority: Major
>              Labels: Surrogates, escaping, unicode, utf
>         Attachments: astral-chars-split-buffer.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a 
> surrogate pair (two UTF-16 units), were not being serialized correctly. We 
> have a proposed fix for that.
> There is reported to still be an edge case when a surrogate pair which 
> crosses buffer boundaries might not be handled correctly. [~maxfortun] 
> offered what looks like a reasonable proposed fix 
> (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607),
>  but in my testing this was not serializing the surrogate pairs correctly, 
> causing regression on the tests XALANJ-2419 introduced. I don't know whether 
> that's because we're taking multiple paths through
> But the edge case does appear to be real, and if so we will need some such 
> solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org

[jira] [Commented] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs

Reply via email to