[ https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810175#comment-17810175 ]
Joe Kesselman commented on XALANJ-2725: --------------------------------------- Unfortunately, no, the patch doesn't seem to be working for me. For both ToXMLStreamTest and ToHTMLStreamTest, the UTF-8 pass reports: {code:java} <checkresult result="Fail" desc="Astral characters should come out unscathed"/>{code} {{{} {}}}I should admit that I don't, at first glance, see +why+ it's failing; might be time to fire up the debugger and watch it fail.{{{} {}}} ---- Note that we have some annoyingly parallel solutions in this class – isHighSurrogate() is tested five different places under multiple serializing loops, One of them, in the *characters* method, actually does admit that there's a buffer bounds risk in its look-ahead; that's the one at line 1598 of my copy: {{ }} {code:java} else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && Encodings.isLowUTF16Surrogate(chars[i+1])) {{code} though it doesn't do more than dump the surrogates as Numeric Character Entities when the boundary is crossed. The others (two in *writeNormalizedCharacters,* one in {*}accumDefaultEscape{*}, one in {*}writeAttrString{*}) appear to blithely assume that if buffer division is possible, that's been done on Unicode Character, rather than UTF16 unit, boundaries. Which is what I would expect to happen in most of our code. But mistakes get made, and the serializer APIs can be invoked from code other than Xalan, so guarding against this isn't an unreasonable idea. In fact, that would probably be the right way to test this – write unit-test code that exercises ToStream directly, rather than trying to do so from the functional-test level. ---- Some off-the-cuff code review on your patch while I was glancing at it: I notice that you clear m_highUTF16Surrogate after a surrogate pair, and flush it out before the "fallback plan". The former makes sense. The latter ... Well, ill-formed UTF16 isn't supposed to occur, so that combination really shouldn't happen. If it does, I'm not sure writing high out as a numeric character reference makes sense as anything but an error indication, in which case I'd be tempted to write it as &ISOLATED_HIGH_SURROGATE_nnnn; to make clear that this is what's going on. If we're going to assume that isolated surrogates are possible at all, your code risks combining them into a single character that never actually existed, since the high-surrogate cache isn't cleared until it's used. That could be hard to diagnose. ("Spooky action at a distance"?) I dislike spending the cycles, but we might want to make sure the high surrogate doesn't outlast the next UTF16 unit even if it isn't a low surrogate. Or we can assert that providing correct UTF16 input is the responsibility of the users, and sweep the whole issue of isolated surrogates under the carpet. One more thought: Do we really need to construct a Character to cache a surrogate? Couldn't we just stash the numeric value (unsigned short?), with 0 acting as the "none" case rather than null? Object churn is generally a Bad thing in inner loops. > Possible buffer-boundry issue when serializing surrogate pairs > -------------------------------------------------------------- > > Key: XALANJ-2725 > URL: https://issues.apache.org/jira/browse/XALANJ-2725 > Project: XalanJ2 > Issue Type: Improvement > Security Level: No security risk; visible to anyone(Ordinary problems in > Xalan projects. Anybody can view the issue.) > Components: Serialization > Reporter: Joe Kesselman > Assignee: Joe Kesselman > Priority: Major > Labels: Surrogates, escaping, unicode, utf > Attachments: astral-chars-split-buffer.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a > surrogate pair (two UTF-16 units), were not being serialized correctly. We > have a proposed fix for that. > There is reported to still be an edge case when a surrogate pair which > crosses buffer boundaries might not be handled correctly. [~maxfortun] > offered what looks like a reasonable proposed fix > (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607), > but in my testing this was not serializing the surrogate pairs correctly, > causing regression on the tests XALANJ-2419 introduced. I don't know whether > that's because we're taking multiple paths through > But the edge case does appear to be real, and if so we will need some such > solution. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org For additional commands, e-mail: dev-h...@xalan.apache.org