[
https://issues.apache.org/jira/browse/XALANC-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829379#comment-17829379
]
Daehyeon Kim commented on XALANC-743:
-
This appears to be very similar to the issue reported in XALANJ-2419.
If the input to a transformation contains Unicode supplementary characters,
the output is corrupted. I confirmed that Xalan-C++ versions 1.10 through
1.12 all have the same problem.
I fixed this problem. It appears with both the UTF-16 and the UTF-8 output
encoding. With UTF16Writer, supplementary characters are converted to broken
characters such as "??". With UTF8Writer the problem is more serious: you do
not even get broken characters, and the transformation produces no output at
all. This suggests that the problem lies at a higher level than
serialization, and I found something suspicious in the XPath
FunctionSubstring implementation. UTF16Writer and UTF8Writer themselves are
not at fault.
XPath FunctionSubstring takes character index positions as arguments, and a
surrogate pair needs to be counted as one character (see the XPath
Recommendation: [https://www.w3.org/TR/xpath-functions-31/#func-substring]).
So each character index has to be mapped to a position in the string buffer
with every surrogate pair counted as a single character. Currently, however,
Xalan truncates the string buffer using the character index arguments
directly as buffer positions. This corrupts the truncated string whenever the
string buffer contains a surrogate pair.
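
To illustrate the index mapping described above, here is a minimal sketch in
standard C++. It is not the code from the pull request and does not use the
Xalan-C++ API: std::u16string stands in for the Xalan string buffer, all
function names are hypothetical, and the start index is zero-based with no
XPath-style rounding of the numeric arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only (not Xalan-C++ functions).
inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Starting at code unit offset `pos`, step over up to `count` characters
// (code points) and return the resulting code unit offset. A well-formed
// surrogate pair is stepped over as one character.
std::size_t advanceCodePoints(const std::u16string& s,
                              std::size_t pos,
                              std::size_t count)
{
    assert(pos <= s.size());
    while (count > 0 && pos < s.size()) {
        const bool pair = isHighSurrogate(s[pos]) &&
                          pos + 1 < s.size() &&
                          isLowSurrogate(s[pos + 1]);
        pos += pair ? 2 : 1;
        --count;
    }
    return pos;
}

// Substring by character index: `start` and `length` are counted in code
// points, so a surrogate pair is never split.
std::u16string substringByCodePoints(const std::u16string& s,
                                     std::size_t start,
                                     std::size_t length)
{
    const std::size_t first = advanceCodePoints(s, 0, start);
    const std::size_t last  = advanceCodePoints(s, first, length);
    return s.substr(first, last - first);
}
```

For a string such as u"a\U0001F600b" (four code units, three characters),
taking one character at character index 1 with this sketch returns the whole
surrogate pair, whereas a plain s.substr(1, 1) on code units returns only the
high surrogate.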
The same applies to this issue. If a string containing supplementary
characters is incorrectly truncated in FunctionSubstring, lone surrogate code
units (which cannot appear in UTF-8) can unexpectedly reach UTF8Writer. (This
can also be seen from the comment in UTF8Writer: "// We should never get a
high or low surrogate here...".)
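
As a rough illustration of how a badly truncated string ends up with a
surrogate that a UTF-8 serializer cannot encode, the following sketch scans a
UTF-16 string for a surrogate code unit that is not part of a well-formed
pair. The names are hypothetical and std::u16string is only a stand-in; this
is not how UTF8Writer itself is implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only (not Xalan-C++ functions).
inline bool isLead(char16_t c)  { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isTrail(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Return the offset of the first surrogate code unit that is not part of a
// well-formed lead/trail pair, or npos if the string has none.
std::size_t findUnpairedSurrogate(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isLead(s[i])) {
            if (i + 1 < s.size() && isTrail(s[i + 1])) {
                ++i;        // well-formed pair: skip the trail unit as well
                continue;
            }
            return i;       // lead surrogate without a following trail
        }
        if (isTrail(s[i])) {
            return i;       // trail surrogate without a preceding lead
        }
    }
    return std::u16string::npos;
}
```

Applied to u"a\U0001F600b", the check finds nothing in the full string, but
cutting the string at code unit offset 2 leaves a lone lead surrogate at the
end of the first piece and a lone trail surrogate at the start of the second.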
So I changed XPath FunctionSubstring to take surrogate pairs into account
when it computes string positions and lengths. With that change, this issue
is resolved as well. Please refer to the pull request to see the changed code.
> XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till
> out of memory
> ---
>
> Key: XALANC-743
> URL: https://issues.apache.org/jira/browse/XALANC-743
> Project: XalanC
> Issue Type: Bug
> Components: XalanC
>Affects Versions: 1.10
> Environment: Linux
>Reporter: Jiangbei Fan
>Assignee: Steven J. Hathaway
>Priority: Major
>
> In some rare cases, XalanTransformer::transform would stuck or crash when the
> input/stylesheet contains 4-byte unicode. And I traced down the root cause in
> XalanOutputStream::transcode
> When the transcode buffer contains unicode of size 4 bytes, and the last
> XalanDOMChar in the buffer is the first 2 bytes of a 4-byte unicode char. The
> XalanOutputStream::transcode will fall into an infinite loop till it is out
> of memory. As XMLUTF8Transcoder.cpp in xerces will not consume the last
> 2-bytes if it is part of 4 byte unicode. And transcode always loop until all
> chars in the buffer is eaten. Specifically this will happen when the last
> XalanDOMChar in the input buffer is between 0xD800 and 0xDBFF.
> I cannot find whether this issue has been reported before. This is version
> 1.10. I do have a fix to add a bool reference to the function, so that the
> caller can push the last 2 byte back to the buffer if not consumed. But want
> to check it out before submit any fixes.