date:20240320

[jira] [Commented] (XALANC-743) XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till out of memory

2024-03-20 Thread Daehyeon Kim (Jira)



[ 
https://issues.apache.org/jira/browse/XALANC-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829379#comment-17829379
 ] 

Daehyeon Kim commented on XALANC-743:
-

This appears to be very similar to the issue reported in XALANJ-2419. 
If the input contains Unicode supplementary characters, the transformed 
output is corrupted. I also confirmed that Xalan-C++ versions 1.10 through 
1.12 all have the same problem.
 
I fixed this problem. It appears with both UTF-16 and UTF-8 output encoding. 
With UTF16Writer, supplementary characters are converted to broken characters 
such as "??". With UTF8Writer the problem is more serious: you do not even get 
broken characters, and the transform produces no output at all. This suggests 
the problem lies at a higher level than serialization, and I found something 
suspicious in the XPath FunctionSubstring implementation. UTF16Writer and 
UTF8Writer themselves are not at fault.
 
XPath FunctionSubstring takes character index positions as arguments, and a 
surrogate pair must be counted as one character (per the XPath 
Recommendation, [https://www.w3.org/TR/xpath-functions-31/#func-substring]). 
So each character index has to be mapped to a position in the string buffer 
with surrogate pairs taken into account. Currently, however, Xalan truncates 
the string buffer using the character index arguments directly as buffer 
positions. This corrupts the truncated string when a surrogate pair is in the 
string buffer.
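
The mapping from character index to buffer position described above can be sketched as follows. This is only an illustration, not the code in the pull request: it uses std::u16string in place of Xalan's string class, the helper names (advanceCharacters, substringByCharacters) are hypothetical, and it takes zero-based indices for simplicity, whereas the XPath substring function is one-based and rounds its arguments.

```cpp
#include <cstddef>
#include <string>

// True if `u` is a high (leading) surrogate code unit.
inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if `u` is a low (trailing) surrogate code unit.
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: starting at code-unit position `offset`, step forward
// over `charCount` characters and return the resulting code-unit position.
// A well-formed surrogate pair counts as one character but two code units.
// Stops at the end of the string.
std::size_t advanceCharacters(const std::u16string& s,
                              std::size_t offset,
                              std::size_t charCount)
{
    while (charCount > 0 && offset < s.size())
    {
        const bool pair = isHighSurrogate(s[offset])
                          && offset + 1 < s.size()
                          && isLowSurrogate(s[offset + 1]);
        offset += pair ? 2 : 1;
        --charCount;
    }
    return offset;
}

// Hypothetical helper: return `charCount` characters of `s`, starting at the
// zero-based character index `charStart`, without splitting a surrogate pair.
std::u16string substringByCharacters(const std::u16string& s,
                                     std::size_t charStart,
                                     std::size_t charCount)
{
    const std::size_t begin = advanceCharacters(s, 0, charStart);
    const std::size_t end = advanceCharacters(s, begin, charCount);
    return s.substr(begin, end - begin);
}
```

For the string "a", U+1F600, "b" (four UTF-16 code units), asking for one character at character index 1 returns both halves of the pair, whereas using the index directly as a buffer position would cut the pair in two.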
 
The same goes for this issue. If a string containing supplementary characters 
is incorrectly truncated in FunctionSubstring, an unpaired surrogate (which 
cannot be encoded in UTF-8) may unexpectedly reach UTF8Writer. (This can also 
be seen from the assertion in UTF8Writer, "// We should never get a high or 
low surrogate here...".)
So I fixed the XPath FunctionSubstring to count the string data length with 
surrogate pairs taken into account. With that change, this issue is resolved 
as well.
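
To illustrate how a truncation by buffer position leaves a lone surrogate behind, here is a small check for unpaired surrogates. It is a sketch under assumptions: std::u16string stands in for Xalan's string class, and hasUnpairedSurrogate is a hypothetical name, not a Xalan function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` contains a high surrogate that is not
// followed by a low surrogate, or a low surrogate that is not preceded by a
// high surrogate. Such a string cannot be serialized as well-formed UTF-8.
bool hasUnpairedSurrogate(const std::u16string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF)
        {
            // High surrogate: the next code unit must be a low surrogate.
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return true;
            i += 2;
        }
        else if (u >= 0xDC00 && u <= 0xDFFF)
        {
            // Low surrogate with no high surrogate in front of it.
            return true;
        }
        else
        {
            ++i;
        }
    }
    return false;
}
```

Taking the first two code units of "a", U+1F600, "b" with a plain substr(0, 2) gives a string for which this check returns true; the full string passes.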
 
Please refer to the pull request to see the changed code.

> XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till 
> out of memory
> ---
>
>                 Key: XALANC-743
>                 URL: https://issues.apache.org/jira/browse/XALANC-743
>             Project: XalanC
>          Issue Type: Bug
>          Components: XalanC
>    Affects Versions: 1.10
>         Environment: Linux
>            Reporter: Jiangbei Fan
>            Assignee: Steven J. Hathaway
>            Priority: Major
>
> In some rare cases, XalanTransformer::transform gets stuck or crashes when 
> the input/stylesheet contains 4-byte Unicode characters. I traced the root 
> cause to XalanOutputStream::transcode.
> When the transcode buffer contains 4-byte Unicode characters and the last 
> XalanDOMChar in the buffer is the first 2 bytes of a 4-byte character, 
> XalanOutputStream::transcode falls into an infinite loop until it runs out 
> of memory. This is because XMLUTF8Transcoder.cpp in Xerces does not consume 
> the last 2 bytes if they are part of a 4-byte character, while transcode 
> always loops until all chars in the buffer are consumed. Specifically, this 
> happens when the last XalanDOMChar in the input buffer is between 0xD800 and 
> 0xDBFF.
> I could not find whether this issue has been reported before. This is 
> version 1.10. I have a fix that adds a bool reference to the function so 
> that the caller can push the last 2 bytes back to the buffer if they are not 
> consumed, but I want to check before submitting any fixes.
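
The loop condition described in the quoted report, and one way of avoiding it, can be sketched like this. It is not Xalan or Xerces code: transcodeChunk is a toy stand-in that, like the transcoder described above, leaves a trailing high surrogate unconsumed, and transcodeAll is a hypothetical caller that carries the unconsumed tail into the next chunk instead of retrying the same buffer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Append the UTF-8 encoding of code point `cp` to `out`.
void appendUtf8(char32_t cp, std::string& out)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Toy transcoder (assumption, not the Xerces API): encodes `in` as UTF-8 and
// returns the number of code units consumed. A high surrogate in the last
// position is NOT consumed, because its low half has not arrived yet.
std::size_t transcodeChunk(const std::u16string& in, std::string& out)
{
    std::size_t i = 0;
    while (i < in.size()) {
        const char16_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            if (i + 1 == in.size())
                break;                              // incomplete pair: stop here
            const char16_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                appendUtf8(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00), out);
                i += 2;
                continue;
            }
            appendUtf8(0xFFFD, out);                // unpaired high surrogate
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            appendUtf8(0xFFFD, out);                // unpaired low surrogate
        } else {
            appendUtf8(u, out);
        }
        ++i;
    }
    return i;
}

// Hypothetical caller: feeds chunks to the transcoder and keeps whatever was
// not consumed in `pending`, so a split pair is completed by the next chunk.
std::string transcodeAll(const std::vector<std::u16string>& chunks)
{
    std::string out;
    std::u16string pending;
    for (const std::u16string& chunk : chunks) {
        pending += chunk;
        pending.erase(0, transcodeChunk(pending, out));
    }
    if (!pending.empty())
        appendUtf8(0xFFFD, out);                    // input ended mid-pair
    return out;
}
```

A caller that instead loops on the same buffer until everything is consumed never terminates when the buffer ends in a code unit between 0xD800 and 0xDBFF, which is the situation the report describes.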



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org

[PR] XALANC-743: transform with broken supplementary characters [xalan-c]

2024-03-20 Thread via GitHub



yhyacinth opened a new pull request, #42:
URL: https://github.com/apache/xalan-c/pull/42

   * Fix FunctionSubstring implementation to count the string data length 
with surrogate pairs taken into account
   
   If the input contains Unicode supplementary characters, the transformed 
output is corrupted. I also confirmed that Xalan-C++ versions 1.10 through 
1.12 all have the same problem.
   
   I found something suspicious in the XPath FunctionSubstring implementation 
and fixed it.
   
   Please refer to the issue tracker for a detailed description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


