[jira] [Comment Edited] (XALANC-743) XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till out of memory

2024-04-01 Thread Daehyeon Kim (Jira)


[ 
https://issues.apache.org/jira/browse/XALANC-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829379#comment-17829379
 ] 

Daehyeon Kim edited comment on XALANC-743 at 4/2/24 5:51 AM:
-

I ran into this problem as well. It appears to be very similar to the issue 
reported in XALANJ-2419: if the input contains Unicode supplementary 
characters, the transformed output is corrupted. I also confirmed that 
Xalan-C++ versions 1.10 through 1.12 all have the same problem.
 
 
This problem appears with both UTF-16 and UTF-8 output encodings. With 
UTF16Writer, supplementary characters are converted to broken characters such 
as "??". With UTF8Writer the problem is more serious: the transform produces no 
output at all. This suggests the fault lies at a higher level than 
serialization, and the XPath FunctionSubstring implementation looks 
suspicious. UTF16Writer and UTF8Writer themselves are not at fault.
 
 
XPath FunctionSubstring takes character index positions as arguments, and a 
surrogate pair needs to be counted as one character (as per the XPath 
Recommendation). The character indexes therefore have to be mapped to string 
buffer positions with each surrogate pair counted as a single character. 
Currently, Xalan truncates the string buffer using the character-index 
arguments directly as buffer positions. This corrupts the truncated string 
when a surrogate pair is in the string buffer.
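
The counting described above can be sketched as follows. This is only an 
illustration, not the code from the pull request: the helper names 
(isHighSurrogate, advanceChars, substringByChars) are hypothetical, 
std::u16string stands in for Xalan's string type, and the start index is 
0-based for simplicity, whereas XPath's substring() is 1-based and rounds its 
arguments.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; these are not Xalan-C++ APIs.
inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Starting at code-unit offset `pos`, step forward over `count` characters,
// treating a well-formed surrogate pair as one character.
// Returns the resulting code-unit offset (clamped to the end of the string).
std::size_t advanceChars(const std::u16string& s, std::size_t pos, std::size_t count)
{
    while (count > 0 && pos < s.size()) {
        const bool pair = isHighSurrogate(s[pos]) &&
                          pos + 1 < s.size() &&
                          isLowSurrogate(s[pos + 1]);
        pos += pair ? 2 : 1;
        --count;
    }
    return pos;
}

// Substring by character index (0-based here) and character length.
// The cut points always fall between characters, never inside a pair.
std::u16string substringByChars(const std::u16string& s,
                                std::size_t startChar,
                                std::size_t lengthChars)
{
    const std::size_t begin = advanceChars(s, 0, startChar);
    const std::size_t end   = advanceChars(s, begin, lengthChars);
    return s.substr(begin, end - begin);
}
```

For the four-code-unit string "a", U+1F600, "b", taking one character at 
character index 1 returns the complete pair, whereas a plain code-unit 
substr(1, 1) would return only the high surrogate.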
 
 
The same applies to this issue. If a string containing supplementary characters 
is truncated incorrectly in FunctionSubstring, surrogate code units (which 
cannot appear in UTF-8) may unexpectedly reach UTF8Writer (as seen at the 
assertion in UTF8Writer, "// We should never get a high or low surrogate 
here..."). Fixing XPath FunctionSubstring to compute the string data length 
with surrogate pairs taken into account therefore resolves this issue as well.
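
To illustrate why an unpaired surrogate is a problem for a UTF-8 serializer, 
here is a minimal standalone UTF-16 to UTF-8 converter. It is a sketch under 
my own assumptions and not the UTF8Writer code: the function name utf16ToUtf8 
is hypothetical, and it simply reports failure where a real writer would have 
to decide how to react.

```cpp
#include <cstddef>
#include <string>

// Hypothetical converter for illustration; not part of Xalan-C++.
// Returns false if the input contains an unpaired surrogate, because a lone
// surrogate has no valid UTF-8 encoding.
bool utf16ToUtf8(const std::u16string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return false;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return false;  // low surrogate without a preceding high surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return true;
}
```

A complete pair is combined into one supplementary code point and written as 
four bytes. A string cut in the middle of a pair leaves the converter with 
nothing valid to write, which matches the situation the UTF8Writer comment 
says should never occur.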
 
 
Please refer to the pull request for the changed code.



> XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till 
> out of memory
> ---
>
> Key: XALANC-743
> URL: https://issues.apache.org/jira/browse/XALANC-743
> Project: XalanC
> Issue Type: Bug
> Components: XalanC
> Affects Versions: 1.10
> Environment: Linux
> Reporter: Jiangbei Fan
> Assignee: Steven J. Hathaway
> Priority: Major
>
> In some rare cases, XalanTransformer::transform gets stuck or crashes when 
> the input/stylesheet contains 4-byte Unicode characters. I traced the root 
> cause to XalanOutputStream::transcode.
> When the transcode buffer contains a 4-byte Unicode character and the last 
> XalanDOMChar in the buffer is the first 2 bytes of that character, 
> XalanOutputStream::transcode falls into an infinite loop until it runs out 
> of memory. This is because XMLUTF8Transcoder.cpp in Xerces does not consume 
> the last 2 bytes if they are part of a 4-byte character, while transcode 
> always loops until all 
> chars 

[jira] [Commented] (XALANC-743) XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till out of memory

2024-03-20 Thread Daehyeon Kim (Jira)


[ 
https://issues.apache.org/jira/browse/XALANC-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829379#comment-17829379
 ] 

Daehyeon Kim commented on XALANC-743:
-

This appears to be very similar to the issue reported in XALANJ-2419. If the 
input contains Unicode supplementary characters, the transformed output is 
corrupted. I also confirmed that Xalan-C++ versions 1.10 through 1.12 all have 
the same problem.
 
I fixed this problem. It appears with both UTF-16 and UTF-8 output encodings. 
With UTF16Writer, supplementary characters are converted to broken characters 
such as "??". With UTF8Writer the problem is more serious: you do not even get 
broken characters, and the transform produces no output at all. This suggests 
a problem at a higher level than serialization, and I found something 
suspicious in the XPath FunctionSubstring implementation. UTF16Writer and 
UTF8Writer themselves are not at fault.
 
XPath FunctionSubstring takes character index positions as arguments, and a 
surrogate pair needs to be counted as one character (per the XPath 
Recommendation ([https://www.w3.org/TR/xpath-functions-31/#func-substring])). 
So we need to map the character indexes to string buffer positions with each 
surrogate pair counted as one character. But Xalan currently truncates the 
string buffer using the character-index arguments directly as buffer 
positions. This corrupts the truncated string when a surrogate pair is in the 
string buffer.
 
The same goes for this issue. If a string containing supplementary characters 
is truncated incorrectly in FunctionSubstring, surrogate code units (which 
cannot appear in UTF-8) may unexpectedly reach UTF8Writer (this can also be 
seen at the assertion in UTF8Writer, "// We should never get a high or low 
surrogate here...").
So I fixed XPath FunctionSubstring to compute the string data length with 
surrogate pairs taken into account, which resolves this issue as well.
 
Please refer to the pull request to see the changed code.

> XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till 
> out of memory
> ---
>
> Key: XALANC-743
> URL: https://issues.apache.org/jira/browse/XALANC-743
> Project: XalanC
> Issue Type: Bug
> Components: XalanC
> Affects Versions: 1.10
> Environment: Linux
> Reporter: Jiangbei Fan
> Assignee: Steven J. Hathaway
> Priority: Major
>
> In some rare cases, XalanTransformer::transform gets stuck or crashes when 
> the input/stylesheet contains 4-byte Unicode characters. I traced the root 
> cause to XalanOutputStream::transcode.
> When the transcode buffer contains a 4-byte Unicode character and the last 
> XalanDOMChar in the buffer is the first 2 bytes of that character, 
> XalanOutputStream::transcode falls into an infinite loop until it runs out 
> of memory. This is because XMLUTF8Transcoder.cpp in Xerces does not consume 
> the last 2 bytes if they are part of a 4-byte character, while transcode 
> always loops until all chars in the buffer are consumed. Specifically, this 
> happens when the last XalanDOMChar in the input buffer is between 0xD800 and 
> 0xDBFF.
> I could not find whether this issue has been reported before. This is version 
> 1.10. I have a fix that adds a bool reference parameter to the function so 
> that the caller can push the last 2 bytes back into the buffer if they were 
> not consumed, but I want to check before submitting any fixes.
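
The behaviour described in the quoted issue, and the general idea of pushing 
the unconsumed unit back, can be sketched as follows. This is an illustration 
under stated assumptions, not the reporter's patch and not the Xerces-C++ or 
Xalan-C++ code: transcodeSome and flushPending are hypothetical names, 
UTF-32 output stands in for the real target encoding to keep the example 
short, and the input is assumed to be well formed apart from a pair being 
split across buffers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a transcoder. Converts UTF-16 code units to
// UTF-32 and returns how many input units it consumed. A high surrogate
// (0xD800..0xDBFF) at the very end of the input is NOT consumed, because
// its low surrogate has not arrived yet.
std::size_t transcodeSome(const std::u16string& in, std::u32string& out)
{
    std::size_t i = 0;
    while (i < in.size()) {
        const char32_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            if (i + 1 == in.size())
                break;  // incomplete pair: wait for more input
            // Assumes in[i + 1] is a low surrogate (well-formed input).
            const char32_t low = in[i + 1];
            out.push_back(0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00));
            i += 2;
        } else {
            out.push_back(c);
            ++i;
        }
    }
    return i;
}

// Transcode whatever is complete and keep the remainder in `pending` for the
// next call. A loop of the form "while (!pending.empty()) transcodeSome(...)"
// would never terminate when only a dangling high surrogate is left, since
// each pass consumes zero units.
void flushPending(std::u16string& pending, std::u32string& out)
{
    const std::size_t consumed = transcodeSome(pending, out);
    pending.erase(0, consumed);  // at most one high surrogate remains
}
```

With this arrangement the caller appends the next chunk to the pending 
buffer, so the dangling high surrogate is joined with its low surrogate on 
the following flush.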


