[ https://issues.apache.org/jira/browse/IO-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267534#comment-17267534 ]
Gary D. Gregory commented on IO-638:
------------------------------------
[~thayne2]
Thank you for your report.
Please feel free to provide a PR on GitHub with a unit test.
> Infinite loop in CharSequenceInputStream.read for 4-byte characters with
> UTF-8 and 3-byte buffer.
> -------------------------------------------------------------------------------------------------
>
> Key: IO-638
> URL: https://issues.apache.org/jira/browse/IO-638
> Project: Commons IO
> Issue Type: Bug
> Components: Streams/Writers
> Affects Versions: 2.6
> Reporter: Thayne McCombs
> Priority: Major
>
> In the constructor of `CharSequenceInputStream` there is the following code
> to ensure the buffer is large enough to hold one character:
> {code:java}
> // Ensure that buffer is long enough to hold a complete character
> final float maxBytesPerChar = encoder.maxBytesPerChar();
> if (bufferSize < maxBytesPerChar) {
>     throw new IllegalArgumentException("Buffer size " + bufferSize
>             + " is less than maxBytesPerChar " + maxBytesPerChar);
> }
> {code}
> However, for UTF-8, `maxBytesPerChar` returns 3.0, not 4.0, even though some
> characters (such as emoji) require 4 bytes to encode. As a result, you can
> create a `CharSequenceInputStream` with a buffer size of 3, but when it
> attempts to fill the buffer with a 4-byte character, `CharsetEncoder.encode`
> returns an OVERFLOW result without writing anything to the buffer. This in
> turn results in an infinite loop in the read methods, since the buffer never
> has anything written to it.
>
> NOTE: as I understand it, the encoder returns 3 and not 4 because 3 is the
> maximum number of bytes that a single Java `char` can encode to, since a
> 4-byte encoding in UTF-8 requires a surrogate pair of two `char`s.
>
> This may be a problem for other encodings as well, but I've only tested it
> for UTF-8.
>
> Requiring the buffer size to be at least twice `maxBytesPerChar` would ensure
> this doesn't happen, since a complete surrogate pair would then always fit.
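
The encoder behaviour described in the report can be checked in isolation, without Commons IO. The following is a minimal sketch, not the project's code or test: the class name `Utf8OverflowDemo` and the helper `bytesWritten` are hypothetical, and it uses only `java.nio.charset` calls from the JDK. It encodes a single emoji (one surrogate pair, 4 bytes in UTF-8) into a 3-byte buffer and then into a 6-byte buffer, which is twice the value `maxBytesPerChar` reports for UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the JDK encoder behaviour
// that the report describes and does not touch CharSequenceInputStream.
final class Utf8OverflowDemo {

    // Encodes text into a fresh buffer of bufferSize bytes using a new
    // UTF-8 encoder and returns the number of bytes actually written.
    public static int bytesWritten(String text, int bufferSize) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        CoderResult result = encoder.encode(CharBuffer.wrap(text), out, true);
        System.out.println("bufferSize=" + bufferSize
                + " overflow=" + result.isOverflow()
                + " bytesWritten=" + out.position());
        return out.position();
    }

    public static void main(String[] args) {
        // U+1F600 as a surrogate pair: two Java chars, four UTF-8 bytes.
        String emoji = "\uD83D\uDE00";

        // Worst case per single char, which is why a 3-byte buffer is accepted.
        System.out.println("maxBytesPerChar="
                + StandardCharsets.UTF_8.newEncoder().maxBytesPerChar());

        // The pair does not fit in 3 bytes, so nothing is written at all.
        bytesWritten(emoji, 3);

        // With twice maxBytesPerChar the whole pair fits.
        bytesWritten(emoji, 6);
    }
}
```

Because the 3-byte case makes no progress, a caller that simply retries the encode with the same buffer would spin forever, which matches the reported symptom. A unit test for the fix could presumably assert the same two conditions against `CharSequenceInputStream` itself.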
--
This message was sent by Atlassian Jira
(v8.3.4#803005)