[jira] [Created] (IO-638) Infinite loop in CharSequenceInputStream.read for 4-byte characters with UTF-8 and 3-byte buffer.

Thayne McCombs (Jira) Thu, 14 Nov 2019 09:10:32 -0800

Thayne McCombs created IO-638:
---------------------------------

             Summary: Infinite loop in CharSequenceInputStream.read for 4-byte 
characters with UTF-8 and 3-byte buffer.
                 Key: IO-638
                 URL: https://issues.apache.org/jira/browse/IO-638
             Project: Commons IO
          Issue Type: Bug
          Components: Streams/Writers
    Affects Versions: 2.6
            Reporter: Thayne McCombs



In the constructor of `CharSequenceInputStream` there is the following code to 
ensure the buffer is large enough to hold one character:


{code:java}
// Ensure that buffer is long enough to hold a complete character
final float maxBytesPerChar = encoder.maxBytesPerChar();
if (bufferSize < maxBytesPerChar) {
    throw new IllegalArgumentException("Buffer size " + bufferSize + " is less than maxBytesPerChar " +
            maxBytesPerChar);
}
}
{code}
However, for UTF-8, `maxBytesPerChar()` returns 3.0, not 4.0, even though some 
characters (such as emoji) require 4 bytes to encode.  As a result you can 
create a `CharSequenceInputStream` with a buffer size of 3, but when it tries 
to fill the buffer with a 4-byte character, `CharsetEncoder.encode` returns an 
OVERFLOW result without writing anything to the buffer. This in turn results 
in an infinite loop in the read methods, since nothing is ever written to the 
buffer and no input is ever consumed.
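
The encoder behaviour described above can be checked in isolation, without 
Commons IO. The following is a minimal sketch; the class name 
`Utf8OverflowDemo` and its helper methods are made up for illustration, and 
only standard `java.nio` calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons IO.
final class Utf8OverflowDemo {

    /** Returns the per-char maximum advertised by the JDK's UTF-8 encoder. */
    public static float utf8MaxBytesPerChar() {
        return StandardCharsets.UTF_8.newEncoder().maxBytesPerChar();
    }

    /**
     * Encodes text into a fresh buffer of the given size in a single call and
     * returns "<result>:<bytes written>:<chars consumed>".
     */
    public static String encodeOnce(String text, int bufferSize) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        CoderResult result = encoder.encode(in, out, true);
        String kind = result.isOverflow() ? "OVERFLOW"
                : result.isUnderflow() ? "UNDERFLOW" : "ERROR";
        return kind + ":" + out.position() + ":" + in.position();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, needs 4 bytes in UTF-8
        System.out.println(utf8MaxBytesPerChar()); // 3.0
        System.out.println(encodeOnce(emoji, 3));  // OVERFLOW:0:0
        System.out.println(encodeOnce(emoji, 4));  // UNDERFLOW:4:2
    }
}
```

With a 3-byte buffer the call makes no progress on either side, so a caller 
that simply retries with the same buffer will loop forever.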

 

NOTE: as I understand it, the reason the encoder returns 3 and not 4 is that 
3 is the maximum number of bytes that a single Java `char` can produce, since 
a 4-byte encoding in UTF-8 requires a surrogate pair of two `char`s.
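
To illustrate the note, here is a small sketch comparing a BMP character with 
a supplementary one. The class `SurrogatePairDemo` and its methods are 
hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons IO.
final class SurrogatePairDemo {

    /** Number of Java chars (UTF-16 code units) in the string. */
    public static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes the string occupies when encoded as UTF-8. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC";        // U+20AC, a single char
        String emoji = "\uD83D\uDE00"; // U+1F600, a surrogate pair
        System.out.println(charCount(euro) + " char(s), " + utf8Length(euro) + " byte(s)");
        System.out.println(charCount(emoji) + " char(s), " + utf8Length(emoji) + " byte(s)");
    }
}
```

The euro sign is one `char` and 3 bytes, which is the worst case per `char`. 
The emoji is one code point but two `char`s and 4 bytes, and the encoder has 
to emit all 4 bytes together.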

 

This may be a problem for other encodings as well, but I've only tested it 
for UTF-8.

 

Requiring the buffer to be at least twice the maxBytesPerChar would ensure this 
doesn't happen.
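
A rough sketch of what such a check could look like is below. This is only an 
illustration of the suggestion above, not the Commons IO code or an actual 
fix; `BufferSizeCheck`, `minBufferSize` and `requireBufferSize` are 
hypothetical names.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Commons IO.
final class BufferSizeCheck {

    /** Smallest buffer size accepted under the suggested rule (2 x maxBytesPerChar). */
    public static int minBufferSize(CharsetEncoder encoder) {
        return (int) Math.ceil(2 * encoder.maxBytesPerChar());
    }

    /** Rejects buffers too small to hold an encoded surrogate pair. */
    public static void requireBufferSize(CharsetEncoder encoder, int bufferSize) {
        int min = minBufferSize(encoder);
        if (bufferSize < min) {
            throw new IllegalArgumentException(
                    "Buffer size " + bufferSize + " is less than 2 * maxBytesPerChar (" + min + ")");
        }
    }

    public static void main(String[] args) {
        CharsetEncoder utf8 = StandardCharsets.UTF_8.newEncoder();
        System.out.println(minBufferSize(utf8)); // 6 for UTF-8
        requireBufferSize(utf8, 6);              // accepted
        requireBufferSize(utf8, 3);              // throws IllegalArgumentException
    }
}
```

For UTF-8 this gives a minimum of 6 bytes, which is more than the 4 bytes a 
surrogate pair needs, so the bound is conservative rather than tight.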



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
