Hi all,
I have been facing issues with ResettableFileInputStream when trying to
read Unicode characters outside the Basic Multilingual Plane.
Currently, when a character that decodes to a surrogate pair appears in
the byte stream, the decoder fails to decode it and the implementation
considers the stream to have reached its end, stops reading, and
returns -1.
Although the code shows some awareness of surrogate pairs, the current
implementation does not handle them correctly, and no unit test covers
this case specifically.
After some investigation, I found that the cause boils down to the fact
that bytes are decoded char by char, using a CharBuffer of capacity 1.
However, the two chars forming a surrogate pair can only be decoded in a
single pass; as a consequence, the CharBuffer passed to the decoder must
have at least 2 slots remaining. Otherwise nothing is decoded and no error
is raised: the decoder just reports an overflow (CoderResult.OVERFLOW)
without consuming any bytes. I'm attaching a small test that demonstrates
this.
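The silent failure becomes visible when the CoderResult returned by decode() is inspected. Below is a minimal sketch, separate from the attached test: the class name OverflowCheck and the helper decode(int) are made up for illustration, and it assumes Java 7+ for StandardCharsets.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of ResettableFileInputStream.
public class OverflowCheck {

    // U+1F618 encoded as UTF-8 (f0 9f 98 98)
    static final byte[] KISS = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x98};

    /** Decodes KISS into a CharBuffer of the given capacity and returns the decoder's result. */
    static CoderResult decode(int capacity) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bb = ByteBuffer.wrap(KISS);
        CharBuffer cb = CharBuffer.allocate(capacity);
        return decoder.decode(bb, cb, true);
    }

    public static void main(String[] args) {
        // One slot: the pair does not fit, so the decoder reports OVERFLOW, which is not an error.
        CoderResult one = decode(1);
        System.out.println("overflow=" + one.isOverflow() + ", error=" + one.isError());

        // Two slots: the pair fits and the input is fully consumed, so the result is UNDERFLOW.
        CoderResult two = decode(2);
        System.out.println("underflow=" + two.isUnderflow() + ", error=" + two.isError());
    }
}
```

Since isError() is false in both cases, code that only checks for errors cannot tell the overflow apart from a normal end of input.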
Has anyone experienced the same problem?
If this proves to be a bug, I could open a ticket and provide a patch with
unit tests to fix it.
Regards,
Alexandre Dutra
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class SurrogateDecodeTest {

    /** This is U+1F618 (UTF-8: f0 9f 98 98) FACE THROWING A KISS. */
    public static final byte[] FACE_THROWING_A_KISS =
            new byte[]{(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x98};

    public static void main(String[] args) {
        // this will print "bb 4, cb 0" -> byte buffer has not been consumed,
        // char buffer has not been filled
        decode(1);
        // this will print "bb 0, cb 2" -> byte buffer has been fully consumed,
        // char buffer has been filled with two chars
        decode(2);
    }

    private static void decode(int size) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        ByteBuffer bb = ByteBuffer.wrap(FACE_THROWING_A_KISS);
        CharBuffer cb = CharBuffer.allocate(size);
        decoder.decode(bb, cb, true);
        cb.flip();
        System.out.println(String.format("bb %s, cb %s", bb.remaining(), cb.remaining()));
    }
}
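
One possible direction for a fix, sketched under stated assumptions: decode into a CharBuffer with two slots and hand the chars out one at a time. The class PairAwareReader and its readChar() method are hypothetical and are not taken from ResettableFileInputStream. The sketch reads from an in-memory byte array rather than a file, and it ignores mark/reset handling entirely.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
public class PairAwareReader {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final ByteBuffer bytes;
    // Two slots, so both halves of a surrogate pair fit in a single decode pass.
    private final CharBuffer chars = CharBuffer.allocate(2);

    public PairAwareReader(byte[] input) {
        this.bytes = ByteBuffer.wrap(input);
        chars.flip(); // start with an empty, readable buffer
    }

    /** Returns the next UTF-16 char, or -1 once the input is exhausted. */
    public int readChar() {
        if (!chars.hasRemaining()) {
            chars.clear();
            CoderResult result = decoder.decode(bytes, chars, true);
            if (result.isError()) {
                throw new IllegalStateException("Undecodable input: " + result);
            }
            chars.flip();
            if (!chars.hasRemaining()) {
                return -1; // nothing decoded into an empty 2-slot buffer: real end of input
            }
        }
        return chars.get();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F618 (UTF-8: f0 9f 98 98)
        byte[] input = {(byte) 0x61, (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x98};
        PairAwareReader reader = new PairAwareReader(input);
        int c;
        while ((c = reader.readChar()) != -1) {
            System.out.println(Integer.toHexString(c)); // prints 61, d83d, de18
        }
    }
}
```

With two slots available, an overflow can only mean that at least one char was already written to the buffer, so an empty buffer after decode() reliably signals the end of the input.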