[
https://issues.apache.org/jira/browse/IO-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585387#comment-13585387
]
Thomas Neidhart edited comment on IO-356 at 2/24/13 3:33 PM:
-------------------------------------------------------------
In the reset method we have to modify also the bbuf variable, as it actually
contains the data to be read.
If you take a look at the fillBuffer method, it actually fills bbuf with as
many data from cbuf as bbuf can hold.
A simple scenario, input string is (AAAAABBBBB), bbuf has a buffer size of 10:
{noformat}
is.mark();
byte[] data = new byte[5];
is.read(data);
{noformat}
data will contain "AAAAA", but bbuf will contain the full input string
(AAAAABBBBB).
When we now call:
{noformat}
is.reset();
is.read(data);
{noformat}
we just reposition cbuf, but we continue to read from bbuf, thus the result
will be "BBBBB".
I think calling bbuf.limit(0) is correct and simple, although it might be
possible to improve it.
Regarding the other failing unit tests:
We do call the encode method with the endOfInput flag always set to true. This
means we have to reset the coder the next time we use it (calling also flush is
advised according to javadoc of CharsetEncoder):
{noformat}
private void fillBuffer() throws CharacterCodingException {
this.bbuf.compact();
this.encoder.reset();
final CoderResult result = this.encoder.encode(this.cbuf, this.bbuf,
true);
this.encoder.flush(bbuf);
if (result.isError()) {
result.throwException();
}
this.bbuf.flip();
}
{noformat}
This was probably introduced as the CharsetEncoder always precedes the data
with with the byte-order mark when using UTF-16 charset.
In that way, the BOM is only output the first time the encoder is called, but
it also means that mark/reset will not work with such charsets, as subsequent
calls will not generate the BOM again.
I do not know how to fix this in a clean way atm, but I would consider the
CharSequenceInputStream for UTF-16 charset as broken and document it.
was (Author: tn):
In the reset method we have to modify also the bbuf variable, as it
actually contains the data to be read.
If you take a look at the fillBuffer method, it actually fills bbuf with as
many data from cbuf as bbuf can hold.
A simple scenario, input string is (AAAAABBBBB), bbuf has a buffer size of 10:
{noformat}
is.mark();
byte[] data = new byte[5];
is.read(data);
{noformat}
data will contain "AAAAA", but bbuf will contain the full input string
(AAAAABBBBB).
When we now call:
{noformat}
is.reset();
is.read(data);
{noformat}
we just reposition cbuf, but we continue to read from bbuf, thus the result
will be "BBBBB".
I think calling bbuf.limit(0) is correct and simple, although it might be
possible to improve it.
Regarding the other failing unit tests:
We do call the encode method with the endOfInput flag always set to true. This
means we have to reset the coder the next time we use it (calling also flush is
advised according to javadoc of CharsetEncoder):
{noformat}
private void fillBuffer() throws CharacterCodingException {
this.bbuf.compact();
this.encoder.reset();
final CoderResult result = this.encoder.encode(this.cbuf, this.bbuf,
true);
this.encoder.flush(bbuf);
if (result.isError()) {
result.throwException();
}
this.bbuf.flip();
}
{noformat}
this error was silently ignored as we created the encoder like this:
{noformat}
this.encoder = charset.newEncoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE);
{noformat}
Another thing that I encountered, the CharsetEncoder always precedes the data
with with the byte-order mark when using UTF-16 charset, so we might get
strange behavior in the output bytes, as the byte-order mark may also appear
multiple times (everytime the fillBuffer method is called) in the middle of the
data. The number of bytes actually read do not correspond to the bytes from the
input, as the BOM is inserted. So imho, any charset that produces a BOM does
not work for the CharSequenceInputStream atm, and I have not yet an idea how to
fix this.
> CharSequenceInputStream#reset() behaves incorrectly in case when buffer size
> is not dividable by data size
> ----------------------------------------------------------------------------------------------------------
>
> Key: IO-356
> URL: https://issues.apache.org/jira/browse/IO-356
> Project: Commons IO
> Issue Type: Bug
> Components: Streams/Writers
> Affects Versions: 2.4
> Reporter: Dmitry Katsubo
> Attachments: CharSequenceInputStreamTest.java
>
>
> The size effect happens when buffer size of input stream is not dividable by
> requested data size. The bug is hidden in {{CharSequenceInputStream#reset()}}
> method which should also call (I think) {{bbuf.limit(0)}} otherwise next call
> to {{CharSequenceInputStream#read()}} will return the remaining tail which
> {{bbuf}} has accumulated.
> In the attached test case the test fails, if {{dataSize = 13}} (not dividable
> by 10) and runs OK if {{dataSize = 20}} (dividable by 10).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira