[jira] [Comment Edited] (IO-356) CharSequenceInputStream#reset() behaves incorrectly in case when buffer size is not dividable by data size

Thomas Neidhart (JIRA) Sun, 24 Feb 2013 07:34:14 -0800

    [ 
https://issues.apache.org/jira/browse/IO-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585387#comment-13585387
 ]


Thomas Neidhart edited comment on IO-356 at 2/24/13 3:33 PM:
-------------------------------------------------------------

In the reset method we have to modify also the bbuf variable, as it actually 
contains the data to be read.
If you take a look at the fillBuffer method, it actually fills bbuf with as 
many data from cbuf as bbuf can hold.

A simple scenario, input string is (AAAAABBBBB), bbuf has a buffer size of 10:

{noformat}
  is.mark();
  byte[] data = new byte[5];
  is.read(data);
{noformat}

data will contain "AAAAA", but bbuf will contain the full input string 
(AAAAABBBBB).
When we now call:

{noformat}
  is.reset();
  is.read(data);
{noformat}

we just reposition cbuf, but we continue to read from bbuf, thus the result 
will be "BBBBB".
I think calling bbuf.limit(0) is correct and simple, although it might be 
possible to improve it.

Regarding the other failing unit tests:

We do call the encode method with the endOfInput flag always set to true. This 
means we have to reset the coder the next time we use it (calling also flush is 
advised according to javadoc of CharsetEncoder):

{noformat}
    private void fillBuffer() throws CharacterCodingException {
        this.bbuf.compact();
        this.encoder.reset();
        final CoderResult result = this.encoder.encode(this.cbuf, this.bbuf, 
true);
        this.encoder.flush(bbuf);
        if (result.isError()) {
            result.throwException();
        }
        this.bbuf.flip();
    }
{noformat}

This was probably introduced as the CharsetEncoder always precedes the data 
with with the byte-order mark when using UTF-16 charset.
In that way, the BOM is only output the first time the encoder is called, but 
it also means that mark/reset will not work with such charsets, as subsequent 
calls will not generate the BOM again.

I do not know how to fix this in a clean way atm, but I would consider the 
CharSequenceInputStream for UTF-16 charset as broken and document it.
                
      was (Author: tn):
    In the reset method we have to modify also the bbuf variable, as it 
actually contains the data to be read.
If you take a look at the fillBuffer method, it actually fills bbuf with as 
many data from cbuf as bbuf can hold.

A simple scenario, input string is (AAAAABBBBB), bbuf has a buffer size of 10:

{noformat}
  is.mark();
  byte[] data = new byte[5];
  is.read(data);
{noformat}

data will contain "AAAAA", but bbuf will contain the full input string 
(AAAAABBBBB).
When we now call:

{noformat}
  is.reset();
  is.read(data);
{noformat}

we just reposition cbuf, but we continue to read from bbuf, thus the result 
will be "BBBBB".
I think calling bbuf.limit(0) is correct and simple, although it might be 
possible to improve it.

Regarding the other failing unit tests:

We do call the encode method with the endOfInput flag always set to true. This 
means we have to reset the coder the next time we use it (calling also flush is 
advised according to javadoc of CharsetEncoder):

{noformat}
    private void fillBuffer() throws CharacterCodingException {
        this.bbuf.compact();
        this.encoder.reset();
        final CoderResult result = this.encoder.encode(this.cbuf, this.bbuf, 
true);
        this.encoder.flush(bbuf);
        if (result.isError()) {
            result.throwException();
        }
        this.bbuf.flip();
    }
{noformat}

this error was silently ignored as we created the encoder like this:

{noformat}
        this.encoder = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
{noformat}

Another thing that I encountered, the CharsetEncoder always precedes the data 
with with the byte-order mark when using UTF-16 charset, so we might get 
strange behavior in the output bytes, as the byte-order mark may also appear 
multiple times (everytime the fillBuffer method is called) in the middle of the 
data. The number of bytes actually read do not correspond to the bytes from the 
input, as the BOM is inserted. So imho, any charset that produces a BOM does 
not work for the CharSequenceInputStream atm, and I have not yet an idea how to 
fix this.
                  
> CharSequenceInputStream#reset() behaves incorrectly in case when buffer size 
> is not dividable by data size
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: IO-356
>                 URL: https://issues.apache.org/jira/browse/IO-356
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Streams/Writers
>    Affects Versions: 2.4
>            Reporter: Dmitry Katsubo
>         Attachments: CharSequenceInputStreamTest.java
>
>
> The size effect happens when buffer size of input stream is not dividable by 
> requested data size. The bug is hidden in {{CharSequenceInputStream#reset()}} 
> method which should also call (I think) {{bbuf.limit(0)}} otherwise next call 
> to {{CharSequenceInputStream#read()}} will return the remaining tail which 
> {{bbuf}} has accumulated.
> In the attached test case the test fails, if {{dataSize = 13}} (not dividable 
> by 10) and runs OK if {{dataSize = 20}} (dividable by 10).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (IO-356) CharSequenceInputStream#reset() behaves incorrectly in case when buffer size is not dividable by data size

Reply via email to