[GitHub] [commons-io] Marcono1234 commented on pull request #293: Refactor `ReaderInputStream` implementation

GitBox Sat, 16 Apr 2022 16:50:57 -0700


Marcono1234 commented on PR #293:
URL: https://github.com/apache/commons-io/pull/293#issuecomment-1100772224


   Sorry for not having responded earlier, and for not properly following up on 
this.
   
   The issues mentioned in the description still exist on `master`:
   - `ReaderInputStream` erroneously overwrites `lastCoderResult`
   This can for example be seen with the following code snippet. The string has 
a trailing unpaired surrogate, which is therefore malformed. The encoder is 
created to throw on malformed input. However, currently `read()` erroneously 
just returns -1 instead of throwing an exception.
       ```java
       // Encoder which throws on malformed or unmappable input
       CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
       ReaderInputStream in = new ReaderInputStream(new StringReader("\uD800"), 
encoder);
       System.out.println("Read: " + in.read());
       ```
   - `CharSequenceInputStream.available()` makes erroneous assumptions about 
conversion ratio
   Currently the implementation assumes that for each `char` at least one 
`byte` will be encoded. This might be incorrect for some specific encodings 
(have not found an example for this yet though), but is definitely incorrect 
for the default malformed input action of `REPLACE`. With this, encodings 
replace supplementary code points (consisting of two `char`s) with a single 
`byte`. This can been seen here:
       ```java
       Charset charset = Charset.forName("Big5");
       CharSequenceInputStream in = new CharSequenceInputStream("\uD800\uDC00", 
charset);
       System.out.println("Available: " + in.available());
       // Note: readAllBytes() is a method added in Java 9
       System.out.println("Actually read: " + in.readAllBytes().length);
       ```
       
       `available()` returns 2, but only 1 byte can be read.
   
   I asked for feedback for the following reasons, but did not describe them 
very well, or at all, in the original description:
   - I was not sure what your opinion about these larger refactorings are
   - The refactored `CharSequenceInputStream` made the `available()` 
implementation more correct (see above mentioned issue), but made the results 
less useful. This is also why the tests were failing because they were still 
expecting specific `available()` results, which cannot necessarily be 
guaranteed anymore. The tests could be rewritten to only check the 
`available()` result against the maximum, but not impose any minimum anymore. 
But I first wanted to hear your opinion on whether that is fine.
   - The new `CharSequenceInputStream` implementation is basically a wrapped 
`ReaderInputStream` with a bunch of boilerplate code to support `mark` and 
`reset`. Additionally the new `mark` implementation considers its parameter 
value now, which differs from the previous behavior. Is this new 
`CharSequenceInputStream` implementation reasonable?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [commons-io] Marcono1234 commented on pull request #293: Refactor `ReaderInputStream` implementation

Reply via email to