[
https://issues.apache.org/jira/browse/IO-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793635#comment-17793635
]
Miguel Munoz edited comment on IO-781 at 12/6/23 11:01 AM:
-----------------------------------------------------------
[~Marcono1234] Does this only happen with surrogate pairs? If so, I'm not sure
how to fix it, but before we do, I wonder if it would make sense to add a note
in the documentation that the {{available()}} method currently may return a
value that is too large when the text includes surrogate pairs. For
information on surrogate pairs, read Unicode's description at
[Glossary (unicode.org)|https://www.unicode.org/glossary/]
Surrogates are code points, but not Unicode characters. They have distinct
ranges: U+D800 to U+DBFF for the first (high) surrogate, and U+DC00 to U+DFFF
for the second (low) surrogate. A surrogate is only valid as half of a
high/low pair. So it shouldn't be hard to scan and identify them. If I
understand them correctly, each surrogate pair, such as "\uD800\uDC00"
(U+10000), decodes to a single character.
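To illustrate the scan mentioned above, here is a minimal sketch using only the JDK's {{Character.isHighSurrogate}} and {{Character.isLowSurrogate}}. The class and method names ({{SurrogateScan}}, {{countSurrogatePairs}}, {{hasUnpairedSurrogate}}) are hypothetical and not part of Commons IO; this only shows how pairs could be detected, not how {{available()}} should be fixed.

```java
public class SurrogateScan {

    /** Counts well-formed surrogate pairs (a high surrogate directly followed by a low one). */
    static int countSurrogatePairs(CharSequence text) {
        int pairs = 0;
        for (int i = 0; i + 1 < text.length(); i++) {
            if (Character.isHighSurrogate(text.charAt(i))
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                pairs++;
                i++; // skip the low surrogate that completes this pair
            }
        }
        return pairs;
    }

    /** Returns true if the text contains a surrogate that is not part of a well-formed pair. */
    static boolean hasUnpairedSurrogate(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++; // well-formed pair, skip its second half
                } else {
                    return true; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate without a preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "a\uD800\uDC00b";
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("surrogate pairs: " + countSurrogatePairs(s));
        System.out.println("unpaired surrogate: " + hasUnpairedSurrogate(s));
    }
}
```

Each well-formed pair takes two chars but is one code point, so a char-based size estimate overcounts by one for every pair found.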
> CharSequenceInputStream.available() returns too large numbers in some cases
> ---------------------------------------------------------------------------
>
> Key: IO-781
> URL: https://issues.apache.org/jira/browse/IO-781
> Project: Commons IO
> Issue Type: Bug
> Components: Streams/Writers
> Affects Versions: 2.11.0
> Reporter: Marcono1234
> Priority: Major
>
> h3. Description
> The {{available()}} method of
> {{org.apache.commons.io.input.CharSequenceInputStream}} erroneously returns
> values larger than the actual number of available bytes in some cases.
> The underlying issue is that {{CharSequenceInputStream}} makes incorrect
> assumptions about the relation between chars and bytes. With
> {{CodingErrorAction.REPLACE}}, 2 chars (1 supplementary code point) can be
> encoded as a single byte (the replacement char {{?}}). Additionally, if
> {{CharSequenceInputStream}} is ever extended to support specifying a
> {{CharsetEncoder}}, {{CodingErrorAction.IGNORE}} would probably cause
> similar issues. There might also be some uncommon charsets which can encode 2
> chars to 1 byte, though I am not aware of such a charset.
> This was originally mentioned in pull request
> [#293|https://github.com/apache/commons-io/pull/293]. That PR also proposed
> replacing the underlying {{CharSequenceInputStream}} implementation with
> {{ReaderInputStream}}: using {{CharsetEncoder}} is generally error-prone, so
> it might be good to avoid having two classes implement logic on top of it.
> (Potentially {{CharSequenceInputStream}} is missing a call to
> {{CharsetEncoder.flush}}; see also IO-714.)
> h3. Example
> In the example below {{available()}} erroneously returns 2 even though only 1
> byte can be read.
> {code}
> import java.nio.charset.Charset;
> import org.apache.commons.io.input.CharSequenceInputStream;
>
> Charset charset = Charset.forName("Big5");
> CharSequenceInputStream in = new CharSequenceInputStream("\uD800\uDC00", charset);
> // BUG: available() returns 2 but only 1 byte is read afterwards
> System.out.println("Available: " + in.available());
> // Note: readAllBytes() is a method added in Java 9
> System.out.println("Actually read: " + in.readAllBytes().length);
> {code}
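The char-to-byte mismatch described above can be reproduced with the JDK alone. The sketch below is not the Commons IO implementation: the class and method names ({{EncodedLength}}, {{encodedLength}}) are hypothetical, and US-ASCII stands in for Big5 as an assumption so the example does not depend on extended charsets being installed. It encodes with {{CodingErrorAction.REPLACE}} and reports how many bytes actually come out.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodedLength {

    /** Encodes the text with REPLACE error actions and returns the number of bytes produced. */
    static int encodedLength(CharSequence text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            // encode(CharBuffer) runs the whole encode operation, including the final flush.
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return bytes.remaining();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; rethrown unchecked to keep the sketch simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pair = "\uD800\uDC00"; // one supplementary code point, two chars
        System.out.println("chars: " + pair.length());
        System.out.println("US-ASCII bytes: " + encodedLength(pair, StandardCharsets.US_ASCII));
        System.out.println("UTF-8 bytes: " + encodedLength(pair, StandardCharsets.UTF_8));
    }
}
```

In US-ASCII the two chars collapse into a single replacement byte, while UTF-8 encodes the same pair as four bytes, so no fixed chars-to-bytes ratio holds across charsets.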