Re: Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?

2022-05-19 Thread Claes Redestad

Yes, I think 20% is close to an upper bound, with less gain on smaller
strings due to improved cache locality. And, as you say, such a small
speed-up might not be very noticeable in practice when I/O is involved.

/Claes

Re: Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?

2022-05-19 Thread Glavo
Thank you very much for your work; it seems that the decoding speed of the
new version of the JDK has been greatly improved.

However, `hasNegatives` still has a cost on my device. I tried removing the
call to `hasNegatives` in `String.encodeUTF8` based on JDK 17.0.3, then
tested its performance with JMH. I ran the benchmark on two of my x86
machines (one with a Ryzen 7 3700X CPU, the other with a Xeon W-2175)
and got about a 20% performance boost on both machines.
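For a rough idea of the setup, a plain-Java harness along these lines can approximate such a measurement (JMH gives far more trustworthy numbers; the iteration counts and string contents here are arbitrary):

```java
import java.nio.charset.StandardCharsets;

// Rough micro-measurement of String -> UTF-8 encoding cost.
// This is only an illustration of the setup; JMH handles warmup,
// dead-code elimination and statistics much more reliably.
public class EncodeBench {
    static long measure(String s, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // getBytes(UTF_8) includes the hasNegatives scan being discussed
            sink += s.getBytes(StandardCharsets.UTF_8).length;
        }
        long elapsed = System.nanoTime() - start;
        // Use the sink so the JIT cannot eliminate the loop as dead code
        if (sink == 42) System.out.println("unlikely");
        return elapsed / iterations; // average ns per encode
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(1024);
        for (int w = 0; w < 5; w++) measure(ascii, 100_000); // warm up the JIT
        System.out.println("avg ns/op: " + measure(ascii, 100_000));
    }
}
```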

Of course, the decoding performance in the new JDK is better than I
expected. For an ASCII string of length 1, `hasNegatives` takes about 150
nanoseconds, and the duration increases linearly with the length.
Considering that these operations usually accompany I/O, the overhead is
really not as noticeable as I thought.

Thank you again for your contribution to OpenJDK and your patience in
replying.



Re: Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?

2022-05-18 Thread Claes Redestad

Hi,

so your suggestion is that String::coder would be ASCII if all
code points are 0-127, LATIN1 if it contains at least one char in
the 128-255 range, and otherwise UTF16? As you say, this would mean we
could skip the StringCoding::hasNegatives scans and similar overheads
when converting known-to-be-ASCII Strings to UTF-8 and other
ASCII-compatible encodings. But it might also complicate logic in various
places, so we need a clear win to motivate such work.
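The suggested three-way classification can be sketched as follows (the names and shape here are illustrative only; java.lang.String's real internals use different, package-private machinery, and today only LATIN1 and UTF16 exist):

```java
// Hypothetical three-way coder classification for the proposal above.
// ASCII is the suggested new value; these constants are illustrative,
// not JDK API.
public class CoderSketch {
    static final byte ASCII = 0, LATIN1 = 1, UTF16 = 2;

    static byte classify(String s) {
        byte coder = ASCII;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) return UTF16;   // cannot fit in one byte per char
            if (c > 0x7F) coder = LATIN1; // one byte, but outside ASCII
        }
        return coder;
    }
}
```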

So the first question to answer might be how much the hasNegatives calls
cost us. Depending on your hardware (and JDK version) the answer might be
"almost nothing"!

I did some work last year to speed up encoding and decoding ASCII-only
data to/from various encodings. When I was done, there was no longer much
of a difference when encoding from String to an ASCII-compatible charset
[1]. Maybe a scaling ~10% advantage for latin-1 in some microbenchmarks
when decoding really large strings [2].

But with the suggested change the decoding advantage would likely
disappear: we maintain the invariant that Strings are equal only if
their coders are equal, so now we'd have to scan latin-1 encoded
streams for non-ASCII bytes to disambiguate between ASCII and LATIN1.


Maybe there are some other cases that would be helped, but with some
likely negatives and only a modest potential win, my gut feeling is
that this wouldn't pay off.

Best regards
Claes

[1] https://cl4es.github.io/2021/10/17/Faster-Charset-Encoding.html
[2] https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html

(All the improvements discussed in the blog entries should be available
since 17.0.2.)




Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?

2022-05-18 Thread Glavo
After the introduction of compact strings in Java 9, String may store its
byte array encoded as either Latin-1 or UTF-16.
Here's a troublesome thing: Latin-1 is not compatible with UTF-8, and not
all Latin-1 byte strings are legal UTF-8 byte strings.
They are only compatible within the ASCII range; for code points greater
than 127, Latin-1 uses a one-byte representation, while UTF-8 requires
two bytes.

As an example, every time `JavaLangAccess::getBytesNoRepl` is called to
convert a string to a UTF-8 array, the internal implementation needs to
call `StringCoding.hasNegatives` to scan the backing byte array and
determine whether the string can be encoded as ASCII, which would allow
it to eliminate array copies. Similar steps are performed when calling
`str.getBytes(UTF_8)`.
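Logically, that check amounts to the scalar loop below (the real StringCoding.hasNegatives is a HotSpot intrinsic that the JIT vectorizes; this sketch only shows its semantics):

```java
// Logical equivalent of StringCoding.hasNegatives: in a Latin-1-coded
// byte[], a negative byte is a value >= 0x80, i.e. a non-ASCII character.
// The JDK's version is intrinsified and vectorized by the JIT.
public class HasNegativesSketch {
    static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) return true; // found a byte outside 0x00-0x7F
        }
        return false;
    }
}
```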

So, is it possible to introduce a third possible value for `String::coder`:
ASCII? This looks like an attractive option, and if we did this, we could
introduce fast paths for many methods.

Of course, I know this change is not completely free: some methods may
suffer slight performance degradation due to the need to check the coder;
in particular, there may be an impact on the performance of
StringBuilder/StringBuffer. However, given that UTF-8 is by far the most
commonly used file encoding, the performance benefits of the fast paths
are likely to cover more scenarios. Beyond this, other ASCII-compatible
encodings may also benefit, such as GB18030 (GBK) or the ISO 8859
variants. And if frozen arrays were introduced into the JDK, there would
be even more scenarios that could enjoy performance improvements.
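To illustrate what "ASCII compatible" buys here: for pure-ASCII text these charsets all produce byte-identical output, so a known-ASCII string could in principle hand out or share its backing bytes without any transcoding (a small demonstration of the property, not of JDK behavior):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// For pure-ASCII input, UTF-8, ISO-8859-1 (Latin-1) and GB18030 all
// encode to the same bytes: that is what makes an ASCII fast path
// applicable to every ASCII-compatible charset at once.
public class AsciiCompatDemo {
    public static void main(String[] args) {
        String s = "plain ASCII text 123";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] gb = s.getBytes(Charset.forName("GB18030"));
        System.out.println(Arrays.equals(utf8, latin1) && Arrays.equals(utf8, gb));
    }
}
```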

So I would like to ask: is it possible for the JDK to improve String
storage in a similar way in the future?
Has anyone explored this issue before?

Sorry to bother you all, but I'm very much looking forward to an answer
to this question.