Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?

Glavo Wed, 18 May 2022 05:07:59 -0700

After the introduction of compact strings in Java 9, the current String may
store byte arrays encoded as Latin1 or UTF-16.
Here's a troublesome thing: Latin1 is not compatible with UTF-8. Not all
Latin1 byte strings are legal UTF-8 byte strings:
They are only compatible within the ASCII range, when there are code points
greater than 127, Latin1 uses a one-byte
representation, while UTF-8 requires two bytes.


As an example, every time `JavaLangAccess::getBytesNoRepl` is called to
convert a string to a UTF-8 array,
the internal implementation needs to call `StringCoding.hasNegatives` to
scan the content byte array to determine that
the string can be encoded in ASCII, and thus eliminate array copies.
Similar steps are performed when calling `str.getBytes(UTF_8)`.

So, is it possible to introduce a third possible value for `String::coder`:
ASCII?
This looks like an attractive option, and if we do this, we can introduce
fast paths for many methods.

Of course, I know that this change is not completely free, and some methods
may bring slight performance
degradation due to the need to judge the coder, in particular, there may be
an impact on the performance of the
StringBuilder/StringBuffer. However, given that UTF-8 is by far the most
commonly used file encoding, the performance
benefits of fast paths are likely to cover more scenarios. In addition to
this, other ASCII compatible encodings may
also benefit, such as GB18030(GBK), or ISO 8859 variants. And if frozen
arrays were introduced into the JDK,
there would be more scenarios to enjoy performance improvements.

So I would like to ask, is it possible for JDK to improve String storage in
a similar way in the future?
Has anyone explored this issue before?

Sorry to bother you all, but I'm very much looking forward to the answer to
this question.

Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?

Reply via email to