On 6/9/18, 3:27 AM, Florian Weimer wrote:
Lately I've been thinking about string representation.  The world
turned out not to be UCS-2 or UTF-16, after all, and we often have to
deal with strings generally encoded as ASCII or UTF-8, but they aren't
always encoded this way (and there might not even be a charset
declaration, see the ELF spec).

(a) byte[] with defensive copies.
     Internal storage is byte[], copy is made before returning it to
     the caller.  Quite common across the JDK.

(b) byte[] without defensive copies.
     Internal storage is byte[], and a reference is returned.  In the
     past, this could be a security bug, and usually, it was adjusted
     to (a) when noticed.  Without security requirements, this can be
     quite efficient, but there is ample potential for API misuse.

(c) java.lang.String with ISO-8859-1 decoding/encoding.
     Sometimes done by reconfiguring the entire JVM to run with
     ISO-8859-1, usually so that it is possible to process malformed
     UTF-8.  The advantage is that there is rich API support, including
     regular expressions, and good optimization.  There is also
     language support for string literals.

(d) java.lang.String with UTF-8 decoding/encoding and replacement.
     This seems to be very common, but is not completely accurate
     and can lead to subtle bugs (or completely non-processible
     data).  Otherwise has the same advantages as (c).

(e) Various variants of ByteBuffer.
     Have not seen this much in practice (outside binary file format
     parsers).  In the past, it needed deep defensive copies on input
     for security (because there isn't an immutably backed ByteBuffer),
     and shallow copies for access.  The ByteBuffer objects themselves
     are also quite heavy when they can't be optimized away.  For that
     reason, probably most useful on interfaces, and not for storage.

(f) Custom, immutable ByteString class.
     Quite common, but has cross-library interoperability issues,
     and a full complement of support (matching java.lang.String)
     is quite hard.

(g) Something based on VarHandle.
     Haven't seen this yet.  Probably not useful for storage.
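
To make the difference between (c) and (d) concrete, here is a minimal
sketch of a byte round trip through java.lang.String. The class name
RoundTripDemo and the helper roundTrips are made up for illustration; the
sample input is one arbitrary malformed UTF-8 sequence, not data from any
real file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Hypothetical helper: decode the bytes with the given charset,
    // encode them again, and report whether the bytes survived unchanged.
    public static boolean roundTrips(byte[] raw, Charset charset) {
        String decoded = new String(raw, charset);
        byte[] encoded = decoded.getBytes(charset);
        return Arrays.equals(raw, encoded);
    }

    public static void main(String[] args) {
        // 0xC3 announces a two-byte UTF-8 sequence, but 0x28 is not a
        // valid continuation byte, so this input is malformed UTF-8.
        byte[] raw = { 'a', (byte) 0xC3, 0x28, 'b' };

        // Option (c): ISO-8859-1 maps every byte to exactly one char.
        System.out.println(roundTrips(raw, StandardCharsets.ISO_8859_1)); // true

        // Option (d): the malformed byte is replaced with U+FFFD on
        // decoding, so the original bytes cannot be recovered.
        System.out.println(roundTrips(raw, StandardCharsets.UTF_8)); // false
    }
}
```

With (d), the replacement happens silently inside the String constructor,
which is probably why the resulting bugs tend to be subtle.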

Anything that I have missed?

Considering these choices, what is the expected direction on the JDK
side for new code?  Option (d) for things generally ASCII/UTF-8, and
(b) for things of a more binary nature?  What to do if the choice is
difficult?

Hi Florian,

Some comments about the j.l.String storage.

Ideally, I think we would want UTF-8 as the internal storage for String,
even though in theory UTF-8 is supposed to be the external encoding and
UTF-16 the internal one. I did have a byte[]/UTF-8 prototype implementation
when we did compact strings for JDK 9, but it was eventually dropped because
of the potential performance regression for index-based access, such as the
basic String.charAt(int): you have to count from the beginning to locate
the target character every time. But I think we might want to try it again
later, especially for use cases where index-based access performance is not
that critical and throughput, meaning input from and output to the external
UTF-8/byte[] world, matters more. Given that we are heading toward UTF-8 as
the default encoding for the JVM [1], I think we might want to at least
provide an alternative that lets you "optionally" do that for String objects.
The idea might go further (wild, just an idea, not necessarily something we
really want to do :-) for Java String) to other charsets, so you could simply
store the byte[] (verified to contain nothing malformed/unmappable) +
charsetId directly when creating a String object. This might be useful and
efficient in scenarios where the String object is simply a vehicle to carry a
sequence of characters back and forth between a front-end server and a
back-end server, and the JVM is simply passing them through.
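
To illustrate the index-based access cost, here is a rough sketch of what a
lookup over UTF-8 storage has to do. This is not the prototype code; the
class Utf8Index and its methods are hypothetical, the input is assumed to be
already validated UTF-8, and for simplicity the index counts code points
rather than the UTF-16 code units that String.charAt(int) uses.

```java
public class Utf8Index {
    // Length in bytes of the UTF-8 sequence that starts with this lead
    // byte. Assumes valid UTF-8, so continuation bytes never appear here.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }

    // Returns the code point at the given code point index. The loop has
    // to walk over every preceding sequence, so the cost is O(index)
    // instead of the O(1) array access a fixed-width layout allows.
    // An out-of-range index ends in ArrayIndexOutOfBoundsException.
    public static int codePointAt(byte[] utf8, int index) {
        int pos = 0;
        for (int i = 0; i < index; i++) {
            pos += sequenceLength(utf8[pos]);
        }
        int b = utf8[pos] & 0xFF;
        switch (sequenceLength(utf8[pos])) {
            case 1:
                return b;
            case 2:
                return ((b & 0x1F) << 6) | (utf8[pos + 1] & 0x3F);
            case 3:
                return ((b & 0x0F) << 12)
                        | ((utf8[pos + 1] & 0x3F) << 6)
                        | (utf8[pos + 2] & 0x3F);
            default:
                return ((b & 0x07) << 18)
                        | ((utf8[pos + 1] & 0x3F) << 12)
                        | ((utf8[pos + 2] & 0x3F) << 6)
                        | (utf8[pos + 3] & 0x3F);
        }
    }
}
```

A loop that calls such a method for every index becomes quadratic in the
string length, which is the kind of regression mentioned above.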

A defensive copy when getting byte[] in and out of a String object still
seems inevitable for now, until we have something like a "read-only" byte[],
given String's immutability commitment.
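
As a small illustration of that commitment, the sketch below shows that
writes to the caller's array are not visible through the String, in either
direction. The class DefensiveCopyDemo and its method names are made up for
this example; only the String constructor and getBytes are real API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefensiveCopyDemo {
    // Way in: the String decodes into its own storage, so overwriting
    // the caller's array afterwards does not change the String.
    public static String decodeThenScribble(byte[] input) {
        String s = new String(input, StandardCharsets.US_ASCII);
        Arrays.fill(input, (byte) '?');
        return s;
    }

    // Way out: getBytes hands back a fresh array on every call, so
    // overwriting one result does not affect the String or later calls.
    public static boolean survivesOutputMutation(String s) {
        byte[] first = s.getBytes(StandardCharsets.US_ASCII);
        byte[] expected = first.clone();
        Arrays.fill(first, (byte) '?');
        return Arrays.equals(expected, s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        byte[] input = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeThenScribble(input));     // abc
        System.out.println(survivesOutputMutation("abc")); // true
    }
}
```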

Regards,
Sherman

[1] https://bugs.openjdk.java.net/browse/JDK-8187041
