I'm glad to see you are thinking about this, Florian. You appear to be aiming at a way to compactly store and manipulate series of octets (in an arbitrary encoding) with an emphasis on using those octets to represent strings, in the usual sense of character sequences.
Would you agree that this design problem factors well into a generic problem of storing and manipulating octet sequences, plus a detachable upper layer that allows strings (in various encodings) to be extracted from those sequences?

I think the sweet spot here is to introduce a "stringy but char-free" API which commits to dealing with chunks of memory (viewed as octet sequences), regardless of how those chunks will be interpreted. In https://bugs.openjdk.java.net/browse/JDK-8161256 I discuss this nascent API under the name "ByteSequence", which is analogous to CharSequence, but doesn't mention the types 'char' or 'String'.

By "stringy" I mean that there are natural ways to index or partition an existing sequence of octets, or concatenate multiple sequences into one. I also mean that immutability plays a strong role, enabling algorithms to work without defensive copies. Making it an interface like CharSequence means we can use backing stores like ByteBuffer or byte[], or more exotic things like Panama native memory, interoperably.

Here are some uses for an immutable octet sequence type:

- manipulation of non-UTF16 character data (which you mention)
- zero copy views (slices, modifiable or not) into existing types (ByteBuffer, byte[], etc.)
- zero copy views into file images (N.B. requires a 'long' size property, not 'int')
- zero copy views to intra-classfile resources (CONSTANT_Bytes)
- backing stores for Panama data structures and smart pointers
- copy-reduced scatter and gather nodes associated with packet processing
- octet-level cursors for parsers, scanners, packet decoders, etc.

If the ByteSequence views are value instances, they can be created at a very high rate with little or no GC impact. Generic algorithms would still operate on them.

A mutable octet sequence class, analogous to StringBuilder, would allow immutable sequences to be built with fewer intermediate copies, just like with StringBuilder.
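
To make the shape of such an interface concrete, here is a rough sketch. Every name and signature in it (length, byteAt, subSequence, toByteArray, decode, copyOf) is an illustrative assumption and not the API discussed in JDK-8161256; it only shows a char-free, long-indexed, immutable view with zero-copy slicing and a detachable decoding step.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only: names and signatures are assumptions,
// not the ByteSequence API proposed in JDK-8161256.
interface ByteSequence {
    // 'long' rather than 'int', so views into large file images fit.
    long length();

    byte byteAt(long index);

    // Zero-copy slice: the result is a view that delegates to this sequence.
    default ByteSequence subSequence(long start, long end) {
        if (start < 0 || end < start || end > length()) {
            throw new IndexOutOfBoundsException("[" + start + ", " + end + ")");
        }
        ByteSequence outer = this;
        return new ByteSequence() {
            @Override public long length() { return end - start; }
            @Override public byte byteAt(long index) {
                if (index < 0 || index >= end - start) {
                    throw new IndexOutOfBoundsException(Long.toString(index));
                }
                return outer.byteAt(start + index);
            }
        };
    }

    // Copies the octets out; only works for sequences that fit in a byte[].
    default byte[] toByteArray() {
        byte[] out = new byte[Math.toIntExact(length())];
        for (int i = 0; i < out.length; i++) {
            out[i] = byteAt(i);
        }
        return out;
    }

    // The detachable "upper layer": interpret the octets in a given charset.
    default String decode(Charset charset) {
        return new String(toByteArray(), charset);
    }

    // Factory returning a non-public implementation, in the spirit of List::of.
    // The single copy made here is what makes the result reliably immutable.
    static ByteSequence copyOf(byte[] bytes) {
        byte[] copy = bytes.clone();
        return new ByteSequence() {
            @Override public long length() { return copy.length; }
            @Override public byte byteAt(long index) {
                if (index < 0 || index >= copy.length) {
                    throw new IndexOutOfBoundsException(Long.toString(index));
                }
                return copy[(int) index];
            }
        };
    }
}

class ByteSequenceDemo {
    public static void main(String[] args) {
        ByteSequence all = ByteSequence.copyOf(
                "hello world".getBytes(StandardCharsets.UTF_8));
        ByteSequence tail = all.subSequence(6, 11);   // view, no copy
        System.out.println(tail.decode(StandardCharsets.UTF_8)); // prints "world"
    }
}
```

The slice never copies the backing array; the only copies happen once in the factory and once when a caller explicitly asks for a byte[] or a String.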
If the API is properly defined it can be inserted directly into existing types like ByteBuffer. Doing this will probably require us to polish ByteBuffer a little, adding immutability as an option and lifting the 32-bit limits. It should be possible to "freeze" a ByteBuffer or array and use it as a backing store that is reliably immutable, so it can be handed to zero-copy algorithms that work with ByteSequences. For some of this see https://bugs.openjdk.java.net/browse/JDK-8180628 .

Independently, I want to eventually add frozen arrays, including frozen byte[] arrays, to the JVM, but that doesn't cover zero-copy use cases; it has to be an interface like CharSequence.

So the option I prefer is not on your list; it would be:

(h) ByteSequence interface with retrofits to ByteBuffer, byte[], etc.

This is more flexible than (f) the concrete ByteString class. I think the ByteString you are thinking of would appear as a non-public class created by a ByteSequence factory, analogous to List::of.

— John

On Jun 9, 2018, at 3:27 AM, Florian Weimer <f...@deneb.enyo.de> wrote:
>
> Lately I've been thinking about string representation. The world
> turned out not to be UCS-2 or UTF-16, after all, and we often have to
> deal with strings generally encoded as ASCII or UTF-8, but we aren't
> always encoded this way (and there might not even be a charset
> declaration, see the ELF spec).
>
> (a) byte[] with defensive copies.
>     Internal storage is byte[], copy is made before returning it to
>     the caller. Quite common across the JDK.
>
> (b) byte[] without defensive copies.
>     Internal storage is byte[], and a reference is returned. In the
>     past, this could be a security bug, and usually, it was adjusted
>     to (a) when noticed. Without security requirements, this can be
>     quite efficient, but there is ample potential for API misuse.
>
> (c) java.lang.String with ISO-8859-1 decoding/encoding.
>     Sometimes done by reconfiguring the entire JVM to run with
>     ISO-8859-1, usually so that it is possible to process malformed
>     UTF-8. The advantage is that there is rich API support, including
>     regular expressions, and good optimization. There is also
>     language support for string literals.
>
> (d) java.lang.String with UTF-8 decoding/encoding and replacement.
>     This seems to be very common, but is not completely accurate
>     and can lead to subtle bugs (or completely non-processible
>     data). Otherwise has the same advantages as (c).
>
> (e) Various variants of ByteBuffer.
>     Have not seen this much in practice (outside binary file format
>     parsers). In the past, it needed deep defensive copies on input
>     for security (because there isn't an immutably backed ByteBuffer),
>     and shallow copies for access. The ByteBuffer objects themselves
>     are also quite heavy when they can't be optimized away. For that
>     reason, probably most useful on interfaces, and not for storage.
>
> (f) Custom, immutable ByteString class.
>     Quite common, but has cross-library interoperability issues,
>     and a full complement of support (matching java.lang.String)
>     is quite hard.
>
> (g) Something based on VarHandle.
>     Haven't seen this yet. Probably not useful for storage.
>
> Anything that I have missed?
>
> Considering these choices, what is the expected direction on the JDK
> side for new code? Option (d) for things generally ASCII/UTF-8, and
> (b) for things of a more binary nature? What to do if the choice is
> difficult?