I'm glad to see you are thinking about this, Florian.

You appear to be aiming at a way to compactly store and manipulate
series of octets (in an arbitrary encoding) with an emphasis on using
those octets to represent strings, in the usual sense of character sequences.

Would you agree that this design problem factors well into a generic
problem of storing and manipulating octet sequences, plus a detachable
upper layer that allows strings (in various encodings) to be extracted
from those sequences?  I think the sweet spot here is to introduce
a "stringy but char-free" API which commits to dealing with chunks
of memory (viewed as octet sequences), regardless of how those
chunks will be interpreted.

In https://bugs.openjdk.java.net/browse/JDK-8161256 I discuss
this nascent API under the name "ByteSequence", which is analogous
to CharSequence, but doesn't mention the types 'char' or 'String'.

By "stringy" I mean that there are natural ways to index or partition an
existing sequence of octets, or concatenate multiple sequences into
one.  I also mean that immutability plays a strong role, enabling
algorithms to work without defensive copies.  Making it an interface
like CharSequence means we can use backing stores like ByteBuffer
or byte[], or more exotic things like Panama native memory, interoperably.
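As a rough illustration of the layering described above, here is a minimal sketch in Java. Every name in it (ByteSequence's methods, ArrayByteSequence, ByteSequences.decode) is a hypothetical chosen for this example and is not taken from JDK-8161256. The lower layer never mentions 'char' or 'String'; decoding lives in a separate helper.

```java
import java.nio.charset.Charset;
import java.util.Objects;

/** Hypothetical char-free view of a run of octets (names are assumptions). */
interface ByteSequence {
    long length();                                   // 'long' so large file images could fit
    byte byteAt(long index);
    ByteSequence subSequence(long start, long end);  // intended to be a zero-copy slice
}

/** One possible backing store: a slice of a byte[]. */
final class ArrayByteSequence implements ByteSequence {
    private final byte[] bytes;
    private final int offset;
    private final int length;

    // Wraps without copying, so this is only immutable if the caller
    // never writes to the array afterwards.
    ArrayByteSequence(byte[] bytes, int offset, int length) {
        Objects.checkFromIndexSize(offset, length, bytes.length);
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    @Override public long length() { return length; }

    @Override public byte byteAt(long index) {
        int i = Objects.checkIndex(Math.toIntExact(index), length);
        return bytes[offset + i];
    }

    @Override public ByteSequence subSequence(long start, long end) {
        int from = Math.toIntExact(start);
        int to = Math.toIntExact(end);
        Objects.checkFromToIndex(from, to, length);
        // Shares the same array; only the window changes.
        return new ArrayByteSequence(bytes, offset + from, to - from);
    }
}

/** Detachable upper layer: the only place a charset appears. */
final class ByteSequences {
    private ByteSequences() {}

    static String decode(ByteSequence seq, Charset charset) {
        byte[] copy = new byte[Math.toIntExact(seq.length())];
        for (int i = 0; i < copy.length; i++) {
            copy[i] = seq.byteAt(i);
        }
        return new String(copy, charset);
    }
}
```

A ByteBuffer-backed or native-memory-backed implementation could sit behind the same interface, which is the interoperability point made above.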

Here are some uses for an immutable octet sequence type:

 - manipulation of non-UTF16 character data (which you mention)
 - zero copy views (slices, modifiable or not) into existing types
   (ByteBuffer, byte[], etc.)
 - zero copy views into file images (N.B. requires a 'long' size property,
   not 'int')
 - zero copy views to intra-classfile resources (CONSTANT_Bytes)
 - backing stores for Panama data structures and smart pointers
 - copy-reduced scatter and gather nodes associated with packet processing
 - octet-level cursors for parsers, scanners, packet decoders, etc.

If the ByteSequence views are value instances, they can be created
at a very high rate with little or no GC impact.  Generic algorithms
would still operate on them.

A mutable octet sequence class, analogous to StringBuilder, would
allow immutable sequences to be built with fewer intermediate copies,
just like with StringBuilder.

If the API is properly defined it can be inserted directly into existing types
like ByteBuffer.  Doing this will probably require us to polish ByteBuffer
a little, adding immutability as an option and lifting the 32-bit limits.
It should be possible to "freeze" a ByteBuffer or array and use it as
a backing store that is reliably immutable, so it can be handed to
zero-copy algorithms that work with ByteSequences. For some of
this see https://bugs.openjdk.java.net/browse/JDK-8180628 .
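To make the "reliably immutable" point concrete, the sketch below uses only existing java.nio calls to show that a read-only ByteBuffer view is not frozen: whoever still holds the backing array can change what the reader sees. The class and method names (ReadOnlyIsNotFrozen, observeAfterMutation, frozenCopy) are hypothetical, and the copying workaround is just what can be done today, not the freeze operation proposed above.

```java
import java.nio.ByteBuffer;

final class ReadOnlyIsNotFrozen {
    private ReadOnlyIsNotFrozen() {}

    /** Reads through a read-only view after the underlying array was mutated. */
    static byte observeAfterMutation() {
        byte[] backing = {1, 2, 3};
        ByteBuffer view = ByteBuffer.wrap(backing).asReadOnlyBuffer();
        backing[0] = 42;      // the writer still holds the array
        return view.get(0);   // the "read-only" view reflects the write
    }

    /** Current workaround: one defensive copy that nobody else can reach. */
    static ByteBuffer frozenCopy(ByteBuffer src) {
        ByteBuffer copy = ByteBuffer.allocate(src.remaining());
        copy.put(src.duplicate());   // duplicate() leaves src's position untouched
        copy.flip();
        // Only the read-only view escapes, so the copy cannot be written again.
        return copy.asReadOnlyBuffer();
    }
}
```

A true freeze would give the guarantee of frozenCopy without paying for the copy.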

Independently, I want to eventually add frozen arrays, including
frozen byte[] arrays, to the JVM, but frozen arrays alone don't cover
the zero-copy use cases; for those it has to be an interface like
CharSequence.

So the option I prefer is not on your list; it would be:

(h) ByteSequence interface with retrofits to ByteBuffer, byte[], etc.

This is more flexible than (f) the concrete ByteString class.  I think
the ByteString you are thinking of would appear as a non-public class
created by a ByteSequence factory, analogous to List::of.
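One way to picture the List::of analogy is the sketch below, which assumes a cut-down ByteSequence with a static factory. The factory name of(...) and the class name ImmutableByteString are hypothetical. Callers see only the interface, while the concrete class stays package-private.

```java
import java.util.Arrays;

/** Hypothetical minimal interface, restated so this snippet stands alone. */
interface ByteSequence {
    long length();
    byte byteAt(long index);

    /** Factory in the style of List.of: copies its argument, hides the class. */
    static ByteSequence of(byte... bytes) {
        return new ImmutableByteString(bytes.clone());
    }
}

/** Package-private implementation; user code never names this type. */
final class ImmutableByteString implements ByteSequence {
    private final byte[] bytes;   // private copy, never exposed

    ImmutableByteString(byte[] privateCopy) {
        this.bytes = privateCopy;
    }

    @Override public long length() { return bytes.length; }

    @Override public byte byteAt(long index) {
        return bytes[Math.toIntExact(index)];
    }

    // Value-style equality, as one would expect from a string-like type.
    @Override public boolean equals(Object o) {
        return o instanceof ImmutableByteString
                && Arrays.equals(bytes, ((ImmutableByteString) o).bytes);
    }

    @Override public int hashCode() { return Arrays.hashCode(bytes); }
}
```

Because the factory owns the only reference to the copied array, the result is immutable without any cooperation from the caller.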

— John

On Jun 9, 2018, at 3:27 AM, Florian Weimer <f...@deneb.enyo.de> wrote:
> 
> Lately I've been thinking about string representation.  The world
> turned out not to be UCS-2 or UTF-16, after all, and we often have to
> deal with strings generally encoded as ASCII or UTF-8, but they aren't
> always encoded this way (and there might not even be a charset
> declaration, see the ELF spec).
> 
> (a) byte[] with defensive copies.
>    Internal storage is byte[], copy is made before returning it to
>    the caller.  Quite common across the JDK.
> 
> (b) byte[] without defensive copies.
>    Internal storage is byte[], and a reference is returned.  In the
>    past, this could be a security bug, and usually, it was adjusted
>    to (a) when noticed.  Without security requirements, this can be
>    quite efficient, but there is ample potential for API misuse.
> 
> (c) java.lang.String with ISO-8859-1 decoding/encoding.
>    Sometimes done by reconfiguring the entire JVM to run with
>    ISO-8859-1, usually so that it is possible to process malformed
>    UTF-8.  The advantage is that there is rich API support, including
>    regular expressions, and good optimization.  There is also
>    language support for string literals.
> 
> (d) java.lang.String with UTF-8 decoding/encoding and replacement.
>    This seems to be very common, but is not completely accurate
>    and can lead to subtle bugs (or completely non-processible
>    data).  Otherwise has the same advantages as (c).
> 
> (e) Various variants of ByteBuffer.
>    Have not seen this much in practice (outside binary file format
>    parsers).  In the past, it needed deep defensive copies on input
>    for security (because there isn't an immutably backed ByteBuffer),
>    and shallow copies for access.  The ByteBuffer objects themselves
>    are also quite heavy when they can't be optimized away.  For that
>    reason, probably most useful on interfaces, and not for storage.
> 
> (f) Custom, immutable ByteString class.
>    Quite common, but has cross-library interoperability issues,
>    and a full complement of support (matching java.lang.String)
>    is quite hard.
> 
> (g) Something based on VarHandle.
>    Haven't seen this yet.  Probably not useful for storage.
> 
> Anything that I have missed?
> 
> Considering these choices, what is the expected direction on the JDK
> side for new code?  Option (d) for things generally ASCII/UTF-8, and
> (b) for things of a more binary nature?  What to do if the choice is
> difficult?
