On Jun 9, 2018, at 12:18 PM, Xueming Shen <xueming.s...@oracle.com> wrote:
>
> Ideally I would assume we would want to have a utf-8 internal storage for
> String, even in theory utf8 is supposed to be used externally and utf16
> to be the internal one.
Separately from my point about ByteSequence, I agree that "doubling down" on Utf8 as a standard format for packed strings is a good idea. A reasonable way to prototype right now would be an implementation of CharSequence that is backed by a byte[] (eventually ByteSequence) and has some sort of fast access (probably streaming) to Utf16 code points. To make it pay for itself the Utf8 encoding should be applicable as an overlay in as many places as possible, including slices of byte[] and ByteBuffer objects, and later ByteSequences.

> Defensive copy when getting byte[] in & out of String object seems still
> inevitable for now, before we can have something like "read-only" byte[],
> given the nature of its immutability commitment.

We didn't need frozen char[] arrays to avoid defensive copying of String objects, only an immutability invariant on the class. We could pull a similar trick with Utf8 by supplying a ByteSequence view of a String's underlying bytes. If the String has underlying chars (Utf16) a view is also possible, although it is more difficult to get right (as you described).

— John
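
For illustration only, a minimal sketch of the kind of prototype described above might look like the following. The class name Utf8Sequence and its constructor are hypothetical and not part of any JDK API. The sketch overlays a slice of a byte[] without copying, assumes the bytes are well-formed UTF-8, and assumes the owner of the array does not mutate it afterwards (the "immutability invariant" is by convention here, not enforced). Code point access is streaming; length() and charAt() have to scan, which is the main cost of this representation.

```java
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

/** Hypothetical sketch: a CharSequence overlay on a slice of UTF-8 bytes. */
final class Utf8Sequence implements CharSequence {
    private final byte[] bytes;    // shared, not copied; assumed well-formed UTF-8
    private final int offset;      // start of the slice, in bytes
    private final int byteLength;  // size of the slice, in bytes

    Utf8Sequence(byte[] bytes, int offset, int byteLength) {
        if (offset < 0 || byteLength < 0 || offset > bytes.length - byteLength) {
            throw new IndexOutOfBoundsException();
        }
        this.bytes = bytes;
        this.offset = offset;
        this.byteLength = byteLength;
    }

    /** Number of bytes in the sequence starting with lead byte b0. */
    private static int width(int b0) {
        if (b0 >= 0) return 1;                // 0xxxxxxx
        if ((b0 & 0xE0) == 0xC0) return 2;    // 110xxxxx
        if ((b0 & 0xF0) == 0xE0) return 3;    // 1110xxxx
        return 4;                             // 11110xxx
    }

    /** Decodes the code point whose lead byte is at byte index p (no validation). */
    private int codePointAt(int p) {
        int b0 = bytes[p];
        switch (width(b0)) {
            case 1:  return b0;
            case 2:  return ((b0 & 0x1F) << 6) | (bytes[p + 1] & 0x3F);
            case 3:  return ((b0 & 0x0F) << 12) | ((bytes[p + 1] & 0x3F) << 6)
                          | (bytes[p + 2] & 0x3F);
            default: return ((b0 & 0x07) << 18) | ((bytes[p + 1] & 0x3F) << 12)
                          | ((bytes[p + 2] & 0x3F) << 6) | (bytes[p + 3] & 0x3F);
        }
    }

    /** Streaming access: decodes one code point at a time, front to back. */
    @Override public IntStream codePoints() {
        PrimitiveIterator.OfInt it = new PrimitiveIterator.OfInt() {
            private int p = offset;
            @Override public boolean hasNext() { return p < offset + byteLength; }
            @Override public int nextInt() {
                if (!hasNext()) throw new NoSuchElementException();
                int cp = codePointAt(p);
                p += width(bytes[p]);
                return cp;
            }
        };
        return StreamSupport.intStream(
            Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false);
    }

    /** Length in UTF-16 code units; O(n) because the bytes must be scanned. */
    @Override public int length() {
        int n = 0;
        for (int p = offset, end = offset + byteLength; p < end; p += width(bytes[p])) {
            n += (width(bytes[p]) == 4) ? 2 : 1;  // 4-byte forms need a surrogate pair
        }
        return n;
    }

    /** UTF-16 code unit at the given index; O(n) scan from the start of the slice. */
    @Override public char charAt(int index) {
        if (index < 0) throw new IndexOutOfBoundsException();
        int i = 0;
        for (int p = offset, end = offset + byteLength; p < end; p += width(bytes[p])) {
            int cp = codePointAt(p);
            int n = Character.charCount(cp);
            if (index < i + n) {
                if (n == 1) return (char) cp;
                return index == i ? Character.highSurrogate(cp) : Character.lowSurrogate(cp);
            }
            i += n;
        }
        throw new IndexOutOfBoundsException();
    }

    /** Simplification for the sketch: decodes to a String (copying) and slices that. */
    @Override public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end);
    }

    @Override public String toString() {
        return new String(bytes, offset, byteLength, StandardCharsets.UTF_8);
    }
}
```

A caller would wrap an existing buffer with something like new Utf8Sequence(buf, off, len) and hand it to any API that accepts a CharSequence. A fuller prototype would presumably validate the input and accept ByteBuffer slices as well; both are left out here to keep the sketch short.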