On 2017-12-05 00:46, Martin Buchholz wrote:
Thanks, Claes.
I think we're in agreement!
Great!
I did that shared String optimization for StringJoiner. It's a lot
harder to justify in the new String world because we have to handle
non-ASCII, and the non-ASCII case is actually fairly likely.
Let's sleep on it, at least. :-)
Does it make sense to keep COMPACT_STRINGS as an option in openjdk?
If essentially every user runs with that flag turned on, bit rot is
likely to set in, and even if not, we have to remain ever vigilant to
not break anyone turning it off.
I think it was only kept as a safeguard in case anyone would run into
serious trouble with the implementation (unlikely), so it might be
reasonable to deprecate the flag in an upcoming release. Maybe survey if
anyone ever had to opt-out of it first.
The code in Long.fastUUID is indeed ugly. I've never heard of UUID
creation being a bottleneck. At Google it sometimes seems all our
java performance problems are with zip file manipulation.
I see Steven already said Hi. I helped him get this optimization in, and
it did look a bit nicer originally. :-)
I might have avoided the code duplication in Long.fastUUID by:
- build the all-ASCII byte[] unconditionally
- if COMPACT_STRINGS, then use the secret LATIN1 constructor, else use
the public constructor(byte[], ISO_8859_1)
I think the implementors felt it would be unfair to penalize those who
had to opt-out of it for some unlikely reason.
One idea for a public performant API is to add something like
Charset.toString(byte[]) or Charset.toString(byte[], int start, int
length) via default method on Charset. Then; override that
in ISO_8859_1 to do only a single copy (and can we cheat somehow to do
no copies?)
I think these would need to clone the byte[] most of the time, so I
think it'd not be that much different from new String(byte[],
ISO_8859_1). The variant with Charset.toString(byte[], int, int) could
optimize away one copy when start > 0 or length < bytes.length, but this
seems like a corner case gain that doesn't really motivate adding two
new methods to every Charset class.
I guess a SharedStringBuilder (InlineStringBuilder?) could be made
public if we used $suitable_concurrency_trick to ensure the toString()
is callable at most once and that once it's been called the byte[] can't
be written to again. Maybe a ThreadLocal-based implementation could
suffice if we disallow nested InlineStringBuilder use (shouldn't be too
harsh of a limitation).
/Claes
On Mon, Dec 4, 2017 at 1:47 PM, Claes Redestad
<claes.redes...@oracle.com <mailto:claes.redes...@oracle.com>> wrote:
Hi Martin,
On 2017-12-04 22:06, Martin Buchholz wrote:
I'm rather sad about what happened to our non-copying String
constructions for trusted code. This is a performance
regression with the change in String representation that
should have blocked that change IMO. I think we should have a
plan for moving in the opposite direction. I don't think we
can implement something as ambitious as Rust's ownership
tracking, so have to restrict ourselves to trusted code. The
use case that keeps coming up is constructing zip entry names,
which are much more likely to be pure ASCII than their file
contents.
I don't have a good design for how one could do that, and who
the trusted set of callers is (at least java.base, not
java.lang), but I think we should set a direction.
as I alluded to in a footnote there exists a non-copying
String(byte[] value, byte coder) constructor - the problem is that
it's somewhat cumbersome to use:
- first off, the caller needs to be aware about the value of
String.COMPACT_STRINGS: if false, all strings needs to be UTF-16
encoded and the coder byte always set to String.UTF16
- secondly, the caller needs to know if the byte[] you're
constructing needs to be LATIN-1 or UTF-16 encoded up front and
act accordingly
Some of the more performance sensitive uses outside of java.lang
was addressed by the Compact Strings update, for example the
implementation backing java.util.UUID was somewhat surprisingly
moved into java.lang.Long::fastUUID[1]. Something similar is
doable for the java.sql types, but further complicated by those
classes being in a different module, and ultimately questionable
since their implementations in JDK 9 are quite a bit more
performant than in any previous release (thus not technically a
regression).
That leaves StringJoiner as the one case that stands out. And the
fact that existing uses of String(byte[], byte) are a bit of an
eye-sore[1!!1!].
One idea I'm tinkering with here is to have a trusted,
package-private SharedStringBuilder added to java.lang and exposed
via SharedSecrets. It'd more or less mimic StringBuilder
(including deal with inflating the byte[] when necessary,
encapsulate the awkward String.COMPACT_STRINGS checks etc) but
enable calling String(byte[], byte) in the toString() call. To be
effective it'll only have a single constructor taking the
capacity, and should probably throw IOOBE rather than resize the
internal buffer. Some cases like Long::fastUUID could probably be
much simplified by using such a builder instead (for a very
minimal overhead). Does that sound reasonable? At any rate I think
of this as a possible follow-up RFE, and not an alternative to the
cleanup/"bugfix" at hand.
Thanks!
/Claes
[1]
http://hg.openjdk.java.net/jdk/jdk/file/532cdc178e42/src/java.base/share/classes/java/lang/Long.java#l427
<http://hg.openjdk.java.net/jdk/jdk/file/532cdc178e42/src/java.base/share/classes/java/lang/Long.java#l427>