Re: FYC : JDK-7197183 : Improve copying behaviour of String.subSequence()

Peter Levart Tue, 19 Feb 2013 01:29:24 -0800

Hi Mike,

Regarding the implementation details: I think it would be better toreference just the String's value[] array in the String.SubSequenceinstead of the String instance itself. Two reasons come to mind:


- eliminate another indirection when accessing the array

- when referencing the String instance, code like this would leakimplementation details (again):


String s = new String("...");
CharSequence cs = s.subSequence(0, s.length()-1);
WeakReference<String> wr = new WeakReference<>(s);
// ... if we keep a reference to cs, wr will never be cleared...

Regarding the strategy: It's very unfortunate that code exists that usesString.subSequence result in a way treating it as an object with valuesemantics for equals() and hashCode() despite the specification forCharSequence clearly stating:

/"This interface does not refine the general contracts of the //|equals|<http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#equals%28java.lang.Object%29>//and//|hashCode|<http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode%28%29>//methods.The result of comparing two objects that implement //CharSequence//istherefore, in general, undefined. Each object may be implemented by adifferent class, and there is no guarantee that each class will becapable of testing its instances for equality with those of the other.It is therefore inappropriate to use arbitrary //CharSequence//instancesas elements in a set or as keys in a map."/


The excuse for such usage is the specification of String.subSequence method:

/"Returns a new character sequence that is a subsequence of thissequence. /


/An invocation of this method of the form /

   /  str.subSequence(begin, end)/

/behaves in exactly the same way as the invocation /

   /  str.substring(begin, end)/

/This method is defined so that the //String//class can implement the//|CharSequence|<http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html>//interface."/

So this proposal actually breaks this specification. It tries to breakit in a way that would keep some of the usages still behave like before,but it can't entirely succeed. The following code is valid now, but willthrow CCE with this proposal:


String s = ...;
String s0 = (String) s.subSequence(0, 1);

One can imagine other scenarios that would break (like customcomparators casting to String, etc.).

So to be entirely compatible, an escape-hatch would have to be available(in the form of a System property for example) to restore the oldbehaviour if requested.

If there is an escape-hatch, the question arises: Why not breaking theString.subSequence entirely? We could brake it so that the method wouldclearly specify the returned CharSequence behaviour regarding underlyingvalue[] array sharing, hashCode(), equals(), subSequence() andComparable only in terms of CharSequence instances returned from themethod and not cross String <-> String.SubSequence.

This would encourage migration of behaviourally incompatible code toforms that are compatible with both pre JDK8 and post JDK8String.subSequence specification instead of keeping the status quo.


There's also a third option - adding new method (back-porting it to JDK7):

public String.SubSequence stringSubSequence(int, int);

...and keeping subSequence as is.

Regards, Peter

On 02/19/2013 06:27 AM, Mike Duigou wrote:

Hello all;

JDK 7u6 included a significant change to java.lang.String. The change was
internal to the String implementation and didn't result in any API changes but
it does have a significant impact on the performance of some uses cases. (See
http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html for
an earlier discussion)

Prior to 7u6 String maintained two fields "count" and "offset" which described the location within
the character array of the String's characters. In 7u6 the count and offset fields were removed and String instances no
longer share their character arrays with other String instances. Each substring(), subSequence() or split() now creates
entirely new String instances for it's results. Before 7u6 two or more instances of String could share the same backing
character array. A number of String operations; clone(), split(), subString(), subSequence() did not copy the backing
character array but created new String instances pointing at their character array with appropriate "count"
and "offset" values.

As with most sharing techniques, there are tradeoffs. The "count/offset" approach works
reasonably well for cases where the substrings have a shorter lifetime than the original.
Frequently it was found though that the large character arrays "leaked" through small
Strings derived from larger String instances. Examples would be tokens parsed with substring() from
HTTP headers being used as keys in Maps. This caused the entire header character array from the
original header String to not be garbage collected (or worse the entire set of headers for a
request).

Our benchmarking and application testing prior to changing the String
implementation in 7u6 showed that it was a net benefit to not share the
character array instances. The benchmarking and performance testing for this
change actually began in 2007 and was very extensive. Benchmarking and
performance analysis since the release of 7u6 continues to indicate that
removal of sharing is the better approach. It is extremely unlikely that we
would consider returning to the pre-7u6 implementation (in case you were
thinking of suggesting that).

There are some cases where the new approach can result in significant performance penalties. This
is a "For Your Consideration" review not a pre-push changeset. The review changeset is a
weakening of the "never share the String character array" rule and it means that it would
suffer from exactly the same weakness. Few applications currently use subSequence() most currently
use subString(). Applications which would benefit from this change would have to switch to using
subSequence(). Apps can safely switch to subSequence in anticipation of this change because
currently subSequence() is identical to substring(). This means that should this changeset not be
integrated app code would suffer no penalty and if this change is eventually integrated then app
performance will improve.

http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/

From our current testing we found that applications currently using subSequence() failed
if the equals(), hashCode() and toString() implementations did not exactly match String.
Additionally we had to change String.equals() so that it recognizes can return
"true" for matching instances of String.SubSequence.

You will see some unfortunate potential usage patterns in the presented
implementation--most specifically, calling toString() on the result of a
String.subSequence() results in a new String instance being created (ouch!). I
would like to eliminate the caching of the hashCode() result but it appears
that it is frequently used and failing to cache the hash code results in
greatly decreased performance for the relevant applications. Currently TreeSet
and TreeMap which use natural order fail for data sets of mixed String and
String.SubSequence. I believe it is necessary for natural order sorting to work
for mixed collections of String and String.SubSequence instances.

Would this proposal cause your applications any problems? Is this proposal
absolutely necessary for your application to have adequate performance? Have
you already made other accommodations for the altered performance behaviour of
Strings introduced in 7u6? Other thoughts?

Mike

Re: FYC : JDK-7197183 : Improve copying behaviour of String.subSequence()

Reply via email to