Re: FYC : JDK-7197183 : Improve copying behaviour of String.subSequence()

Remi Forax Tue, 19 Feb 2013 06:17:06 -0800

Hi Mike,
I agree with Peter,

a new method is better than changing subSequence to return a CharSequence implementation that tries to pass itself off as a String.


And as a guy who has written several parsers (for HTTP servers or not),

ByteBuffer/CharBuffer were introduced in 1.4 to write effective parsers in Java,

no one should try to write a Java parser using String buffers anymore.
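
As a rough illustration of the buffer-based style mentioned above, here is a minimal sketch that splits a "Name: value" header line into two CharBuffer views without copying any characters. The class name HeaderSplitter and its split method are hypothetical and not part of any JDK API; only the CharBuffer calls (duplicate, position, limit, get) are standard.

```java
import java.nio.CharBuffer;

final class HeaderSplitter {
    /**
     * Splits a "Name: value" line into two views that share the line's
     * storage. Returns null when the line contains no colon.
     * (Hypothetical helper, for illustration only.)
     */
    static CharBuffer[] split(CharBuffer line) {
        int limit = line.limit();
        for (int i = line.position(); i < limit; i++) {
            if (line.get(i) != ':') {   // absolute read, does not move the position
                continue;
            }
            // Name: everything from the current position up to the colon.
            CharBuffer name = line.duplicate();
            name.limit(i);
            // Value: skip the colon and any spaces that follow it.
            int j = i + 1;
            while (j < limit && line.get(j) == ' ') {
                j++;
            }
            CharBuffer value = line.duplicate();
            value.position(j);
            return new CharBuffer[] { name, value };
        }
        return null;
    }
}
```

Calling toString() on either view is the point at which characters are copied into a new String, so a parser can defer that cost until a token is actually kept.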

Rémi

On 02/19/2013 10:28 AM, Peter Levart wrote:

Hi Mike,
Regarding the implementation details: I think it would be better to reference just the String's value[] array in the String.SubSequence instead of the String instance itself. Two reasons come to mind:
- eliminate another indirection when accessing the array
- when referencing the String instance, code like this would leak implementation details (again):
String s = new String("...");
CharSequence cs = s.subSequence(0, s.length()-1);
WeakReference<String> wr = new WeakReference<>(s);
// ... if we keep a reference to cs, wr will never be cleared...
Regarding the strategy: It's very unfortunate that code exists that uses the String.subSequence result in a way that treats it as an object with value semantics for equals() and hashCode(), despite the specification for CharSequence clearly stating:
"This interface does not refine the general contracts of the equals <http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#equals%28java.lang.Object%29> and hashCode <http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode%28%29> methods. The result of comparing two objects that implement CharSequence is therefore, in general, undefined. Each object may be implemented by a different class, and there is no guarantee that each class will be capable of testing its instances for equality with those of the other. It is therefore inappropriate to use arbitrary CharSequence instances as elements in a set or as keys in a map."
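
The practical effect of that paragraph can be seen with two standard CharSequence implementations. The sketch below is illustrative only; the class CharSequenceEquality and its sameContent helper are hypothetical names, not JDK API.

```java
final class CharSequenceEquality {
    /** Content comparison that works for any two CharSequence implementations. */
    static boolean sameContent(CharSequence a, CharSequence b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** equals() across implementations: String only matches another String. */
    static boolean stringEqualsBuilder() {
        String s = "abc";
        StringBuilder sb = new StringBuilder("abc");
        return s.equals(sb);   // false, although the characters are identical
    }
}
```

Code that needs value semantics across implementations has to compare contents explicitly, as sameContent does, or convert both sides with toString() first.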
The excuse for such usage is the specification of the String.subSequence method:
"Returns a new character sequence that is a subsequence of this sequence.
An invocation of this method of the form

    str.subSequence(begin, end)

behaves in exactly the same way as the invocation

    str.substring(begin, end)

This method is defined so that the String class can implement the CharSequence <http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html> interface."
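
A small check of what that wording means in practice. This is a sketch assuming an OpenJDK-style implementation in which subSequence delegates to substring; the class name SubSequenceSpec is hypothetical.

```java
final class SubSequenceSpec {
    /**
     * Returns true when subSequence() yields a String that is equal to the
     * corresponding substring(), which is what the quoted specification implies.
     */
    static boolean matchesSubstring(String s, int begin, int end) {
        CharSequence cs = s.subSequence(begin, end);
        return cs instanceof String && cs.equals(s.substring(begin, end));
    }
}
```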
So this proposal actually breaks this specification. It tries to break it in a way that would keep some of the usages behaving as before, but it can't entirely succeed. The following code is valid now, but will throw a ClassCastException (CCE) with this proposal:
String s = ...;
String s0 = (String) s.subSequence(0, 1);
One can imagine other scenarios that would break (like custom comparators casting to String, etc.).
So to be entirely compatible, an escape-hatch would have to be available (in the form of a System property for example) to restore the old behaviour if requested.
If there is an escape-hatch, the question arises: Why not break String.subSequence entirely? We could break it so that the method would clearly specify the returned CharSequence's behaviour regarding underlying value[] array sharing, hashCode(), equals(), subSequence() and Comparable only in terms of CharSequence instances returned from the method and not cross String <-> String.SubSequence.
This would encourage migration of behaviourally incompatible code to forms that are compatible with both the pre-JDK8 and post-JDK8 String.subSequence specification instead of keeping the status quo.
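
One way such a migration might look, sketched under the assumption that subSequence could one day return something other than a String: avoid the cast and compare by content. The class PortableSubSequenceUse and its methods are hypothetical; CharBuffer stands in here for any non-String CharSequence.

```java
import java.nio.CharBuffer;

final class PortableSubSequenceUse {
    /** Works whether or not subSequence() returns a String: no cast involved. */
    static String firstChar(String s) {
        return s.subSequence(0, 1).toString();
    }

    /** Content comparison that does not depend on the runtime class. */
    static boolean sameText(String expected, CharSequence actual) {
        return expected.contentEquals(actual);
    }

    /** Shows why the cast is fragile: any non-String CharSequence fails it. */
    static boolean castWouldFail(CharSequence cs) {
        try {
            String ignored = (String) cs;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    /** A non-String CharSequence used as a stand-in for a view type. */
    static CharSequence nonStringView(String s) {
        return CharBuffer.wrap(s);
    }
}
```

String.toString() returns the same instance, so the toString() form costs nothing on implementations where subSequence still returns a String.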
There's also a third option - adding a new method (back-porting it to JDK7):
public String.SubSequence stringSubSequence(int, int);

...and keeping subSequence as is.
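
To make the idea of a separate view type concrete, here is a minimal sketch of a sequence that references a char[] directly and defines equals(), hashCode() and ordering only among its own instances. It is not the proposed String.SubSequence: the class name CharArrayView is hypothetical, and it takes a caller-supplied char[] because String's internal array is not accessible from outside java.lang.

```java
final class CharArrayView implements CharSequence, Comparable<CharArrayView> {
    private final char[] value;   // shared with the caller, never copied by the view
    private final int offset;
    private final int count;

    CharArrayView(char[] value, int offset, int count) {
        if (offset < 0 || count < 0 || offset > value.length - count) {
            throw new IndexOutOfBoundsException();
        }
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    @Override public int length() { return count; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= count) throw new IndexOutOfBoundsException();
        return value[offset + index];
    }

    // Returns another view over the same array; no characters are copied.
    @Override public CharSequence subSequence(int begin, int end) {
        if (begin < 0 || end > count || begin > end) throw new IndexOutOfBoundsException();
        return new CharArrayView(value, offset + begin, end - begin);
    }

    // The only operation that copies characters.
    @Override public String toString() { return new String(value, offset, count); }

    // Equality is defined only against other views, never against String.
    @Override public boolean equals(Object o) {
        return o instanceof CharArrayView && compareTo((CharArrayView) o) == 0;
    }

    @Override public int hashCode() {
        int h = 0;
        for (int i = 0; i < count; i++) h = 31 * h + value[offset + i];
        return h;
    }

    @Override public int compareTo(CharArrayView other) {
        int n = Math.min(count, other.count);
        for (int i = 0; i < n; i++) {
            int d = value[offset + i] - other.value[other.offset + i];
            if (d != 0) return d;
        }
        return count - other.count;
    }
}
```

Because equals() never returns true for a String argument, String.equals() would not need to change to stay symmetric with it.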

Regards, Peter

On 02/19/2013 06:27 AM, Mike Duigou wrote:
Hello all;
JDK 7u6 included a significant change to java.lang.String. The change was internal to the String implementation and didn't result in any API changes but it does have a significant impact on the performance of some use cases. (See http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html for an earlier discussion.)
Prior to 7u6 String maintained two fields, "count" and "offset", which described the location within the character array of the String's characters. In 7u6 the count and offset fields were removed and String instances no longer share their character arrays with other String instances. Each substring(), subSequence() or split() now creates entirely new String instances for its results. Before 7u6 two or more instances of String could share the same backing character array. A number of String operations (clone(), split(), substring(), subSequence()) did not copy the backing character array but created new String instances pointing at the same character array with appropriate "count" and "offset" values.
As with most sharing techniques, there are tradeoffs. The "count/offset" approach works reasonably well for cases where the substrings have a shorter lifetime than the original. Frequently, though, it was found that the large character arrays "leaked" through small Strings derived from larger String instances. Examples would be tokens parsed with substring() from HTTP headers being used as keys in Maps. This caused the entire header character array from the original header String to not be garbage collected (or worse, the entire set of headers for a request).
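
The tradeoff can be illustrated with CharBuffer, which still offers offset/length views over a shared array. This is only an analogy to the pre-7u6 String layout, not a reproduction of it; the class SharingTradeoff and its methods are hypothetical.

```java
import java.nio.CharBuffer;

final class SharingTradeoff {
    /** A small token that shares the large header array (no copy). */
    static CharBuffer sharedToken(char[] header, int offset, int count) {
        return CharBuffer.wrap(header, offset, count);
    }

    /** Number of chars kept reachable by holding on to the token. */
    static int retainedChars(CharBuffer token) {
        return token.array().length;
    }

    /** Copies only the token's characters, releasing the link to the header. */
    static String detachedToken(CharBuffer token) {
        return token.toString();
    }
}
```

A four-character token taken from a 10,000-character header keeps all 10,000 characters reachable for as long as the shared view is used as a map key, whereas the detached String holds only its own four.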
Our benchmarking and application testing prior to changing the String implementation in 7u6 showed that it was a net benefit to not share the character array instances. The benchmarking and performance testing for this change actually began in 2007 and was very extensive. Benchmarking and performance analysis since the release of 7u6 continues to indicate that removal of sharing is the better approach. It is extremely unlikely that we would consider returning to the pre-7u6 implementation (in case you were thinking of suggesting that).
There are some cases where the new approach can result in significant performance penalties. This is a "For Your Consideration" review, not a pre-push changeset. The review changeset is a weakening of the "never share the String character array" rule and it means that it would suffer from exactly the same weakness. Few applications currently use subSequence(); most currently use substring(). Applications which would benefit from this change would have to switch to using subSequence(). Apps can safely switch to subSequence() in anticipation of this change because currently subSequence() is identical to substring(). This means that should this changeset not be integrated, app code would suffer no penalty, and if this change is eventually integrated then app performance will improve.
http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/
From our current testing we found that applications currently using subSequence() failed if the equals(), hashCode() and toString() implementations did not exactly match String. Additionally we had to change String.equals() so that it can return "true" for matching instances of String.SubSequence.
You will see some unfortunate potential usage patterns in the presented implementation--most specifically, calling toString() on the result of a String.subSequence() results in a new String instance being created (ouch!). I would like to eliminate the caching of the hashCode() result but it appears that it is frequently used and failing to cache the hash code results in greatly decreased performance for the relevant applications. Currently TreeSet and TreeMap which use natural order fail for data sets of mixed String and String.SubSequence. I believe it is necessary for natural order sorting to work for mixed collections of String and String.SubSequence instances.
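
For comparison, mixed collections can already be ordered today by supplying an explicit comparator instead of relying on natural order. The sketch below uses CharBuffer as a stand-in for a non-String sequence; the class MixedOrdering and its members are hypothetical and not part of the changeset.

```java
import java.nio.CharBuffer;
import java.util.Comparator;
import java.util.TreeSet;

final class MixedOrdering {
    /** Char-by-char ordering that accepts any CharSequence implementation. */
    static final Comparator<CharSequence> LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = a.charAt(i) - b.charAt(i);
            if (d != 0) {
                return d;
            }
        }
        return a.length() - b.length();
    };

    /** Builds a sorted set holding both Strings and CharBuffer views. */
    static TreeSet<CharSequence> mixedSet() {
        TreeSet<CharSequence> set = new TreeSet<>(LEXICOGRAPHIC);
        set.add("pear");
        set.add(CharBuffer.wrap("apple"));
        set.add(CharBuffer.wrap("pear"));   // same content as "pear", so not added
        return set;
    }
}
```

Without the comparator, a TreeSet falls back to String.compareTo, whose argument must be a String, which is why mixed contents fail under natural order.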
Would this proposal cause your applications any problems? Is this proposal absolutely necessary for your application to have adequate performance? Have you already made other accommodations for the altered performance behaviour of Strings introduced in 7u6? Other thoughts?
Mike
