Hello all;

JDK 7u6 included a significant change to java.lang.String. The change was 
internal to the String implementation and didn't result in any API changes but 
it does have a significant impact on the performance of some uses cases. (See 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html for 
an earlier discussion)

Prior to 7u6 String maintained two fields "count" and "offset" which described 
the location within the character array of the String's characters. In 7u6 the 
count and offset fields were removed and String instances no longer share their 
character arrays with other String instances. Each substring(), subSequence() 
or split() now creates entirely new String instances for it's results. Before 
7u6 two or more instances of String could share the same backing character 
array. A number of String operations; clone(), split(), subString(), 
subSequence() did not copy the backing character array but created new String 
instances pointing at their character array with appropriate "count" and 
"offset" values.

As with most sharing techniques, there are tradeoffs. The "count/offset" 
approach works reasonably well for cases where the substrings have a shorter 
lifetime than the original. Frequently it was found though that the large 
character arrays "leaked" through small Strings derived from larger String 
instances. Examples would be tokens parsed with substring() from HTTP headers 
being used as keys in Maps. This caused the entire header character array from 
the original header String to not be garbage collected (or worse the entire set 
of headers for a request).

Our benchmarking and application testing prior to changing the String 
implementation in 7u6 showed that it was a net benefit to not share the 
character array instances. The benchmarking and performance testing for this 
change actually began in 2007 and was very extensive. Benchmarking and 
performance analysis since the release of 7u6 continues to indicate that 
removal of sharing is the better approach. It is extremely unlikely that we 
would consider returning to the pre-7u6 implementation (in case you were 
thinking of suggesting that).

There are some cases where the new approach can result in significant 
performance penalties. This is a "For Your Consideration" review not a pre-push 
changeset. The review changeset is a weakening of the "never share the String 
character array" rule and it means that it would suffer from exactly the same 
weakness. Few applications currently use subSequence() most currently use 
subString(). Applications which would benefit from this change would have to 
switch to using subSequence(). Apps can safely switch to subSequence in 
anticipation of this change because currently subSequence() is identical to 
substring(). This means that should this changeset not be integrated app code 
would suffer no penalty and if this change is eventually integrated then app 
performance will improve.

http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/

From our current testing we found that applications currently using 
subSequence() failed if the equals(), hashCode() and toString() implementations 
did not exactly match String. Additionally we had to change String.equals() so 
that it recognizes can return "true" for matching instances of 
String.SubSequence. 

You will see some unfortunate potential usage patterns in the presented 
implementation--most specifically, calling toString() on the result of a 
String.subSequence() results in a new String instance being created (ouch!). I 
would like to eliminate the caching of the hashCode() result but it appears 
that it is frequently used and failing to cache the hash code results in 
greatly decreased performance for the relevant applications. Currently TreeSet 
and TreeMap which use natural order fail for data sets of mixed String and 
String.SubSequence. I believe it is necessary for natural order sorting to work 
for mixed collections of String and String.SubSequence instances.

Would this proposal cause your applications any problems? Is this proposal 
absolutely necessary for your application to have adequate performance? Have 
you already made other accommodations for the altered performance behaviour of 
Strings introduced in 7u6? Other thoughts?

Mike

Reply via email to