Re: FYC : JDK-7197183 : Improve copying behaviour of String.subSequence()

Zhong Yu Tue, 19 Feb 2013 13:36:54 -0800

I think it's really easy for users to write a SubSequence by
themselves if they need to solve the problem; there is no need for
String to do it, and definitely not at the expense of breaching
previous explicit contracts. We should solve this problem externally
and explicitly, not internally with some surprising plumbing.
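
For illustration, here is a minimal sketch of what such a user-written
view might look like. The class name SubSequenceView is hypothetical; it
is not part of the JDK or of the patch under review, and it only assumes
the standard CharSequence interface.

```java
// Hypothetical helper, not part of the JDK or of the proposed patch.
// It shares the source's characters instead of copying them.
final class SubSequenceView implements CharSequence {
    private final CharSequence source;
    private final int start; // inclusive, relative to source
    private final int end;   // exclusive, relative to source

    SubSequenceView(CharSequence source, int start, int end) {
        if (start < 0 || end > source.length() || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        this.source = source;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return source.charAt(start + index);
    }

    // Returns another view over the same source; nothing is copied.
    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException("from=" + from + ", to=" + to);
        }
        return new SubSequenceView(source, start + from, start + to);
    }

    // The only place where characters are copied, and only when asked.
    @Override
    public String toString() {
        return new StringBuilder(length()).append(source, start, end).toString();
    }
}
```

Note that the view keeps the whole source reachable, which is exactly the
tradeoff the caller opts into by choosing it.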


The patch may not be very helpful anyway. It is unlikely that an
application can simply switch from using sub-String to using
sub-CharSequence, since CharSequence is not nearly as convenient as
String - CharSequence has no contract on hashCode/equals, and no
operations like indexOf() etc. CharSequence is still not widely used,
not even in the standard lib - among java.** classes, besides
subclasses of CharSequence/Appendable, there are only 4 classes with
public methods accepting CharSequence args:  Normalizer, Matcher,
Pattern, CharsetEncoder.

A new top level public class that mimics String is probably going to
be helpful; this class may not be worthy of being included in
java.lang; it could be provided by 3rd parties. The class should

1. be able to wrap any CharSequence and char[] source
2. reference the source without copying, till explicitly asked to
3. though not immutable, remain constant if source remains constant
4. have well defined hashCode/equals
5. have convenience operations like indexOf()
6. include popular methods from various StringUtils
7. have well defined space/time cost on all methods

Let me call it "Strand". If anyone has a problem with the change to the
String implementation, we can suggest wrapping the string in a Strand
and operating on the strand instead.
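
To make the list above concrete, here is a rough sketch of how a Strand
could start out. Everything in it is an assumption for illustration: the
factory name of(), the copy() and indexOf(char) methods, and the choice
to reuse String's documented hash formula are mine, not an existing API.
It covers points 1 to 5 and leaves out the StringUtils-style methods.

```java
import java.nio.CharBuffer;

// Hypothetical sketch only: no such class exists in the JDK.
final class Strand implements CharSequence {
    private final CharSequence source; // referenced, never copied implicitly
    private final int offset;
    private final int length;

    private Strand(CharSequence source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    /** Wraps any CharSequence without copying. O(1). */
    static Strand of(CharSequence source) {
        return new Strand(source, 0, source.length());
    }

    /** Wraps a char[] without copying; later writes to the array show through. O(1). */
    static Strand of(char[] chars) {
        return of(CharBuffer.wrap(chars));
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        checkRange(index, index + 1);
        return source.charAt(offset + index);
    }

    /** Shares the source. O(1) time and space. */
    @Override public Strand subSequence(int start, int end) {
        checkRange(start, end);
        return new Strand(source, offset + start, end - start);
    }

    /** Explicit copy that detaches from the source. O(n). */
    Strand copy() { return of(toString()); }

    /** Index of the first occurrence of c, or -1 if absent. O(n). */
    int indexOf(char c) {
        for (int i = 0; i < length; i++) {
            if (source.charAt(offset + i) == c) return i;
        }
        return -1;
    }

    /** Same formula as String.hashCode(); not cached, since the source may change. O(n). */
    @Override public int hashCode() {
        int h = 0;
        for (int i = 0; i < length; i++) h = 31 * h + source.charAt(offset + i);
        return h;
    }

    /** Content equality, defined between Strands only. O(n). */
    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Strand)) return false;
        Strand other = (Strand) o;
        if (other.length != length) return false;
        for (int i = 0; i < length; i++) {
            if (charAt(i) != other.charAt(i)) return false;
        }
        return true;
    }

    /** Copies the characters into a new String. O(n). */
    @Override public String toString() {
        return new StringBuilder(length).append(source, offset, offset + length).toString();
    }

    private void checkRange(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
    }
}
```

With this, Strand.of(header).subSequence(0, 4) shares the header's
characters, and calling copy() on the result is the explicit step that
lets the large source be collected.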

Zhong Yu

On Mon, Feb 18, 2013 at 11:27 PM, Mike Duigou <mike.dui...@oracle.com> wrote:
> Hello all;
>
> JDK 7u6 included a significant change to java.lang.String. The change was 
> internal to the String implementation and didn't result in any API changes 
> but it does have a significant impact on the performance of some use cases. 
> (See 
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html 
> for an earlier discussion)
>
> Prior to 7u6 String maintained two fields "count" and "offset" which 
> described the location within the character array of the String's characters. 
> In 7u6 the count and offset fields were removed and String instances no 
> longer share their character arrays with other String instances. Each 
> substring(), subSequence() or split() now creates entirely new String 
> instances for its results. Before 7u6 two or more instances of String could 
> share the same backing character array. A number of String operations 
> (clone(), split(), substring(), subSequence()) did not copy the backing 
> character array but created new String instances pointing at the same 
> character array with appropriate "count" and "offset" values.
>
> As with most sharing techniques, there are tradeoffs. The "count/offset" 
> approach works reasonably well for cases where the substrings have a shorter 
> lifetime than the original. Frequently it was found though that the large 
> character arrays "leaked" through small Strings derived from larger String 
> instances. Examples would be tokens parsed with substring() from HTTP headers 
> being used as keys in Maps. This caused the entire header character array 
> from the original header String to not be garbage collected (or worse the 
> entire set of headers for a request).
>
> Our benchmarking and application testing prior to changing the String 
> implementation in 7u6 showed that it was a net benefit to not share the 
> character array instances. The benchmarking and performance testing for this 
> change actually began in 2007 and was very extensive. Benchmarking and 
> performance analysis since the release of 7u6 continues to indicate that 
> removal of sharing is the better approach. It is extremely unlikely that we 
> would consider returning to the pre-7u6 implementation (in case you were 
> thinking of suggesting that).
>
> There are some cases where the new approach can result in significant 
> performance penalties. This is a "For Your Consideration" review not a 
> pre-push changeset. The review changeset is a weakening of the "never share 
> the String character array" rule and it means that it would suffer from 
> exactly the same weakness. Few applications currently use subSequence(); most 
> currently use substring(). Applications which would benefit from this change 
> would have to switch to using subSequence(). Apps can safely switch to 
> subSequence in anticipation of this change because currently subSequence() is 
> identical to substring(). This means that, should this changeset not be 
> integrated, app code would suffer no penalty, and if this change is eventually 
> integrated, then app performance will improve.
>
> http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/
>
> From our current testing we found that applications currently using 
> subSequence() failed if the equals(), hashCode() and toString() 
> implementations did not exactly match String. Additionally we had to change 
> String.equals() so that it can return "true" for matching 
> instances of String.SubSequence.
>
> You will see some unfortunate potential usage patterns in the presented 
> implementation--most specifically, calling toString() on the result of a 
> String.subSequence() results in a new String instance being created (ouch!). 
> I would like to eliminate the caching of the hashCode() result but it appears 
> that it is frequently used and failing to cache the hash code results in 
> greatly decreased performance for the relevant applications. Currently 
> TreeSet and TreeMap which use natural order fail for data sets of mixed 
> String and String.SubSequence. I believe it is necessary for natural order 
> sorting to work for mixed collections of String and String.SubSequence 
> instances.
>
> Would this proposal cause your applications any problems? Is this proposal 
> absolutely necessary for your application to have adequate performance? Have 
> you already made other accommodations for the altered performance behaviour 
> of Strings introduced in 7u6? Other thoughts?
>
> Mike
