I think it's really easy for users to write a SubSequence themselves if they need to solve the problem; there is no need for String to do it, and definitely not at the expense of breaching previous explicit contracts. We should solve this problem externally and explicitly, not internally with surprising plumbing.
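To make that concrete, here is a minimal sketch of what such a user-written view might look like. The class name SubSequenceView is hypothetical (it is not part of the patch or the JDK), and it deliberately defines no equals/hashCode, since CharSequence has no contract for them.

```java
/** Hypothetical zero-copy view over a range of any CharSequence. */
final class SubSequenceView implements CharSequence {
    private final CharSequence source; // shared, never copied
    private final int start;           // inclusive
    private final int end;             // exclusive

    SubSequenceView(CharSequence source, int start, int end) {
        if (start < 0 || end > source.length() || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        this.source = source;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return source.charAt(start + index);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException("from=" + from + ", to=" + to);
        }
        // Still a view on the same source; no characters are copied.
        return new SubSequenceView(source, start + from, start + to);
    }

    @Override
    public String toString() {
        // The only place where characters are copied, and only on request.
        return new StringBuilder(length()).append(source, start, end).toString();
    }
}
```

The view keeps the whole source reachable for as long as the view lives, but here the caller has chosen that tradeoff explicitly.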
The patch may not be very helpful anyway. It is unlikely that an application can simply switch from using sub-String to using sub-CharSequence, since CharSequence is not nearly as convenient as String: it has no contract on hashCode/equals and no operations like indexOf(). CharSequence is still not widely used, not even in the standard library. Among java.** classes, besides subclasses of CharSequence/Appendable, there are only 4 classes with public methods accepting CharSequence args: Normalizer, Matcher, Pattern, CharsetEncoder.

A new top-level public class that mimics String would probably be helpful. This class may not be worthy of inclusion in java.lang; it could be provided by 3rd parties. The class should

1. be able to wrap any CharSequence and char[] source
2. reference the source without copying, until explicitly asked to
3. though not immutable, remain constant if the source remains constant
4. have well-defined hashCode/equals
5. have convenience operations like indexOf()
6. include popular methods from various StringUtils
7. have well-defined space/time cost on all methods

Let me call it "Strand". If anyone has a problem with the change to the String implementation, we can suggest wrapping the string in a Strand and operating on the strand instead.

Zhong Yu

On Mon, Feb 18, 2013 at 11:27 PM, Mike Duigou <mike.dui...@oracle.com> wrote:
> Hello all;
>
> JDK 7u6 included a significant change to java.lang.String. The change was
> internal to the String implementation and didn't result in any API changes
> but it does have a significant impact on the performance of some use cases.
> (See
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html
> for an earlier discussion)
>
> Prior to 7u6 String maintained two fields "count" and "offset" which
> described the location within the character array of the String's characters.
> In 7u6 the count and offset fields were removed and String instances no
> longer share their character arrays with other String instances. Each
> substring(), subSequence() or split() now creates entirely new String
> instances for its results. Before 7u6 two or more instances of String could
> share the same backing character array. A number of String operations
> (clone(), split(), substring(), subSequence()) did not copy the backing
> character array but created new String instances pointing at their character
> array with appropriate "count" and "offset" values.
>
> As with most sharing techniques, there are tradeoffs. The "count/offset"
> approach works reasonably well for cases where the substrings have a shorter
> lifetime than the original. Frequently it was found though that the large
> character arrays "leaked" through small Strings derived from larger String
> instances. Examples would be tokens parsed with substring() from HTTP headers
> being used as keys in Maps. This caused the entire header character array
> from the original header String to not be garbage collected (or worse the
> entire set of headers for a request).
>
> Our benchmarking and application testing prior to changing the String
> implementation in 7u6 showed that it was a net benefit to not share the
> character array instances. The benchmarking and performance testing for this
> change actually began in 2007 and was very extensive. Benchmarking and
> performance analysis since the release of 7u6 continues to indicate that
> removal of sharing is the better approach. It is extremely unlikely that we
> would consider returning to the pre-7u6 implementation (in case you were
> thinking of suggesting that).
>
> There are some cases where the new approach can result in significant
> performance penalties. This is a "For Your Consideration" review, not a
> pre-push changeset.
> The review changeset is a weakening of the "never share
> the String character array" rule and it means that it would suffer from
> exactly the same weakness. Few applications currently use subSequence(); most
> currently use substring(). Applications which would benefit from this change
> would have to switch to using subSequence(). Apps can safely switch to
> subSequence() in anticipation of this change because currently subSequence() is
> identical to substring(). This means that should this changeset not be
> integrated, app code would suffer no penalty, and if this change is eventually
> integrated then app performance will improve.
>
> http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/
>
> From our current testing we found that applications currently using
> subSequence() failed if the equals(), hashCode() and toString()
> implementations did not exactly match String. Additionally we had to change
> String.equals() so that it can return "true" for matching
> instances of String.SubSequence.
>
> You will see some unfortunate potential usage patterns in the presented
> implementation. Most specifically, calling toString() on the result of a
> String.subSequence() results in a new String instance being created (ouch!).
> I would like to eliminate the caching of the hashCode() result but it appears
> that it is frequently used and failing to cache the hash code results in
> greatly decreased performance for the relevant applications. Currently
> TreeSet and TreeMap which use natural order fail for data sets of mixed
> String and String.SubSequence. I believe it is necessary for natural order
> sorting to work for mixed collections of String and String.SubSequence
> instances.
>
> Would this proposal cause your applications any problems? Is this proposal
> absolutely necessary for your application to have adequate performance? Have
> you already made other accommodations for the altered performance behaviour
> of Strings introduced in 7u6? Other thoughts?
>
> Mike
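
For illustration, the "Strand" wish list in the reply at the top could start from something like the sketch below. Everything here is an assumption: no such class exists in the JDK, the names (Strand, of, indexOf) are hypothetical, and only points 1 to 5 of the list are touched. The hash is recomputed on every call because the wrapped source may change.

```java
/** Hypothetical sketch of the "Strand" idea; not an existing JDK or library class. */
final class Strand implements CharSequence {
    private final CharSequence source; // referenced, never copied
    private final int offset;
    private final int count;

    private Strand(CharSequence source, int offset, int count) {
        this.source = source;
        this.offset = offset;
        this.count = count;
    }

    /** Wraps any CharSequence without copying. */
    static Strand of(CharSequence source) {
        return new Strand(source, 0, source.length());
    }

    /** Wraps a char[] without copying; CharBuffer.wrap shares the array. */
    static Strand of(char[] chars) {
        return of(java.nio.CharBuffer.wrap(chars));
    }

    @Override
    public int length() {
        return count;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return source.charAt(offset + index);
    }

    @Override
    public Strand subSequence(int start, int end) {
        if (start < 0 || end > count || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        return new Strand(source, offset + start, end - start);
    }

    /** Returns the index of the first occurrence of c, or -1. O(length) time. */
    int indexOf(char c) {
        for (int i = 0; i < count; i++) {
            if (charAt(i) == c) {
                return i;
            }
        }
        return -1;
    }

    /** Content equality with other Strands only. */
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Strand)) return false;
        Strand other = (Strand) o;
        if (other.count != count) return false;
        for (int i = 0; i < count; i++) {
            if (charAt(i) != other.charAt(i)) return false;
        }
        return true;
    }

    /** Same formula as String.hashCode(); not cached, since the source may change. */
    @Override
    public int hashCode() {
        int h = 0;
        for (int i = 0; i < count; i++) {
            h = 31 * h + charAt(i);
        }
        return h;
    }

    /** Copies the characters into a new String; the only copying operation. */
    @Override
    public String toString() {
        return new StringBuilder(count).append(source, offset, offset + count).toString();
    }
}
```

A header token could then be taken as Strand.of(header).subSequence(0, 4) and used as a Map key, with the caller aware that the key keeps the whole header reachable.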