Hi Mike,
I agree with Peter,
a new method is better than changing subSequence to return a
CharSequence implementation that try to fake itself as a String.
And as a guys that have written several parsers (for HTTP servers or not),
ByteBuffer/CharBuffer was introduced in 1.4 to write effective parsers
in Java,
no one should try write a Java parser using String buffers anymore.
Rémi
On 02/19/2013 10:28 AM, Peter Levart wrote:
Hi Mike,
Regarding the implementation details: I think it would be better to
reference just the String's value[] array in the String.SubSequence
instead of the String instance itself. Two reasons come to mind:
- eliminate another indirection when accessing the array
- when referencing the String instance, code like this would leak
implementation details (again):
String s = new String("...");
CharSequence cs = s.subSequence(0, s.length()-1);
WeakReference<String> wr = new WeakReference<>(s);
// ... if we keep a reference to cs, wr will never be cleared...
Regarding the strategy: It's very unfortunate that code exists that
uses String.subSequence result in a way treating it as an object with
value semantics for equals() and hashCode() despite the specification
for CharSequence clearly stating:
/"This interface does not refine the general contracts of the
//|equals|
<http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#equals%28java.lang.Object%29>//and
//|hashCode|
<http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode%28%29>//methods.
The result of comparing two objects that implement //CharSequence//is
therefore, in general, undefined. Each object may be implemented by a
different class, and there is no guarantee that each class will be
capable of testing its instances for equality with those of the other.
It is therefore inappropriate to use arbitrary
//CharSequence//instances as elements in a set or as keys in a map."/
The excuse for such usage is the specification of String.subSequence
method:
/"Returns a new character sequence that is a subsequence of this
sequence. /
/An invocation of this method of the form /
/ str.subSequence(begin, end)/
/behaves in exactly the same way as the invocation /
/ str.substring(begin, end)/
/This method is defined so that the //String//class can implement the
//|CharSequence|
<http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html>//interface.
"/
So this proposal actually breaks this specification. It tries to break
it in a way that would keep some of the usages still behave like
before, but it can't entirely succeed. The following code is valid
now, but will throw CCE with this proposal:
String s = ...;
String s0 = (String) s.subSequence(0, 1);
One can imagine other scenarios that would break (like custom
comparators casting to String, etc.).
So to be entirely compatible, an escape-hatch would have to be
available (in the form of a System property for example) to restore
the old behaviour if requested.
If there is an escape-hatch, the question arises: Why not breaking the
String.subSequence entirely? We could brake it so that the method
would clearly specify the returned CharSequence behaviour regarding
underlying value[] array sharing, hashCode(), equals(), subSequence()
and Comparable only in terms of CharSequence instances returned from
the method and not cross String <-> String.SubSequence.
This would encourage migration of behaviourally incompatible code to
forms that are compatible with both pre JDK8 and post JDK8
String.subSequence specification instead of keeping the status quo.
There's also a third option - adding new method (back-porting it to
JDK7):
public String.SubSequence stringSubSequence(int, int);
...and keeping subSequence as is.
Regards, Peter
On 02/19/2013 06:27 AM, Mike Duigou wrote:
Hello all;
JDK 7u6 included a significant change to java.lang.String. The change
was internal to the String implementation and didn't result in any
API changes but it does have a significant impact on the performance
of some uses cases. (See
http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html
for an earlier discussion)
Prior to 7u6 String maintained two fields "count" and "offset" which
described the location within the character array of the String's
characters. In 7u6 the count and offset fields were removed and
String instances no longer share their character arrays with other
String instances. Each substring(), subSequence() or split() now
creates entirely new String instances for it's results. Before 7u6
two or more instances of String could share the same backing
character array. A number of String operations; clone(), split(),
subString(), subSequence() did not copy the backing character array
but created new String instances pointing at their character array
with appropriate "count" and "offset" values.
As with most sharing techniques, there are tradeoffs. The
"count/offset" approach works reasonably well for cases where the
substrings have a shorter lifetime than the original. Frequently it
was found though that the large character arrays "leaked" through
small Strings derived from larger String instances. Examples would be
tokens parsed with substring() from HTTP headers being used as keys
in Maps. This caused the entire header character array from the
original header String to not be garbage collected (or worse the
entire set of headers for a request).
Our benchmarking and application testing prior to changing the String
implementation in 7u6 showed that it was a net benefit to not share
the character array instances. The benchmarking and performance
testing for this change actually began in 2007 and was very
extensive. Benchmarking and performance analysis since the release of
7u6 continues to indicate that removal of sharing is the better
approach. It is extremely unlikely that we would consider returning
to the pre-7u6 implementation (in case you were thinking of
suggesting that).
There are some cases where the new approach can result in significant
performance penalties. This is a "For Your Consideration" review not
a pre-push changeset. The review changeset is a weakening of the
"never share the String character array" rule and it means that it
would suffer from exactly the same weakness. Few applications
currently use subSequence() most currently use subString().
Applications which would benefit from this change would have to
switch to using subSequence(). Apps can safely switch to subSequence
in anticipation of this change because currently subSequence() is
identical to substring(). This means that should this changeset not
be integrated app code would suffer no penalty and if this change is
eventually integrated then app performance will improve.
http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/
From our current testing we found that applications currently using
subSequence() failed if the equals(), hashCode() and toString()
implementations did not exactly match String. Additionally we had to
change String.equals() so that it recognizes can return "true" for
matching instances of String.SubSequence.
You will see some unfortunate potential usage patterns in the
presented implementation--most specifically, calling toString() on
the result of a String.subSequence() results in a new String instance
being created (ouch!). I would like to eliminate the caching of the
hashCode() result but it appears that it is frequently used and
failing to cache the hash code results in greatly decreased
performance for the relevant applications. Currently TreeSet and
TreeMap which use natural order fail for data sets of mixed String
and String.SubSequence. I believe it is necessary for natural order
sorting to work for mixed collections of String and
String.SubSequence instances.
Would this proposal cause your applications any problems? Is this
proposal absolutely necessary for your application to have adequate
performance? Have you already made other accommodations for the
altered performance behaviour of Strings introduced in 7u6? Other
thoughts?
Mike