> -----Original Message-----
> From: Jeremias Maerki [mailto:[EMAIL PROTECTED]
> Thanks to Finn and Andreas for looking at this. Given what I see here
> I'd say the two intern() calls in FOTreeBuilder would better be removed.
> The Set the namespaces are stored in works with equals() anyway, so I
> don't see the point of interning the Strings.

Not exactly... Maybe they had better be removed, but let's make sure it
isn't for the wrong reasons.

> Or do I miss anything here?

Could be. Ultimately, apart from the strings' lengths and the number of
identical copies that are going to be alive at a given point, this would
depend on:
a) the number of times the relevant portions of code --addElementMapping() +
conditional in findFOMaker()-- are actually executed
b) how many times those particular strings --URIs-- are going to be compared
to other potentially interned strings: see my addition of measurement
results with only one of the strings interned

To repaint the full picture:


static final String s1 = "some-string-value";

is the same as

static final String s1 = "some-string-value".intern();

All variables that are subsequently assigned the value of s1 will be
references pointing to the same canonical string.

String s2 = s1;

is exactly the same as:

String s2 = s1.intern();

(so: (s1 == s2) == true)

which is precisely why the following is considered Bad
(unless you need a really long string for a very brief moment, just once)

String s3 = new String("some-string-value");

(so: (s1 == s3) == false; s1.equals(s3) == true)
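
To make the three cases above concrete, here is a minimal sketch. The class
and method names (InternIdentityDemo, sameInstance) are mine, made up for
illustration only; nothing like this exists in FOP.

```java
final class InternIdentityDemo {
    // A literal: always the canonical (pooled) instance.
    static final String S1 = "some-string-value";

    // True only when both arguments are the very same object.
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String s2 = S1;                               // plain reference copy
        String s3 = new String("some-string-value"); // forces a new object

        System.out.println(sameInstance(S1, s2));          // true
        System.out.println(sameInstance(S1, s3));          // false
        System.out.println(S1.equals(s3));                 // true
        System.out.println(sameInstance(S1, s3.intern())); // true
    }
}
```

The last line shows what intern() buys you: it hands back the canonical
instance, so the reference comparison succeeds again.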

Effects similar to those of the latter statement --different String
instances with the same string value-- are inevitable when the Strings
originate from a file or database, or are built at runtime using
StringBuffer.toString() --anywhere the value isn't known at compile-time.
Hence the option of intern(): it returns the canonical instance for a given
value, so that a cheap reference comparison (==) can stand in for equals().
Keep in mind that intern() itself runs at run-time; only literals and
constant expressions are canonicalized automatically, and equals() already
begins with a reference check of its own.
One thing I noticed was that, once both strings to be compared were
guaranteed to be interned at compile-time it didn't even matter anymore
whether or not the values were the same, so '== || equals()' gave exactly
the same results as plain '=='. Apparently, the compiler could figure out
that in this situation, equals() would never really be needed, or would lead
to the same results anyway.
Strange though, that it optimizes only partly when both strings are assigned
the same hard-coded literal --in that case, when the string values are
different, the results of || would indicate the equals() side *is* evaluated
(maybe even the only side evaluated, because at compile-time the strings are
known to be different, but given the source-code, the compiler somehow
cannot exclude that they are going to be different at run-time...?)
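
For what it's worth, the difference between values known at compile-time
and values built at run-time can be sketched as below. This is not the code
I measured with; the names (CompileTimeVsRuntimeDemo, refOrEquals, and so
on) are invented for the example, and I use StringBuilder where older code
would use StringBuffer.

```java
final class CompileTimeVsRuntimeDemo {
    static final String PREFIX = "some-";            // constant variable
    static final String FULL = "some-string-value";  // literal, pooled

    // A constant expression: javac folds it into a single literal,
    // so it resolves to the same pooled instance as FULL.
    static String foldedAtCompileTime() {
        return PREFIX + "string-value";
    }

    // Built at run-time: toString() creates a fresh String object.
    static String builtAtRuntime(String suffix) {
        return new StringBuilder(PREFIX).append(suffix).toString();
    }

    // The '== || equals()' pattern: equals() is only reached
    // when the reference comparison fails.
    static boolean refOrEquals(String a, String b) {
        return a == b || a.equals(b);
    }

    public static void main(String[] args) {
        String folded = foldedAtCompileTime();
        String built = builtAtRuntime("string-value");

        System.out.println(folded == FULL);            // true
        System.out.println(built == FULL);             // false
        System.out.println(refOrEquals(built, FULL));  // true
        System.out.println(built.intern() == FULL);    // true
    }
}
```

So with two canonical instances the '==' side of the || decides everything
for equal values, and equals() only does real work when the references
differ.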

String.intern() should indeed be used with care. It's not a good idea to
intern a string at random, but if that happens only in a relatively small
number of situations, then it won't do much harm. If the call to intern() is
going to be made many times, you have to take into account that it
ultimately maps to a native (JNI) method.

To be absolutely sure, it would probably be wise to check for a threshold:
i.e. at what point does the overhead of interning really become a
drawback --we already know from the measurements that it's still worthwhile
to intern() once, if the number of subsequent comparisons is sufficiently
large (10^8).

AFAICT, the preferred option would be to introduce a layer --a pool of
interned strings-- in between, where you go:

if( !contains( string ) ) {
  add( string.intern() );
}
return get( string );

So, the use of equals() (via: contains()) on separate instances remains.
The local table is guaranteed to return a reference string every time, but
interning happens only the first time a given string value is encountered at
run-time. Once the string is added to the local table, the call to intern()
is avoided altogether and replaced with a faster get() --another drain
caused by intern() is precisely the lookup in a much larger global table to
check if an instance with that value already exists.
The only thing to remain aware of is that, by implementing such a map, one
might end up holding references to some string-values that are used only
once, keeping them from being garbage-collected.
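
A minimal sketch of such a local pool, assuming a plain HashMap is good
enough. LocalStringPool and canonical() are hypothetical names, not
anything that exists in FOTreeBuilder, and the sketch is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

final class LocalStringPool {
    // Maps each string value to its canonical (interned) instance.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the canonical instance; intern() is called only the
    // first time a given value is seen by this pool.
    String canonical(String s) {
        String cached = pool.get(s);   // lookup via hashCode()/equals()
        if (cached == null) {
            cached = s.intern();
            pool.put(cached, cached);
        }
        return cached;
    }

    // Number of distinct values held (and kept alive) by the pool.
    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        LocalStringPool uris = new LocalStringPool();
        String a = uris.canonical(new String("http://www.w3.org/1999/XSL/Format"));
        String b = uris.canonical(new String("http://www.w3.org/1999/XSL/Format"));
        System.out.println(a == b);       // true
        System.out.println(uris.size());  // 1
    }
}
```

Every caller that goes through canonical() gets the same reference back, so
'==' is safe among those results. The map holds strong references, which is
exactly the garbage-collection caveat mentioned above.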

The real fun begins when you create Strings as substrings of interned
strings.
You could build one large string containing, say all possible names in a
document. If you interned this string only once, and used String.substring()
to create individual names later on, then from the POV of the compiler, all
of them would be pointers into different parts of one and the same string,
each of the names and all of their copies taking up the space of an int no
matter how long they actually are. Even without knowing the exact value of
the canonical string at compile-time, the compiler still will be able to
generate much more efficient code in many places, at the 'cost' of only one
intern() at run-time.

Hope any of this is useful...


