On Tue, 2003-12-02 at 12:59, Finn Bock wrote:
> I'm resending this mail since it hasn't yet shown up in the archives. 
> I'm sorry about any duplicates.
> 
> [John Austin]
> 
> > 4) Changed the handling of strings at the for-loop storing the
> >    attributes received from the parser in startElement( ... )
> > 
> >         // Process attributes
> >         for (int i=0; i<atts.getLength(); i++) {
> >             DefaultMutableTreeNode attribute =
> >                 new DefaultMutableTreeNode(("Attribute (name = '" +
> >                                            atts.getLocalName(i) + 
> >                                            "', value = '" +
> >                                            atts.getValue(i) +
> > "')").intern() );
> > 
> >     So I intern these strings rather than storing new strings.
> 
> Here you are also interning the attribute values, right?

Yes.
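
(For clarity, here is roughly what interning attribute names and
values in a startElement() callback looks like; a hypothetical sketch,
not the actual SAXTreeValidator code.)

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Hypothetical sketch: intern each attribute name and value as it
    // arrives from the parser, so the tree keeps one canonical copy and
    // the parser's own String instances can be garbage collected.
    public class InterningHandler extends DefaultHandler {
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            for (int i = 0; i < atts.getLength(); i++) {
                String name  = atts.getLocalName(i).intern();
                String value = atts.getValue(i).intern();
                storeAttribute(name, value);  // hypothetical tree-building hook
            }
        }

        private void storeAttribute(String name, String value) {
            // ... add the attribute to the parsed tree
        }
    }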

> Interning is best used with discretion. The real problem, which isn't 
> really spelled out in the gotchas.html page, is that the interning 
> algorithm is completely undefined by the java spec.
> 
> F.ex, in jdk1.4 the intern table (in symbolTable.[hpp,cpp]) is a fixed 
> size hashing table (size of 20011) with chaining buckets. So when the 
> total number of intern'ed strings grows beyond that number, the interning 
> process becomes linear in time. Try the attached test program to see the 
> effect.

Gasp! SUN implemented something that might not scale up?

Note that my example from last night, the SAXTreeValidator.java run,
processed the 'defguide' file, which has 117 unique names and
13,520 unique values. The memory saved by interning was impressive (184M
down to 87M, a 97M reduction).

[My results may be artificially good, as I interned character strings
as well, and I expect that the nature of DEFGUIDE includes many repeated
character strings. I shall re-run the benchmark:

Hmm ... the interning version of the program now uses 104M, a reduction
of 80M rather than the previous reduction of 97M. The number of interned
strings must have been well over the hash table size, but the difference
in CPU usage to parse that file was less than ten seconds (more with the
character strings interned).]

When I checked your program more closely, I wondered how much chaining
there has to be before performance gets really bad. Your example program
would produce external chains of average length 2,000,000 / 20,011 ~ 100.
Because the table keys are constructed strings ("attribute" + i), I
expect the hash codes are spread uniformly across the buckets and the
average access times reflect that.

Your demonstration program artificially employs 2 million strings,
which is not behavior we would expect from FOP. The number of attribute
names is limited (by the XSL-FO spec), and the number of distinct values
is limited by probability distributions that are definitely not uniform.
It takes a LOT of ransom-note typography to make that many unique
property values.
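
(As a rough sanity check, assuming the fixed table of 20,011 buckets,
uniform hashing, and ignoring the interned character strings mentioned
above, the expected chain lengths work out as follows; a throwaway
sketch, not part of FOP.)

    // Back-of-the-envelope estimate of average chain length in a
    // fixed-size, chained intern table of 20,011 buckets.
    public class ChainEstimate {
        public static void main(String[] args) {
            final double buckets = 20011;
            // Finn's test program: 2,000,000 distinct interned strings.
            System.out.println("test program: " + 2000000 / buckets);       // ~ 100
            // defguide.fo: 117 unique names + 13,520 unique values.
            System.out.println("defguide.fo:  " + (117 + 13520) / buckets); // ~ 0.7
        }
    }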

> I also think that two different aspects of interning are being mixed 
> together in your measurements here.
> 
> 1. The identity sharing.
> 2. The memory sharing.

I have not changed the programs beyond calling intern() on strings
passed to some constructors, so I make no use of identity sharing.

> The identity sharing can quite possibly give a performance boost to the 
> lookup of the attribute names.

I prefer to optimize from measurements. The FOP high runner is property
lookup. I didn't see a lot of time spent in String.equals(); I may look
again.

Peter's alt-design already provides this functionality. I am chasing
parallel memory effects.

> The memory sharing can quite possibly give a memory boost for duplicated 
> attribute values and a performance boost during garbage collection.

I agree, especially in light of the counts I made using a Perl program
on the large file defguide.fo.

> Ad.1: If one decides that all attribute names must be interned before 
> insertion and also before lookup of attributes, a special hashtable 
> (identity-hashtable) can be coded that is significantly faster than a 
> normal value-based hashtable. As an example of the performance boost, 
> the hash calculation can be done as:

I would use that if I felt that intern() wasn't therapeutic. So far, I
just intern the name and value of each attribute passed to the SAX
parser callback startElement(). This stores the interned strings in the
parsed tree and lets the parser trash its copies of the strings whenever
it likes.

>     int index = (System.identityHashCode(key) & 0x7fffffff) % maxindex;
> 
> In addition, attribute names can be compared with '==' instead of equals().
> The downside is that the lookup can only be done with intern'ed keys.

But these values would be encapsulated in a class. Good reason for
not breaking encapsulation.
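
(For reference, a minimal sketch of such an identity-keyed table;
hypothetical class, and only correct if every key has been intern'ed
before both insertion and lookup.)

    // Hypothetical chained hash table keyed on object identity.
    // Only correct when every key has been intern'ed first, so that
    // equal names are guaranteed to be the same object.
    public class IdentityTable {
        private static class Entry {
            final String key;
            Object value;
            Entry next;

            Entry(String key, Object value, Entry next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        private final Entry[] buckets;

        public IdentityTable(int maxindex) {
            buckets = new Entry[maxindex];
        }

        private int indexOf(String key) {
            // The hash calculation quoted above.
            return (System.identityHashCode(key) & 0x7fffffff) % buckets.length;
        }

        public Object get(String key) {
            for (Entry e = buckets[indexOf(key)]; e != null; e = e.next) {
                if (e.key == key) {      // identity compare, not equals()
                    return e.value;
                }
            }
            return null;
        }

        public void put(String key, Object value) {
            int i = indexOf(key);
            for (Entry e = buckets[i]; e != null; e = e.next) {
                if (e.key == key) {
                    e.value = value;
                    return;
                }
            }
            buckets[i] = new Entry(key, value, buckets[i]);
        }
    }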

> Such an identity-hash table has been used by the jython project as a 
> significant performance boost in jython's attribute lookups.

I believe it.

> But, as Glenn noticed, the attribute names can also be implemented with 
> enumeration tokens.

PropNames class in Peter's alt-design.
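
(As a rough illustration of what that means, with hypothetical names
and layout rather than the actual PropNames class: each property name
is resolved once to a small integer token, after which comparisons are
'==' and lookups are array indexing.)

    // Hypothetical sketch of attribute names as enumeration tokens.
    public final class AttrTokens {
        public static final int FONT_FAMILY = 0;
        public static final int FONT_SIZE   = 1;
        public static final int COLOR       = 2;
        // ... one token per property defined by the XSL-FO spec

        private static final String[] NAMES = {
            "font-family", "font-size", "color"
        };

        public static String nameOf(int token) {
            return NAMES[token];   // array indexing, no hashing
        }

        private AttrTokens() { }
    }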

> Ad.2: The memory sharing can be implemented by using intern'ed strings, but 
> it is a very bad idea to do that. A normal java.util.Hashtable can be 
> used just as well for the implementation of memory sharing:

HashMap you mean.

> public String getSharedValue(String value) {
>     String ret = (String) hashTable.get(value);
>     if (ret != null)
>         return ret;
>     hashTable.put(value, value);
>     return value;
> }
> 
> The hashTable can then be cleared at any time, perhaps at the end of 
> each page sequence.

I agree that we could easily implement a separate interning facility.
I am just not quite convinced that it needs to be done in an
application.
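
(A minimal sketch of such a facility, along the lines of the
getSharedValue() method quoted above but using a HashMap and a clear()
hook; the class and method names are mine.)

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical application-level string pool: shares duplicate
    // values without touching the VM intern table, and can be cleared
    // at a convenient point (e.g. the end of a page sequence).
    public class StringPool {
        private final Map pool = new HashMap();

        public String share(String value) {
            String ret = (String) pool.get(value);
            if (ret != null) {
                return ret;
            }
            pool.put(value, value);
            return value;
        }

        public void clear() {
            pool.clear();
        }
    }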

This COULD give someone the idea that an object interning mechanism
could be implemented at the language level. There's an idea for a 
stand-alone product ... (or a JSR?)

> regards,
> finn
> 
> ______________________________________________________________________
> public class interntst {
>     public static void main(String[] args) throws Exception {
>         new interntst().test();
>     }
> 
>     public void test() {
>         int cnt = 2000000;

In my large test file, there are 117 property names and 13,520 property
values. A lot less than the hash table size of 20,011.

>         String[] arr = new String[cnt];
>         long now = System.currentTimeMillis();
>         for (int i = 0; i < cnt; i++) {
>             if ((i % 1000) == 0) {
>                 System.out.println(i + " " + (System.currentTimeMillis() - now));
>                 now = System.currentTimeMillis();
>             }
>             String s = ("attribute" + i).intern();
>             arr[i] = s;
>         }
>     }
> }
-- 
John Austin <[EMAIL PROTECTED]>
