I'm resending this mail since it hasn't yet shown up in the archives. I'm sorry about any duplicates.

[John Austin]

4) Changed the handling of strings at the for-loop storing the
   attributes received from the parser in startElement( ... )

// Process attributes
for (int i=0; i<atts.getLength(); i++) {
DefaultMutableTreeNode attribute =
new DefaultMutableTreeNode(("Attribute (name = '" +
atts.getLocalName(i) + "', value = '" +
atts.getValue(i) +
"')").intern() );


So I intern these strings rather than storing new strings.

Here you are also interning the attribute values, right?


Interning is best used with discretion. The real problem, which isn't really spelled out in the gotchas.html page, is that the interning algorithm is completely undefined by the java spec.

F.ex, in jdk1.4 the intern table (in symbolTable.[hpp,cpp]) is a fixed size hashing table (size of 20011) with chaining buckets. So when the total number of intern'ed string grows beyond that number, the interning process becomes linear in time. Try the attached test program to see the effect.



I also think that two different aspects of interning is being mixed around in your measurements here.

1. The identity sharing.
2. The memory sharing.

The identity sharing can quite possible give a performance boost to the lookup of the attribute names.

The memory sharing can quite possible give a memory boost for duplicated attribute values and a performance boost during garbage collection.


Ad.1: If one decide that all attribute names must be interned before insertion and also before lookup of attributes, a special hashtable (identity-hashtable) can be coded that is significantly faster than a normal value-based hashtable. As an example of the performance boost, the hash calculation can be done as:


int index = (System.identityHashCode(key) & 0x7fffffff) % maxindex;

In addition attribute names can be compared with '==' instead of equals.
The downside is that the lookup can only be done with intern'ed keys.

Such an identity-hash table has been used by the jython project as a significant performance boost in jython's attribute lookups.

But, as Glenn noticed, the attribute names can also be implemented with enumeration tokens.


Ad.2: The memory sharing can implemented by using intern'ed string but it is a very bad idea to do that. A normal java.util.Hashtable can be used just as well for the implementation of memory sharing:


public String getSharedValue(String value) {
   String ret = hashTable.get(value);
   if (ret != null)
       return ret;
   hashTable.put(value, value);
   return value;
}

The hashTable can then be cleared at any time, perhaps at the end of each page sequence.

regards,
finn

public class interntst {
    public static void main(String[] args) throws Exception {
        new interntst().test();
    }

    public void test() {
        int cnt = 2000000;
        String[] arr = new String[cnt];
        long now = System.currentTimeMillis();
        for (int i = 0; i < cnt; i++) {
            if ((i % 1000) == 0) {
                System.out.println(i + " " + (System.currentTimeMillis() - now));
                now = System.currentTimeMillis();
            }
            String s = ("attribute" + i).intern();
            arr[i] = s;
        }
    }
}

Reply via email to