On Tue, 2003-12-02 at 12:59, Finn Bock wrote: > I'm resending this mail since it hasn't yet shown up in the archives. > I'm sorry about any duplicates. > > [John Austin] > > > 4) Changed the handling of strings at the for-loop storing the > > attributes received from the parser in startElement( ... ) > > > > // Process attributes > > for (int i=0; i<atts.getLength(); i++) { > > DefaultMutableTreeNode attribute = > > new DefaultMutableTreeNode(("Attribute (name = '" + > > atts.getLocalName(i) + > > "', value = '" + > > atts.getValue(i) + > > "')").intern() ); > > > > So I intern these strings rather than storing new strings. > > Here you are also interning the attribute values, right?
Yes. > Interning is best used with discretion. The real problem, which isn't > really spelled out in the gotchas.html page, is that the interning > algorithm is completely undefined by the java spec. > > F.ex, in jdk1.4 the intern table (in symbolTable.[hpp,cpp]) is a fixed > size hashing table (size of 20011) with chaining buckets. So when the > total number of intern'ed string grows beyond that number, the interning > process becomes linear in time. Try the attached test program to see the > effect. Gasp! SUN implemented something might not scale up ? Note that my example of last night, the SAXTreeValidator.java results processed the 'defguide' file that has 117 unique names and 13,520 unique values. The memory saved by interning was impressive (184M down to 87M = 97M reduction). [My results may be artificially good as I have interned character strings as well. I expect that the nature of DEFGUIDE includes many repeated character strings. I shall re-run the benchmark: Hmm .. the interning version of the program now uses 104M, a reduction of 80M rather than the previous reduction of 97M. The number of interned strings must have been well over the hash table size, but the difference in CPU usage to parse that file was less than ten seconds (more for interned character strings).] I wondered how much chaining there has to be before performance gets really bad when I checked your program more closely. Your example program would produce external chains of length: 2000000/20011 ~ 100. Because the table keys are constructed, I expect your access times are uniformly distributed and average access times reflect that. Your demonstration program artificially employs 2 million strings which is not a behavior we would expect for FOP. The number of attribute names is limited (by the XSL-FO Spec) and the number of distinct values is limited by some probability distributions that are definitely not Uniform. Takes a LOT of ransom-note typography to make that many unique property values. > I also think that two different aspects of interning is being mixed > around in your measurements here. > > 1. The identity sharing. > 2. The memory sharing. I have not changed the programs beyond calling intern() on strings passed to some constructors. So no identity sharing is present. > The identity sharing can quite possible give a performance boost to the > lookup of the attribute names. I prefer to optimize from measurements. The FOP high runner is property lookup. I didn't see a lot of time in String.equals(), I may look again. Peter's alt-design all ready provides this functionality. I am chasing parallel memory effects. > The memory sharing can quite possible give a memory boost for duplicated > attribute values and a performance boost during garbage collection. I agree, esp in light of the counts I made using a Perl program and the large file defguide.fo. > Ad.1: If one decide that all attribute names must be interned before > insertion and also before lookup of attributes, a special hashtable > (identity-hashtable) can be coded that is significantly faster than a > normal value-based hashtable. As an example of the performance boost, > the hash calculation can be done as: I would use that if i felt that intern() wasn't theraputic. So far, I just intern the name and value for each attribute passed to the SAX parser callback: startElement. This stores the interned string in the parsed tree and lets the parser trash it's copies of strings whenever. > int index = (System.identityHashCode(key) & 0x7fffffff) % maxindex; > > In addition attribute names can be compared with '==' instead of equals. > The downside is that the lookup can only be done with intern'ed keys. But these values would be encapsulated in a class. Good reason for not breaking encapsulation. > Such an identity-hash table has been used by the jython project as a > significant performance boost in jython's attribute lookups. I believe it. > But, as Glenn noticed, the attribute names can also be implemented with > enumeration tokens. PropNames class in Peter's alt-design. > Ad.2: The memory sharing can implemented by using intern'ed string but > it is a very bad idea to do that. A normal java.util.Hashtable can be > used just as well for the implementation of memory sharing: HashMap you mean. > public String getSharedValue(String value) { > String ret = hashTable.get(value); > if (ret != null) > return ret; > hashTable.put(value, value); > return value; > } > > The hashTable can then be cleared at any time, perhaps at the end of > each page sequence. I agree with the idea that we could implement a separate internment facility easily. I am not quite convinced that it needs to be done in an application. This COULD give someone the idea that an object interning mechanism could be implemented at the language level. There's an idea for a stand-alone product ... (or a JSR?) > regards, > finn > > ______________________________________________________________________ > public class interntst { > public static void main(String[] args) throws Exception { > new interntst().test(); > } > > public void test() { > int cnt = 2000000; In my large test file, there are 117 property names and 13,520 property values. A lot less than the hash table size of 20,011. > String[] arr = new String[cnt]; > long now = System.currentTimeMillis(); > for (int i = 0; i < cnt; i++) { > if ((i % 1000) == 0) { > System.out.println(i + " " + (System.currentTimeMillis() - now)); > now = System.currentTimeMillis(); > } > String s = ("attribute" + i).intern(); > arr[i] = s; > } > } > } -- John Austin <[EMAIL PROTECTED]>