For my own elucidation... If you choose to atomize a char sequence, it incurs costs.
1. Time: you have to check whether the char sequence has previously been atomized. In our implementation, this involves a hash table lookup, which includes a hash computation.

2. Time: the GC has to deal with the atoms table.

3. Space: you need to store a pointer in the atoms table. In our implementation, this costs about 4 words (1 word for the JSAtom*, 1 word for the hash value, and hash tables are typically 50% full). If you intern the atom, this space cost is permanent.

Hopefully these costs are smaller than the potential savings:

1. Time: you can do O(1) equality comparisons (a pointer compare) rather than O(n) char-by-char comparisons.

2. Space: if the char sequence occurs N times, you avoid storing it N-1 times.

Lots of trade-offs there, which makes things interesting. Ideas for improvement:

- My suggestion of hashing only the first and last N chars of the sequence is aimed at cost 1. It would reduce the hash computation cost for long strings, at the risk of causing more collisions.

- A more compact data structure for the atoms table would reduce cost 3, but might worsen costs 1 and 2.

- If we can find places where we atomize unnecessarily (e.g. where we don't need pointer equality comparisons and duplicate char sequences are rare), that would reduce all three costs.

- Sharing the atoms table between JSRuntimes would help reduce cost 3.

- Jan suggested an atom cache for the Tokenizer (https://bugzilla.mozilla.org/show_bug.cgi?id=964253#c18), which would reduce cost 1.

I did some rough measurements yesterday after some basic browsing and saw that about 70% of our strings are atoms, which was higher than I expected.

Nick

_______________________________________________
dev-tech-js-engine-internals mailing list
dev-tech-js-engine-internals@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-tech-js-engine-internals