For my own elucidation...

Atomizing a char sequence incurs costs.

1. Time: you have to check whether the char sequence has been
previously atomized. In our implementation, this involves a hash table
lookup, which includes a hash computation. (See the sketch after this
list.)

2. Time: the GC has to deal with the atoms table.

3. Space: you need to store a pointer in the atoms table. In our
implementation, this costs about 4 words per entry: 1 word for the
JSAtom* plus 1 word for the hash value, doubled because hash tables
are typically only 50% full. If you intern the atom, this space cost
is permanent.
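To make cost 1 concrete, here is a minimal sketch of the lookup an
atomization does. This is not our actual code: Atomize, AtomHash, and
atomTable are made-up names, and std::unordered_set stands in for the
real SpiderMonkey structures.

    #include <cstddef>
    #include <string>
    #include <unordered_set>

    // Hashing touches every char, so it is O(n) in the string length
    // (cost 1). A simple multiplicative hash, for illustration only.
    struct AtomHash {
        size_t operator()(const std::string& s) const {
            size_t h = 0;
            for (char c : s)
                h = h * 31 + static_cast<unsigned char>(c);
            return h;
        }
    };

    // The table itself is what the GC has to deal with (cost 2) and
    // what occupies the extra words per entry (cost 3).
    std::unordered_set<std::string, AtomHash> atomTable;

    // Every atomization does a lookup, inserting on a miss. The
    // returned pointer is stable, so callers can compare atoms by
    // address afterwards.
    const std::string* Atomize(const std::string& chars) {
        auto result = atomTable.insert(chars);  // lookup + insert on miss
        return &*result.first;
    }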

Hopefully these costs are smaller than the potential savings.

1. Time: you can do O(1) equality comparisons, rather than O(n). (See
the sketch after this list.)

2. Space: if the char sequence occurs N times, you avoid storing it N-1 times.
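As a concrete illustration of saving 1 (again with made-up names; Atom
here stands in for JSAtom): once two strings are atomized, equality is
a single pointer compare, versus a char-by-char walk for plain strings.

    #include <cstring>

    struct Atom { const char* chars; };  // hypothetical stand-in

    // O(n): must walk the chars until a mismatch or the terminator.
    bool EqualChars(const char* a, const char* b) {
        return std::strcmp(a, b) == 0;
    }

    // O(1): atomization guarantees one canonical Atom per char
    // sequence, so comparing identities suffices.
    bool EqualAtoms(const Atom* a, const Atom* b) {
        return a == b;
    }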

Lots of trade-offs there, which makes things interesting.

Ideas for improvement:

- My suggestion about hashing only the first and last N chars in the
sequence is aimed at cost 1. It would reduce the hash computation cost
for long strings, at the risk of causing more collisions. (See the
sketch after this list.)

- If we used a more compact data structure for the atoms table, that
would reduce cost 3, but it may worsen costs 1 and 2.

- If we can find places where we are atomizing unnecessarily (e.g.
where we don't need pointer-equality comparisons and duplicate char
sequences are rare), that would reduce all three costs.

- Sharing the atoms table between JSRuntimes would help reduce cost 3.

- Jan suggested an atom cache for the Tokenizer
(https://bugzilla.mozilla.org/show_bug.cgi?id=964253#c18) which would
reduce cost 1.
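For the first idea, here is a sketch of what hashing only the first
and last N chars might look like. N = 16 and the mixing function are
arbitrary placeholders, not tuned choices.

    #include <cstddef>
    #include <string>

    size_t PartialHash(const std::string& s) {
        const size_t N = 16;   // placeholder; would need tuning
        size_t h = s.size();   // mix in the length to cut collisions
        if (s.size() <= 2 * N) {
            // Short string: hash all chars; no savings, but no new
            // collisions either.
            for (char c : s)
                h = h * 31 + static_cast<unsigned char>(c);
        } else {
            // Long string: hash only the first N and last N chars, so
            // the cost is O(N) rather than O(n). Strings that agree on
            // both ends and on length now collide, which is the risk
            // mentioned above.
            for (size_t i = 0; i < N; i++)
                h = h * 31 + static_cast<unsigned char>(s[i]);
            for (size_t i = s.size() - N; i < s.size(); i++)
                h = h * 31 + static_cast<unsigned char>(s[i]);
        }
        return h;
    }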

I did some rough measurements yesterday after some basic browsing and
saw that about 70% of our strings are atoms, which was higher than I
expected.

Nick