The existing Character.isLowerCase/etc and String.toLowerCase/toUpperCase do
not work for Unicode characters in general. This is not acceptable for a
fully internationalized app that cares about character attributes, such as a
word processor.
I have implemented a compression scheme which uses a combination of
run-length and Huffman encoding to make the tables as small as possible.
First, if the tables aren't used they will not be included (nor will the
decompression code), and any tables you don't use won't be included if you
do use any. The compressed tables are broken into 8k character blocks and
decompressed on the first reference to a given block, so typically only the
first block will be decompressed. The binary property tables (ie, isLetter,
isDigit, isLowerCase, etc) compress rather well, typically to a few hundred
bytes per table. Property tables, such as toLowerCase or getType, require
more space because the values take more space to store and cannot just be
assumed to alternate like the binary values after run compression. The
decompression code takes around 9k before compression. To further save
space, some of the tables are stored relative to some other table, such as
isJavaIdentifierStart is stored as the XOR of isLetterOrDigit.
This is all on changes/jat/ucd, which is a pre-1.5 branch. The compiler may
be better at inlining/etc since the fork, but I think that is likely to only
reduce the cost of adding this functionality and wanted to get feedback
before updating it to the current trunk.
To get an idea of the impact of adding this functionality, I added a small
bit of code to Showcase to call a Character.* method on the hash part of the
URL (just so the compiler would have no way to know what the value was), and
do a Window.alert on it (so the compiler couldn't discard it).
- Unmodified
- sum(bytes(*.cache.html) = 6,640,123
- sum(bytes(*.cache.html.gz) = 2,069,283
- avg(bytes per permutation, uncompressed) = 265,605
- avg(bytes per permutation, compressed) = 82,771
- avg(compression ratio) = 31.2% (lower is better)
- Old ASCII-only isLowerCase code added (along with Location and
Window.alert additions)
- sum(bytes(*.cache.html) = 6,648,933
- sum(bytes(*.cache.html.gz) = 2,071,623
- avg(bytes per permutation, uncompressed) = 265,957
- avg(bytes per permutation, compressed) = 82,865
- avg(compression ratio) = 31.2%
- New UCD isLowerCase (as above, with module entry to activate)
- sum(bytes(*.cache.html) = 6,883,614
- sum(bytes(*.cache.html.gz) = 2,181,641
- avg(bytes per permutation, uncompressed) = 275,345
- avg(bytes per permutation, compressed) = 87,266
- avg(compression ratio) = 31.7%
- avg(absolute size increase, uncompressed) = 9,388
- avg(absolute size increase, compressed) = 4,401
- avg(increase in uncompressed size over ASCII-only) = 3.5%
- avg(increase in compressed size over ASCII-only) = 5.3%
- ASCII-only getType code added (along with Location and Window.alert
additions)
- sum(bytes(*.cache.html) =6,656,933
- sum(bytes(*.cache.html.gz) = 2,074,279
- avg(bytes per permutation, uncompressed) = 266,277
- avg(bytes per permutation, compressed) = 82,971
- avg(compression ratio) = 31.2%
- New UCD getType (as above, with module entry to activate)
- sum(bytes(*.cache.html) = 6,969,939
- sum(bytes(*.cache.html.gz) = 2,238,000
- avg(bytes per permutation, uncompressed) = 278,798
- avg(bytes per permutation, compressed) = 89,520
- avg(compression ratio) = 32.1%
- avg(absolute size increase, uncompressed) = 12,521
- avg(absolute size increase, compressed) = 6,549
- avg(increase in uncompressed size over ASCII-only) = 4.7%
- avg(increase in compressed size over ASCII-only) = 7.9%
- ASCII-only getType, isLowerCase, isLetter, isDigit code added (along
with Location and Window.alert additions)
- sum(bytes(*.cache.html) =6,657,983
- sum(bytes(*.cache.html.gz) = 2,074,576
- avg(bytes per permutation, uncompressed) = 266,319
- avg(bytes per permutation, compressed) = 82,983
- avg(compression ratio) = 31.2%
- New UCD getType, isLowerCase, isLetter, isDigit (as above, with
module entry to activate)
- sum(bytes(*.cache.html) = 7,019,239
- sum(bytes(*.cache.html.gz) = 2,275,592
- avg(bytes per permutation, uncompressed) = 280,770
- avg(bytes per permutation, compressed) = 91,024
- avg(compression ratio) = 32.4%
- avg(absolute size increase, uncompressed) = 14,451
- avg(absolute size increase, compressed) = 8,041
- avg(increase in uncompressed size over ASCII-only) = 5.4%
- avg(increase in compressed size over ASCII-only) = 9.7%
*Summary*
- Single binary property adds 9.4k uncompressed and 4.4k after
compression
- Single numeric property adds 12.5k uncompressed and 6.5k after
compression
- Three binary properties and one numeric property adds 14.5k
uncompressed and 8k after compression
As you can see, the incremental cost for multiple tables is pretty small.
The compression ratio is slightly affected because the compressed tables
(stored in JS strings) have more dense information content than the rest of
the JS code.
Obviously, this is a significant size increase to a toy application, but to
me it seems completely reasonable for a larger applciation that needs this
functionality. Initially, the existing ASCII-only implementation is the
default and you only get the UCD version if you set the charTables property
to ucd in your module file.
Comments?
--
John A. Tamplin
Software Engineer (GWT), Google
--~--~---------~--~----~------------~-------~--~----~
http://groups.google.com/group/Google-Web-Toolkit-Contributors
-~----------~----~----~----~------~----~------~--~---