Nice writeup!
Two questions:

1) Is there anything about the structure of these tables that could
unusually affect performance, especially startup time (e.g. forcing unusual
parsing code paths in the JS engine, putting unreasonable pressure on the
GC)? I wouldn't expect that to be the case, but it's important to ensure
that we're unlikely to be triggering any pathological JS VM behaviors if the
code/data structure is particularly unusual.

2) Have you discussed with Lex and Scott how amenable your scheme is to
GWT.runAsync()? If you can ensure that it's designed  such that you could
download the decompression/parsing code and data tables after the startup
fragment, that would be ideal. I *think* runAsync() can move global
constants into separate fragments, but it's worth double-checking before you
commit to an implementation.

-- Bruce

On Thu, Sep 4, 2008 at 7:42 PM, John Tamplin <[EMAIL PROTECTED]> wrote:

> The existing Character.isLowerCase/etc and String.toLowerCase/toUpperCase
> do not work for Unicode characters in general.  This is not acceptable for a
> fully internationalized app that cares about character attributes, such as a
> word processor.
>
> I have implemented a compression scheme which uses a combination of
> run-length and Huffman encoding to make the tables as small as possible.
> First, if the tables aren't used they will not be included (nor will the
> decompression code), and any tables you don't use won't be included if you
> do use any.  The compressed tables are broken into 8k character blocks and
> decompressed on the first reference to a given block, so typically only the
> first block will be decompressed.  The binary property tables (ie, isLetter,
> isDigit, isLowerCase, etc) compress rather well, typically to a few hundred
> bytes per table.  Property tables, such as toLowerCase or getType, require
> more space because the values take more space to store and cannot just be
> assumed to alternate like the binary values after run compression.  The
> decompression code takes around 9k before compression.  To further save
> space, some of the tables are stored relative to some other table, such as
> isJavaIdentifierStart is stored as the XOR of isLetterOrDigit.
>
> This is all on changes/jat/ucd, which is a pre-1.5 branch.  The compiler
> may be better at inlining/etc since the fork, but I think that is likely to
> only reduce the cost of adding this functionality and wanted to get feedback
> before updating it to the current trunk.
>
> To get an idea of the impact of adding this functionality, I added a small
> bit of code to Showcase to call a Character.* method on the hash part of the
> URL (just so the compiler would have no way to know what the value was), and
> do a Window.alert on it (so the compiler couldn't discard it).
>
>    - Unmodified
>       - sum(bytes(*.cache.html) = 6,640,123
>       - sum(bytes(*.cache.html.gz) = 2,069,283
>       - avg(bytes per permutation, uncompressed) = 265,605
>       - avg(bytes per permutation, compressed) = 82,771
>       - avg(compression ratio) = 31.2% (lower is better)
>       - Old ASCII-only isLowerCase code added (along with Location and
>    Window.alert additions)
>    - sum(bytes(*.cache.html) = 6,648,933
>       - sum(bytes(*.cache.html.gz) = 2,071,623
>       - avg(bytes per permutation, uncompressed) = 265,957
>       - avg(bytes per permutation, compressed) = 82,865
>       - avg(compression ratio) = 31.2%
>       - New UCD isLowerCase (as above, with module entry to activate)
>       - sum(bytes(*.cache.html) = 6,883,614
>       - sum(bytes(*.cache.html.gz) = 2,181,641
>       - avg(bytes per permutation, uncompressed) = 275,345
>       - avg(bytes per permutation, compressed) = 87,266
>       - avg(compression ratio) = 31.7%
>       - avg(absolute size increase, uncompressed) = 9,388
>       - avg(absolute size increase, compressed) = 4,401
>       - avg(increase in uncompressed size over ASCII-only) = 3.5%
>       - avg(increase in compressed size over ASCII-only) = 5.3%
>    - ASCII-only getType code added (along with Location and Window.alert
>    additions)
>    - sum(bytes(*.cache.html) =6,656,933
>       - sum(bytes(*.cache.html.gz) = 2,074,279
>       - avg(bytes per permutation, uncompressed) = 266,277
>       - avg(bytes per permutation, compressed) = 82,971
>       - avg(compression ratio) = 31.2%
>       - New UCD getType (as above, with module entry to activate)
>       - sum(bytes(*.cache.html) = 6,969,939
>       - sum(bytes(*.cache.html.gz) = 2,238,000
>       - avg(bytes per permutation, uncompressed) = 278,798
>       - avg(bytes per permutation, compressed) = 89,520
>       - avg(compression ratio) = 32.1%
>       - avg(absolute size increase, uncompressed) = 12,521
>       - avg(absolute size increase, compressed) = 6,549
>       - avg(increase in uncompressed size over ASCII-only) = 4.7%
>       - avg(increase in compressed size over ASCII-only) = 7.9%
>       - ASCII-only getType, isLowerCase, isLetter, isDigit code added
>    (along with Location and Window.alert additions)
>    - sum(bytes(*.cache.html) =6,657,983
>       - sum(bytes(*.cache.html.gz) = 2,074,576
>       - avg(bytes per permutation, uncompressed) = 266,319
>       - avg(bytes per permutation, compressed) = 82,983
>       - avg(compression ratio) = 31.2%
>       - New UCD getType, isLowerCase, isLetter, isDigit (as above, with
>    module entry to activate)
>       - sum(bytes(*.cache.html) = 7,019,239
>       - sum(bytes(*.cache.html.gz) = 2,275,592
>       - avg(bytes per permutation, uncompressed) = 280,770
>       - avg(bytes per permutation, compressed) = 91,024
>       - avg(compression ratio) = 32.4%
>       - avg(absolute size increase, uncompressed) = 14,451
>       - avg(absolute size increase, compressed) = 8,041
>       - avg(increase in uncompressed size over ASCII-only) = 5.4%
>       - avg(increase in compressed size over ASCII-only) = 9.7%
>
> *Summary*
>
>    - Single binary property adds 9.4k uncompressed and 4.4k after
>    compression
>    - Single numeric property adds 12.5k uncompressed and 6.5k after
>    compression
>    - Three binary properties and one numeric property adds 14.5k
>    uncompressed and 8k after compression
>
> As you can see, the incremental cost for multiple tables is pretty small.
>
> The compression ratio is slightly affected because the compressed tables
> (stored in JS strings) have more dense information content than the rest of
> the JS code.
>
> Obviously, this is a significant size increase to a toy application, but to
> me it seems completely reasonable for a larger applciation that needs this
> functionality.  Initially, the existing ASCII-only implementation is the
> default and you only get the UCD version if you set the charTables property
> to ucd in your module file.
>
> Comments?
>
> --
> John A. Tamplin
> Software Engineer (GWT), Google
>
> >
>

--~--~---------~--~----~------------~-------~--~----~
http://groups.google.com/group/Google-Web-Toolkit-Contributors
-~----------~----~----~----~------~----~------~--~---

Reply via email to