[gwt-contrib] Re: Unicode support for Character.is* methods

Pascal Muetschard Wed, 31 Mar 2010 14:43:58 -0700

I have uploaded another patch set to http://gwt-code-reviews.appspot.com/226801
to address the concerns raised. See inline messages below.

This latest version has an ASCII-only option for the is*() methods,
which has an overhead of a couple hundred bytes. See below for the
size penalties (in bytes) for the tables:

is*() methods: 5396 (all of them, not each)
getDirectionality(): 2627
getType(): 4112
getNumericValue(): 2163
digit(char,int): 6779 (uses isDigit() and getNumericValue())

TOTAL: 11681 (a savings of 2617 - the shared code between each is
about 700 bytes)

I feel the size concern is well addressed: even when using all the
tables, the penalty is a mere 11k. Compared with the bootstrap code,
usually at 5k, and HashMap at 15k, that's quite small.

On Mar 16, 7:51 pm, [email protected] wrote:
> A few issues:
>
> - the way this is divided, all of the code will get pulled into every
> app that calls any of these methods.

I had thought about separating the tables out for each of the is*()
methods; however, the extra code each of the objects adds quickly
becomes much larger than the data of the tables. This means that we
would sacrifice the runtime size of the common case for the corner
case where only a single is*() method is used. Also, combining the
tables cuts out a lot of duplication, because of the inherent
relationship (i.e. if isDefined is false, all others are false as
well) and mutual exclusion (isUpperCase vs isLowerCase vs isDigit)
between the attributes.
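
To illustrate why one combined table is cheaper than one table per
method, here is a minimal sketch. It is not the code from the patch:
the class name, constants and toy data are made up, and everything
above U+007F is treated as undefined only to keep the table short.
Because the attributes are mutually exclusive, a single "kind" per
range answers every is*() query.

```java
import java.util.Arrays;

// Hypothetical sketch: one range table shared by all is*() queries.
final class CombinedTableSketch {
    // Mutually exclusive kinds; UNDEFINED implies every is*() is false.
    static final byte UNDEFINED = 0, OTHER = 1, UPPER = 2, LOWER = 3, DIGIT = 4;

    // Sorted range starts and the kind of each range (toy data, ASCII only).
    private static final int[] STARTS = {0x00, 0x30, 0x3A, 0x41, 0x5B, 0x61, 0x7B, 0x80};
    private static final byte[] KINDS = {OTHER, DIGIT, OTHER, UPPER, OTHER, LOWER, OTHER, UNDEFINED};

    // Finds the last range whose start is <= c and returns its kind.
    static byte kind(char c) {
        int i = Arrays.binarySearch(STARTS, c);
        if (i < 0) {
            i = -i - 2; // insertion point minus one
        }
        return KINDS[i];
    }

    static boolean isDefined(char c)   { return kind(c) != UNDEFINED; }
    static boolean isUpperCase(char c) { return kind(c) == UPPER; }
    static boolean isLowerCase(char c) { return kind(c) == LOWER; }
    static boolean isDigit(char c)     { return kind(c) == DIGIT; }
}
```

Splitting this into one table per method would repeat the range
boundaries and the lookup code for each of them.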

I have also added the deferred property and your "ASCII version."
However, I've made Unicode the default, as I feel that's what people
expect from GWT: to be i18n compatible by default.
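
For comparison, an ASCII-only variant needs no tables at all, which is
why its overhead is so small. The following is only a sketch of what
such a variant could look like; the class name is hypothetical and the
methods are not taken from the patch.

```java
// Hypothetical sketch of ASCII-only checks: plain range comparisons, no tables.
final class AsciiCharacterSketch {
    static boolean isDigit(char c)     { return c >= '0' && c <= '9'; }
    static boolean isUpperCase(char c) { return c >= 'A' && c <= 'Z'; }
    static boolean isLowerCase(char c) { return c >= 'a' && c <= 'z'; }
    static boolean isLetter(char c)    { return isUpperCase(c) || isLowerCase(c); }
    static boolean isLetterOrDigit(char c) { return isLetter(c) || isDigit(c); }
}
```

The trade-off is that any non-ASCII letter or digit, such as an
accented letter, is reported as neither.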

> - this is incomplete and doesn't have the other properties, such as
> getDirection, toLower, etc.

I've added getDirectionality(), getType(), getNumericValue() and
digit(char,int). I have excluded the to*Case() methods on purpose, as
their definition is not i18n correct - there are upper case characters
that need more than one character in lower case and vice versa.
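
A small example of the problem, using standard java.lang behaviour
(the class and method names here are only for illustration): the
String method can expand the German sharp s to two characters, while
the char-to-char method has no correct single-character answer and
returns its input unchanged.

```java
import java.util.Locale;

// Illustration only: char-based case mapping cannot express 1:n mappings.
final class CaseMappingSketch {
    // String-based mapping may change the length of the text.
    static String upperViaString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // char-based mapping must return exactly one char.
    static char upperViaChar(char c) {
        return Character.toUpperCase(c);
    }
}
```

Here upperViaString("\u00df") yields "SS", whereas
upperViaChar('\u00df') yields '\u00df' again.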

>
> I had written a full implementation a while ago (it is still available
> in svn at changes/jat/ucd), which encoded each table separately with a
> combination of run-length encoding and huffman coding the runs, which
> got the size of individual tables down to a few hundred bytes each, and
> you only paid for the tables that were used.  The decompression code was
> of course larger, so maybe there is room for a simpler encoding
> mechanism that takes less code even if the data is larger.

I have looked at this and mine is similar. My version also encodes run
lengths and makes sure that the most common "tokens" have the smallest
representation. It also uses LZW to compress the data. This
compression is simple and needs a lot less code to decompress, but
still provides a good compression ratio.
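
To give an idea of how little code the run-length part needs, here is
a minimal decoder sketch. It is not the format used in the patch: the
(run length, value) pair layout and all names are assumptions made for
illustration, and the LZW stage is left out entirely.

```java
import java.util.Arrays;

// Hypothetical sketch: expands (runLength, value) char pairs into a table.
final class RunLengthSketch {
    // Assumes an even-length input: char 2k is a run length, char 2k+1 its value.
    static byte[] decode(String encoded) {
        // First pass: add up the run lengths to size the output table.
        int total = 0;
        for (int i = 0; i < encoded.length(); i += 2) {
            total += encoded.charAt(i);
        }
        // Second pass: fill each run with its value.
        byte[] table = new byte[total];
        int pos = 0;
        for (int i = 0; i < encoded.length(); i += 2) {
            int run = encoded.charAt(i);
            byte value = (byte) encoded.charAt(i + 1);
            Arrays.fill(table, pos, pos + run, value);
            pos += run;
        }
        return table;
    }
}
```

Character tables compress well this way because long stretches of
neighbouring code points share the same attributes.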

>
> That effort was complete but was never merged in because some people
> objected to the code size increase.  Given the synchronous nature of the
> API, it isn't feasible to fetch the tables on-demand from a server, so
> they have to be downloaded with the code (they can go into different
> runAsync fragments though).
>
> I hope to work on that and other i18n issues next quarter, but I am not
> sure how much time I will have to work on it.

Which is why I'm trying to help with this :)

>
> http://gwt-code-reviews.appspot.com/226801

-- 
http://groups.google.com/group/Google-Web-Toolkit-Contributors