Re: compact char mapping

Markus Kuhn Mon, 03 Jul 2000 10:59:01 -0700
[EMAIL PROTECTED] wrote on 2000-07-03 17:29 UTC:
> in analogy to your compact char class function for word selection, 
> is there such a compact function already available for toupper and 
> tolower (and perhaps totitle) mapping? I've browsed the new glibc and 
> it seems a large table is used there for that purpose.

I don't know of any, but here is how I would implement it: Cut the Unicode
space into intervals x..y, in which the 4-tuple

  (A, B, C, D)

is constant, where 

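  toupper(c) = (c & 1) ? A+c : B+c
  tolower(c) = (c & 1) ? C+c : D+c

for all x <= c <= y. In other words, A and C are the signed offsets
applied to odd code points, and B and D those applied to even ones.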

Then use a binary search to find the stored interval that contains c,
and evaluate toupper(c) and tolower(c) as above. Totitle(c) can probably
be derived from these with just a small stored exception table (another
binary or hash search).
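
A minimal sketch of such a titlecase exception table, under my own
assumptions: the names title_table, stub_toupper and sketch_totitle are
hypothetical, and stub_toupper is only an ASCII stand-in for the real
interval-based toupper. The table lists the four Latin digraph triples
(DZ with caron, LJ, NJ, DZ), whose titlecase form differs from their
uppercase form; a complete table would have to be generated from the
Unicode data files.

```c
#include <stddef.h>
#include <stdint.h>

/* One exception: code point c has the titlecase form title. */
struct title_exception {
    uint32_t c;
    uint32_t title;
};

/* Sorted by c so that it can be binary-searched. Illustrative only. */
static const struct title_exception title_table[] = {
    { 0x01C4, 0x01C5 }, { 0x01C5, 0x01C5 }, { 0x01C6, 0x01C5 },
    { 0x01C7, 0x01C8 }, { 0x01C8, 0x01C8 }, { 0x01C9, 0x01C8 },
    { 0x01CA, 0x01CB }, { 0x01CB, 0x01CB }, { 0x01CC, 0x01CB },
    { 0x01F1, 0x01F2 }, { 0x01F2, 0x01F2 }, { 0x01F3, 0x01F2 },
};

/* Hypothetical stand-in for the interval-based toupper: ASCII only. */
static uint32_t stub_toupper(uint32_t c)
{
    return (c >= 0x61 && c <= 0x7A) ? c - 32 : c;
}

/* Look c up in the exception table; fall back to toupper otherwise. */
uint32_t sketch_totitle(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof title_table / sizeof title_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < title_table[mid].c)
            hi = mid;
        else if (c > title_table[mid].c)
            lo = mid + 1;
        else
            return title_table[mid].title;
    }
    return stub_toupper(c);
}
```

With this table, sketch_totitle(0x01C6) gives 0x01C5, while any code
point outside the table simply takes the toupper path.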

The interval scheme allows regions with separate uppercase and lowercase
blocks (Latin-1), as well as regions with alternating cases (Latin
Extended-A), to be encoded as entire intervals.
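
The interval search could be sketched roughly as follows. This is only
an illustration under my own assumptions: the names case_interval,
case_table, find_interval, sketch_toupper and sketch_tolower are made
up here, and the table is a tiny hand-written excerpt (ASCII, most of
Latin-1, and two alternating runs from Latin Extended-A), not a complete
mapping. Special cases such as U+00B5, U+00DF and U+00FF are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* An interval first..last over which the tuple (A, B, C, D) is constant. */
struct case_interval {
    uint32_t first, last;
    int32_t upper_odd;   /* A: added for toupper when c is odd  */
    int32_t upper_even;  /* B: added for toupper when c is even */
    int32_t lower_odd;   /* C: added for tolower when c is odd  */
    int32_t lower_even;  /* D: added for tolower when c is even */
};

/* Sorted, non-overlapping intervals. Illustrative excerpt only. */
static const struct case_interval case_table[] = {
    { 0x0041, 0x005A,   0,   0, 32, 32 },  /* A..Z                   */
    { 0x0061, 0x007A, -32, -32,  0,  0 },  /* a..z                   */
    { 0x00C0, 0x00D6,   0,   0, 32, 32 },  /* Latin-1 uppercase      */
    { 0x00D8, 0x00DE,   0,   0, 32, 32 },
    { 0x00E0, 0x00F6, -32, -32,  0,  0 },  /* Latin-1 lowercase      */
    { 0x00F8, 0x00FE, -32, -32,  0,  0 },
    { 0x0100, 0x012F,  -1,   0,  0,  1 },  /* even upper, odd lower  */
    { 0x0139, 0x0148,   0,  -1,  1,  0 },  /* odd upper, even lower  */
};

/* Binary search for the interval containing c; NULL if there is none. */
static const struct case_interval *find_interval(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof case_table / sizeof case_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < case_table[mid].first)
            hi = mid;
        else if (c > case_table[mid].last)
            lo = mid + 1;
        else
            return &case_table[mid];
    }
    return NULL;
}

/* toupper(c) = (c & 1) ? A+c : B+c; unchanged outside all intervals. */
uint32_t sketch_toupper(uint32_t c)
{
    const struct case_interval *iv = find_interval(c);
    if (iv == NULL)
        return c;
    return (uint32_t)((int32_t)c + ((c & 1) ? iv->upper_odd
                                            : iv->upper_even));
}

/* tolower(c) = (c & 1) ? C+c : D+c; unchanged outside all intervals. */
uint32_t sketch_tolower(uint32_t c)
{
    const struct case_interval *iv = find_interval(c);
    if (iv == NULL)
        return c;
    return (uint32_t)((int32_t)c + ((c & 1) ? iv->lower_odd
                                            : iv->lower_even));
}
```

Note how a single entry covers all 48 code points of 0x0100..0x012F even
though uppercase and lowercase alternate there: the parity test picks
the right offset. Code points that fall into no interval are returned
unchanged.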

I would be surprised if more than 2000 bytes were needed to store the
entire interval case mapping table.

For performance reasons, you might want to cover some hot spots (such as
the ASCII region) with slightly faster lookup tables, but only if
profiling suggests that you spend too much time in tolower/toupper.
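
One way such a hot-spot table might look, again only as a sketch: the
names ascii_upper, build_ascii_table, hot_toupper and slow_toupper are
hypothetical, and slow_toupper is a placeholder for the general interval
search that here handles just the Latin-1 lowercase letters. The lazy
initialisation is not thread-safe; a real implementation would more
likely use a precomputed constant table.

```c
#include <stdint.h>

/* Direct lookup table for the ASCII range, filled on first use. */
static uint8_t ascii_upper[128];
static int ascii_ready;

static void build_ascii_table(void)
{
    for (unsigned i = 0; i < 128; i++)
        ascii_upper[i] = (uint8_t)((i >= 0x61u && i <= 0x7Au) ? i - 32 : i);
    ascii_ready = 1;
}

/* Placeholder for the general (slower) path: Latin-1 lowercase only. */
static uint32_t slow_toupper(uint32_t c)
{
    if ((c >= 0xE0 && c <= 0xF6) || (c >= 0xF8 && c <= 0xFE))
        return c - 32;
    return c;
}

/* ASCII goes through the table; everything else takes the slow path. */
uint32_t hot_toupper(uint32_t c)
{
    if (c < 128) {
        if (!ascii_ready)
            build_ascii_table();   /* not thread-safe, sketch only */
        return ascii_upper[c];
    }
    return slow_toupper(c);
}
```

Whether the extra 128 bytes and the branch are worth it is exactly the
kind of question that profiling should answer first.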

If you work with Unicode and find yourself using tables with 2**16
entries, then this is definitely a sign that you are doing something
very wrong.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/