> From: "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]>
> Date: Tue, 26 Sep 2000 15:11:23 +1100

> For 16bit character ranges, it would be necessary to
> directly store negated character sets (such as [^abc]).
> From what he told me, Doitse Swierstra is working on a lexer
> that is using explicit ranges, but I am not sure whether he
> also has negated ranges.

People with experience from other Unicode-enabled environments will
expect support for character classes like letter or digit --- which in
Unicode are not simple single ranges, but widely scattered over the
map. (Just look at Latin-1, where you have to use [A-Za-zÀ-ÖØ-öø-ÿ]
because two arithmetic operators snuck into the accented character
range. (Blame the French)).
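
To make the contrast concrete, here is a minimal Haskell sketch. The
names isLatin1Letter and isLetterClass are made up for illustration;
isAlpha is the existing Data.Char predicate, which in current GHC is
backed by Unicode tables shipped with the base library. The two
predicates are not claimed to agree on every Latin-1 character, only
on the examples noted in the comments.

```haskell
import Data.Char (isAlpha, ord)

-- Hand-written Latin-1 letter class, i.e. [A-Za-zÀ-ÖØ-öø-ÿ].
-- The gaps at U+00D7 (multiplication sign) and U+00F7 (division sign)
-- are the reason it takes five ranges instead of three.
isLatin1Letter :: Char -> Bool
isLatin1Letter c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi
    ranges =
      [ (0x41, 0x5A)  -- A-Z
      , (0x61, 0x7A)  -- a-z
      , (0xC0, 0xD6)  -- À-Ö
      , (0xD8, 0xF6)  -- Ø-ö
      , (0xF8, 0xFF)  -- ø-ÿ
      ]

-- Library-backed class: whatever the runtime's tables call a letter.
-- No ranges are spelled out here, so nothing needs editing when the
-- tables grow.
isLetterClass :: Char -> Bool
isLetterClass = isAlpha
```

Both predicates accept '\xE9' (é) and reject '\xD7', but only
isLetterClass accepts '\x3B1' (Greek alpha), which lies outside
Latin-1 altogether.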

Such support will also allow your parser to work with the next, bigger
version of Unicode, since the parser library should just inherit the
character class support from the Haskell runtime, which should in turn
get it from the OS. The OS people are already doing the work to get
the necessary tables and routines compressed into a few kilobytes.

Also, Unicode isn't 16-bit any more, it's more like 20.1 bits --- the
range is hex 0 to 10ffff. Although the official character assignments
will stay below hex 20000 or so, your code may have to work on systems
with private character assignments in the hex 100000+ range.
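
The arithmetic can be checked with a short Haskell sketch, assuming a
Char type that covers the whole code space up to hex 10ffff, as
Haskell's does. The names codePointCount, bitsNeeded,
fitsIn16Bits and isPlane16PrivateUse are made up for illustration.

```haskell
import Data.Char (ord)

-- Number of code points from 0 up to maxBound :: Char (hex 10ffff).
codePointCount :: Int
codePointCount = ord (maxBound :: Char) + 1

-- Bits needed to index that many values: log2 of hex 110000,
-- which is a little over 20.
bitsNeeded :: Double
bitsNeeded = logBase 2 (fromIntegral codePointCount)

-- True when a character can still index a 16-bit table.
fitsIn16Bits :: Char -> Bool
fitsIn16Bits c = ord c <= 0xFFFF

-- Private-use characters in the hex 100000+ range (plane 16).
-- The last two code points of the plane are noncharacters.
isPlane16PrivateUse :: Char -> Bool
isPlane16PrivateUse c = n >= 0x100000 && n <= 0x10FFFD
  where
    n = ord c
```

A lexer that indexes a 65536-entry table by character code would
have to special-case every character for which fitsIn16Bits is
False, including the private ones picked out by isPlane16PrivateUse.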

Lars Mathiesen (U of Copenhagen CS Dep) <[EMAIL PROTECTED]> (Humour NOT marked)
