Lars Henrik Mathiesen <[EMAIL PROTECTED]> wrote,
> > From: "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]>
> > Date: Tue, 26 Sep 2000 15:11:23 +1100
>
> > For 16bit character ranges, it would be necessary to
> > directly store negated character sets (such as [^abc]).
> > From what he told me, Doitse Swierstra is working on a lexer
> > that is using explicit ranges, but I am not sure whether he
> > also has negated ranges.
>
> People with experience from other Unicode-enabled environments will
> expect support for character classes like letter or digit --- which in
> Unicode are not simple single ranges, but widely scattered over the
> map. (Just look at Latin-1, where you have to write [A-Za-zÀ-ÖØ-öø-ÿ]
> because two arithmetic operators, × and ÷, snuck into the accented
> character range. Blame the French.)
>
> Such support will also allow your parser to work with the next, bigger
> version of Unicode, since the parser library should just inherit the
> character class support from the Haskell runtime, which should in turn
> get it from the OS. The OS people are already doing the work to get
> the necessary tables and routines compressed into a few kilobytes.
Hmm, this seems like a shortcoming in the Haskell spec. We
have all these isAlpha, isDigit, etc. functions, but there is
no way to get at the list of, say, all characters for which
isAlpha is true.
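Just to illustrate: the best you can do today is brute force. A sketch
(using the modern Data.Char module name; in Haskell 98 the same
functions live in Char, and the upper bound of Char is
implementation-dependent):

```haskell
import Data.Char (isAlpha)

-- Enumerate every character the implementation considers alphabetic
-- by walking the whole Char range -- a brute-force stand-in for a
-- table the library could expose directly.
alphaChars :: [Char]
alphaChars = [c | c <- [minBound .. maxBound], isAlpha c]

main :: IO ()
main = putStrLn (take 5 alphaChars)  -- prints "ABCDE"
```

This works, but it rescans the full code space; a lexer generator
would rather get the class as explicit ranges from the library.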
> Also, Unicode isn't 16-bit any more, it's more like 20.1 bits --- the
> range is hex 0 to 1fffff. Although the official character assignments
> will stay below hex 20000 or so, your code may have to work on systems
> with private character assignments in the hex 100000+ range.
Ok, I didn't really mean that the extension I mentioned will
rely on Unicode being 16 bits. My point was only that, at
that size, you no longer want to build an exhaustive
transition table.
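To sketch what I mean (my own invented names, not Doitse's actual
lexer): a character set can be stored as sorted ranges plus a
negation flag, so a class like [^abc] costs space proportional to the
set description, not to the size of the alphabet:

```haskell
-- Hypothetical sketch: a character set as explicit ranges plus a
-- negation flag, instead of a 2^16 (or 2^21) entry transition table.
data CharSet = CharSet
  { negated :: Bool            -- True for classes like [^abc]
  , ranges  :: [(Char, Char)]  -- sorted, disjoint, inclusive ranges
  }

-- Membership is linear in the number of ranges, independent of the
-- size of the full character repertoire.
member :: Char -> CharSet -> Bool
member c (CharSet neg rs) =
  neg /= any (\(lo, hi) -> lo <= c && c <= hi) rs

-- [^abc] represented as a negated single range.
notABC :: CharSet
notABC = CharSet True [('a', 'c')]
```

With this representation, growing the code space from 16 to 21 bits
changes nothing in the lexer tables themselves.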
Manuel