> something I have wanted to do is modify Alex so that ∀ turns into the
> regular expression 0xe2 0x88 0x80 (and so forth) so that ghc (whose
> lexer is generated from alex) can simply accept utf8 input.
I also really want to get GHC accepting UTF-8 source files, but I don't think this is
the best way to go about it.
Sure, you can run Alex over the UTF-8 source, but the grammar will be huge. A simpler
way is to take advantage of the fact that Haskell only uses 5 classes of Unicode
characters: uniSmall, uniLarge, uniWhite, uniSymbol, and uniDigit. Alex has a good
input abstraction behind which you can hide the translation from UTF-8 to Char, so you
can map these 5 classes of Unicode characters onto 5 special Char values, and use Alex
unmodified.
Well, perhaps Alex will need a small modification so that its upper bound on Char
values is variable (currently it is fixed at 255).
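
A rough sketch of the idea, not actual GHC or Alex code. Every name here (decodeUtf8Char, lexerChar, nextLexerChar, and the stand-in values) is made up for illustration, and the mapping from Unicode general categories to the five classes is only my reading of the Haskell report. The stand-ins are ASCII control codes, on the assumption that they never occur in source text; picking values above 255 instead would need the upper-bound change mentioned above. uniOther is an extra sixth value for characters that fall in none of the five classes.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (GeneralCategory (..), chr, generalCategory, isAscii)
import Data.Word (Word8)

-- Hypothetical stand-in Chars, one per class (assumed unused in source text).
uniSmall, uniLarge, uniWhite, uniSymbol, uniDigit, uniOther :: Char
uniSmall  = '\x01'
uniLarge  = '\x02'
uniWhite  = '\x03'
uniSymbol = '\x04'
uniDigit  = '\x05'
uniOther  = '\x06'

-- Decode one UTF-8 character from the front of a byte list.
-- Sketch only: overlong encodings and surrogates are not rejected.
decodeUtf8Char :: [Word8] -> Maybe (Char, [Word8])
decodeUtf8Char [] = Nothing
decodeUtf8Char (b : bs)
  | b < 0x80           = Just (chr (fromIntegral b), bs)
  | b .&. 0xE0 == 0xC0 = go 1 (fromIntegral b .&. 0x1F) bs
  | b .&. 0xF0 == 0xE0 = go 2 (fromIntegral b .&. 0x0F) bs
  | b .&. 0xF8 == 0xF0 = go 3 (fromIntegral b .&. 0x07) bs
  | otherwise          = Nothing
  where
    -- Consume k continuation bytes, accumulating 6 bits from each.
    go :: Int -> Int -> [Word8] -> Maybe (Char, [Word8])
    go 0 acc rest
      | acc <= 0x10FFFF = Just (chr acc, rest)
      | otherwise       = Nothing
    go k acc (c : rest)
      | c .&. 0xC0 == 0x80 =
          go (k - 1) ((acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) rest
    go _ _ _ = Nothing

-- Collapse a decoded Char to the Char the lexer tables would see:
-- ASCII passes through, everything else becomes its class stand-in.
lexerChar :: Char -> Char
lexerChar c
  | isAscii c = c
  | otherwise = case generalCategory c of
      LowercaseLetter      -> uniSmall
      UppercaseLetter      -> uniLarge
      TitlecaseLetter      -> uniLarge
      DecimalNumber        -> uniDigit
      Space                -> uniWhite
      MathSymbol           -> uniSymbol
      CurrencySymbol       -> uniSymbol
      ModifierSymbol       -> uniSymbol
      OtherSymbol          -> uniSymbol
      ConnectorPunctuation -> uniSymbol
      DashPunctuation      -> uniSymbol
      OpenPunctuation      -> uniSymbol
      ClosePunctuation     -> uniSymbol
      InitialQuote         -> uniSymbol
      FinalQuote           -> uniSymbol
      OtherPunctuation     -> uniSymbol
      _                    -> uniOther

-- What the input abstraction would hand back: the Char for the lexer
-- tables, the real Char (kept for building the token text), and the
-- remaining input.
nextLexerChar :: [Word8] -> Maybe (Char, Char, [Word8])
nextLexerChar bytes = do
  (c, rest) <- decodeUtf8Char bytes
  return (lexerChar c, c, rest)
```

With this, the bytes 0xe2 0x88 0x80 reach the lexer as the single uniSymbol stand-in, so the grammar only ever needs five (or six) extra characters rather than a regular expression per UTF-8 sequence.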
Then you have to think about whether GHC keeps strings internally as UTF-8 or as expanded
Unicode. Perhaps UTF-8 is initially easier (not much change to the FastString type),
but this might have further ramifications.
Hmmm... looks like a good project to put on the GHC Task List!
Cheers,
Simon
_______________________________________________
Glasgow-haskell-users mailing list
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users