> something I have wanted to do is modify Alex so that ∀ turns into the
> regular expression 0xe2 0x88 0x80 (and so forth) so that ghc (whose
> lexer is generated from alex) can simply accept utf8 input. 

I also really want to get GHC accepting UTF-8 source files, but I don't think this is 
the best way to go about it.

Sure, you can run Alex over the UTF-8 source, but the grammar will be huge.  A simpler 
way is to take advantage of the fact that Haskell only uses 5 classes of Unicode 
characters: uniSmall, uniLarge, uniWhite, uniSymbol, and uniDigit.  Alex has a good 
input abstraction behind which you can hide the translation from UTF-8 to Char, so you 
can map each of these 5 classes of Unicode characters onto its own special Char value 
(5 values in all), and use Alex unmodified.

Well, perhaps Alex will need a small modification so that its upper bound on Char 
values is variable (currently it is fixed at 255).
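
To make the idea concrete, here is a rough sketch of the class mapping, not what GHC or 
Alex actually does.  Everything in it is an assumption for illustration: the names 
UniClass, classify, surrogate and lexerChar are made up, the mapping from Data.Char 
general categories to the 5 classes is only approximate, and the stand-in values 
'\x01'..'\x05' were picked arbitrarily (a real lexer would need values that cannot 
otherwise occur in the input).

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isPunctuation, isSymbol, ord)

-- The five Unicode character classes named in the Haskell lexical syntax.
data UniClass = UniSmall | UniLarge | UniWhite | UniSymbol | UniDigit
  deriving (Eq, Show)

-- Approximate classification based on Unicode general categories.
-- (Assumption: the exact category-to-class mapping may differ in a real lexer.)
classify :: Char -> Maybe UniClass
classify c
  | cat == LowercaseLetter                       = Just UniSmall
  | cat `elem` [UppercaseLetter, TitlecaseLetter] = Just UniLarge
  | cat == DecimalNumber                         = Just UniDigit
  | cat == Space                                 = Just UniWhite
  | isSymbol c || isPunctuation c                = Just UniSymbol
  | otherwise                                    = Nothing
  where
    cat = generalCategory c

-- Hypothetical stand-in Char values, one per class, that the lexer would see.
surrogate :: UniClass -> Char
surrogate UniSmall  = '\x01'
surrogate UniLarge  = '\x02'
surrogate UniWhite  = '\x03'
surrogate UniSymbol = '\x04'
surrogate UniDigit  = '\x05'

-- What the input layer would hand to the lexer for each decoded character:
-- ASCII passes through unchanged, anything else collapses to its class value,
-- and unclassifiable characters become '\x00' (a hypothetical "invalid" marker).
lexerChar :: Char -> Char
lexerChar c
  | ord c < 128 = c
  | otherwise   = maybe '\x00' surrogate (classify c)
```

With this, the lexer rules only ever mention ASCII plus the 5 stand-in values, while the 
input layer keeps the real character around to build the token text.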

Then you have to think about whether GHC keeps strings internally in UTF-8 or as 
decoded Unicode characters.  Perhaps UTF-8 is initially easier (not much change to the 
FastString type), but this might have further ramifications.

Hmmm... looks like a good project to put on the GHC Task List!

Cheers,
        Simon
_______________________________________________
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users