#5108: Allow unicode sub/superscript symbols in both identifiers and operators
-----------------------------------+----------------------------------------
    Reporter:  mikhail.vorozhtsov  |       Owner:                  
        Type:  feature request     |      Status:  patch           
    Priority:  normal              |   Milestone:  7.4.1           
   Component:  Compiler (Parser)   |     Version:  7.1             
    Keywords:  lexer unicode       |          Os:  Unknown/Multiple
Architecture:  Unknown/Multiple    |     Failure:  None/Unknown    
  Difficulty:  Unknown             |    Testcase:                  
   Blockedby:                      |    Blocking:                  
     Related:                      |  
-----------------------------------+----------------------------------------

Comment(by mikhail.vorozhtsov):

 Replying to [comment:4 simonmar]:
 > I'm not keen on this patch for a few reasons:
 >
 >  * It's inconsistent to allow superscript/subscript on symbols.  Haskell
 >    doesn't currently allow primes on symbols, for example.
 In fact, GHC already allows Unicode primes on symbols. `alexGetByte`
 classifies OtherPunctuation characters (including primes) as `$unisymbol`.
 {{{
 $ ghci
 GHCi, version 7.2.2: http://www.haskell.org/ghc/  :? for help
 Loading package ghc-prim ... linking ... done.
 Loading package integer-gmp ... linking ... done.
 Loading package base ... linking ... done.
 Loading package ffi-1.0 ... linking ... done.
 λ> let a +′ b = a + b
 }}}
 The patch just makes sure that primes at least do not appear at the start
 of a `@varsym`. We can further restrict sub/sup characters to appear only
 in the suffix of a symbol, i.e. `@varsym = $symbol $symchar* $subsup*`.
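
 As a rough illustration only (this is not the patch), the suffix-only rule
 could be sketched in plain Haskell. The names `isSubSup`, `isSymbol`,
 `isSymChar` and `isVarSym` are hypothetical, and the character sets are
 assumptions that only approximate what the lexer does:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAscii)

-- Hypothetical stand-in for $subsup: the Superscripts and Subscripts block,
-- the three Latin-1 superscript digits, and the prime characters.
isSubSup :: Char -> Bool
isSubSup c =
  (c >= '\x2070' && c <= '\x209F')
    || (c >= '\x2032' && c <= '\x2034')
    || c `elem` "\xB2\xB3\xB9"

-- Hypothetical stand-in for $symbol: ASCII operator characters, or non-ASCII
-- characters in a symbol/punctuation category, excluding sub/sup characters.
isSymbol :: Char -> Bool
isSymbol c
  | isSubSup c = False
  | isAscii c = c `elem` "!#$%&*+./<=>?@\\^|-~"
  | otherwise =
      generalCategory c
        `elem` [MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol, OtherPunctuation]

-- Hypothetical stand-in for $symchar: a symbol character or a colon.
isSymChar :: Char -> Bool
isSymChar c = isSymbol c || c == ':'

-- @varsym = $symbol $symchar* $subsup*
isVarSym :: String -> Bool
isVarSym [] = False
isVarSym (c : cs) = isSymbol c && all isSubSup (dropWhile isSymChar cs)

main :: IO ()
main = mapM_ (print . isVarSym) ["+", "+\x2081", "\x2081+", "+\x2081+"]
```

 Under these assumptions a sub/sup character is accepted only after the
 ordinary symbol characters, so `+₁` passes while `₁+` and `+₁+` do not.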
 >  * The patch has a bunch of Unicode constants baked into it
 The same can ultimately be said about `generalCategory`; just look at
 `u_gencat`. I can move the sup/sub test to a separate inlinable function.
 >  * It adds a bunch of extra tests to the inner loop.  I haven't
 >    measured it but I wouldn't be surprised if this slows down the lexer.
 Hm, I don't know if a few extra comparisons on already rare Unicode
 characters will outweigh the binary search in `u_gencat`, let alone
 significantly increase the overall lexing time. Is there any way to stop
 GHC right after lexing so I can benchmark?
 > Perhaps it might be better just to allow the category Lm (MODIFIER
 > LETTER) as part of an identifier?  That would include all the primes and
 > subscript/superscript things.
 Lm leaves out a bunch of characters (e.g. sub/sup variants of "+" "-" "="
 "(" ")"), including the primes which, as I mentioned, are Po. Another
 drawback is that identifiers like "abcₓdef" would be accepted. BTW, we
 already can write something not-so-beautiful like:
 {{{
 λ> let ᵤxᵤy = 1
 }}}
 because "ᵤ" is in the Ll category.
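
 The categories mentioned here can be checked with `generalCategory` from
 `Data.Char`. A minimal sketch follows; the sample list is my own choice,
 and the reported categories depend on the Unicode tables shipped with the
 `base` library in use, so they may differ between GHC versions:

```haskell
import Data.Char (generalCategory, ord)
import Numeric (showHex)

-- Characters discussed in this comment: prime, superscript plus,
-- subscript x and subscript u.
samples :: [Char]
samples = "\x2032\x207A\x2093\x1D64"

-- Print each code point in hex together with its general category.
main :: IO ()
main = mapM_ report samples
  where
    report c =
      putStrLn ("U+" ++ showHex (ord c) "" ++ " " ++ show (generalCategory c))
```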

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/5108#comment:5>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

_______________________________________________
Glasgow-haskell-bugs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs
