#5108: Allow unicode sub/superscript symbols in both identifiers and operators
-----------------------------------+----------------------------------------
Reporter: mikhail.vorozhtsov | Owner:
Type: feature request | Status: patch
Priority: normal | Milestone: 7.4.1
Component: Compiler (Parser) | Version: 7.1
Keywords: lexer unicode | Os: Unknown/Multiple
Architecture: Unknown/Multiple | Failure: None/Unknown
Difficulty: Unknown | Testcase:
Blockedby: | Blocking:
Related: |
-----------------------------------+----------------------------------------
Comment(by mikhail.vorozhtsov):
Replying to [comment:4 simonmar]:
> I'm not keen on this patch for a few reasons:
>
> * It's inconsistent to allow superscript/subscript on symbols. Haskell
> doesn't currently allow primes on symbols, for example.
In fact, GHC already allows Unicode primes in symbols: `alexGetByte`
classifies OtherPunctuation characters (including primes) as `$unisymbol`.
{{{
$ ghci
GHCi, version 7.2.2: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
λ> let a +′ b = a + b
}}}
The patch just makes sure that primes at least do not appear at the start
of a `@varsym`. We can further restrict sub/sup characters to appear only
in the suffix of a symbol, i.e. `@varsym = $symbol $symchar* $subsup*`.
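A minimal sketch of that suffix-only rule as a plain Haskell predicate, outside Alex. All three function names are hypothetical, the code-point ranges are an assumption rather than the constants in the patch, and the set of categories treated as symbols is a simplification of what the lexer does:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAscii)

-- Hypothetical suffix class: the "Superscripts and Subscripts" block, the
-- Latin-1 superscripts, and the primes. The ranges in the patch may differ,
-- and the block range also covers a few unassigned code points.
isSuffixChar :: Char -> Bool
isSuffixChar c =
  (c >= '\x2070' && c <= '\x209F')        -- superscripts and subscripts block
    || c `elem` ['\xB2', '\xB3', '\xB9']  -- Latin-1 superscript two, three, one
    || (c >= '\x2032' && c <= '\x2037')   -- primes and reversed primes
    || c == '\x2057'                      -- quadruple prime

-- Simplified stand-in for $symbol: ASCII operator characters, or a non-ASCII
-- character in a symbol-like category that is not a suffix character.
isSymbolChar :: Char -> Bool
isSymbolChar c
  | isAscii c = c `elem` "!#$%&*+./<=>?@\\^|-~"
  | isSuffixChar c = False
  | otherwise =
      generalCategory c
        `elem` [OtherPunctuation, MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol]

-- @varsym = $symbol $symchar* $subsup*
isVarSym :: String -> Bool
isVarSym [] = False
isVarSym (c : cs) = isSymbolChar c && all isSuffixChar (dropWhile isSymChar cs)
  where
    -- $symchar additionally allows ':' after the first character
    isSymChar x = isSymbolChar x || x == ':'
```

With these definitions `isVarSym "+₁"` and `isVarSym "+′"` hold, while `"₁+"`, `"′+"` and `"+₁+"` are rejected, because a suffix character may only be followed by more suffix characters.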
> * The patch has a bunch of Unicode constants baked into it
The same can ultimately be said about `generalCategory`: just look at
`u_gencat`. I can move the sub/sup test into a separate inlinable function.
> * It adds a bunch of extra tests to the inner loop. I haven't
> measured it but I wouldn't be surprised if this slows down the lexer.
Hm, I don't know if a few extra comparisons on already rare Unicode
characters will outweigh the binary search in `u_gencat`, let alone
significantly increase the overall lexing time. Is there any way to stop
GHC right after lexing so I can benchmark?
> Perhaps it might be better just to allow the category Lm (MODIFIER
> LETTER) as part of an identifier? That would include all the primes and
> subscript/superscript things.
Lm leaves out a bunch of characters (e.g. the sub/sup variants of "+", "-",
"=", "(", ")"), including the primes which, as I mentioned, are Po. Another
drawback is that identifiers like "abcₓdef" would be accepted. BTW, we
can already write something not-so-beautiful like:
{{{
λ> let ᵤxᵤy = 1
}}}
because "ᵤ" is in the Ll category.
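The categories mentioned above can be checked directly with `generalCategory` from `Data.Char`. The helper below is only an illustrative sketch (`describe`, `samples` and `report` are hypothetical names), and the answer for some characters depends on the Unicode tables shipped with the installed `base`, so the subscript letters in particular may be reported differently by other GHC versions:

```haskell
import Data.Char (generalCategory, toUpper)
import Numeric (showHex)

-- Render a character as "U+XXXX <GeneralCategory>".
describe :: Char -> String
describe c = "U+" ++ pad (showHex (fromEnum c) "") ++ " " ++ show (generalCategory c)
  where
    -- Left-pad the hex code point to four digits and upper-case it.
    pad s = replicate (4 - length s) '0' ++ map toUpper s

-- Characters discussed in this comment: prime, subscript plus, subscript
-- left parenthesis, subscript x, subscript u.
samples :: [Char]
samples = ['\x2032', '\x208A', '\x208D', '\x2093', '\x1D64']

-- One line per sample character.
report :: String
report = unlines (map describe samples)
```

Running `putStr report` in GHCi lists one character per line. The prime comes out as `OtherPunctuation` and the subscript plus as `MathSymbol`, neither of which is Lm.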
--
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/5108#comment:5>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler