**Question/food-for-thought:**

What would be the most sensible way to go about supporting Unicode values in a 
lexer that uses `lexbase` / `BaseLexer` as its basis?

I mean, as far as I can tell, with `BaseLexer` we load our string/input into a 
buffer and move through it byte-by-byte:
    
    
    while true:
      setLen(p.value, 0)

      case p.buf[p.bufpos]
      of someChar:
        ...

But the whole thing becomes considerably more complicated when the "char" we 
are after, even if it's a single character, is a Unicode character; in that 
case I end up testing a whole series of bytes (like `p.buf[p.bufpos]`, 
`p.buf[p.bufpos+1]`, etc.).
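For illustration, here is a minimal sketch of the alternative I have in mind: decoding a whole `Rune` at the current position with `std/unicode`'s `fastRuneAt`, instead of comparing the individual bytes by hand. This assumes the buffer is an ordinary Nim `string` and, importantly, that the whole UTF-8 sequence is already in the buffer (with `lexbase` one would still have to make sure a buffer refill can't split a multi-byte sequence); `peekRune` is a hypothetical helper name, not anything from `lexbase`:

```nim
import std/unicode

# Decode the Rune starting at `pos` and report how many bytes it occupies,
# so the lexer can advance bufpos by that amount.
proc peekRune(buf: string, pos: int): (Rune, int) =
  var r: Rune
  var i = pos
  fastRuneAt(buf, i, r, true)   # decodes one rune, advances i past it
  result = (r, i - pos)         # the rune plus its byte length

let buf = "λx"
let (r, width) = peekRune(buf, 0)
# r is the rune 'λ' (U+03BB) and width is 2, since λ is a
# two-byte UTF-8 sequence
```

With something like this, the `case` in the lexer could compare a single decoded `Rune` against the characters of interest, rather than chains of byte comparisons.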

Here's an example of what I'm talking about: 
<https://github.com/arturo-lang/arturo/blob/master/src/vm/parse.nim#L945-L967>

...which looks rather ugly, and isn't easy to debug or reason about.

So, how would you go about it?
