Re: [bitc-dev] Unicode RegExp Hell

Jonathan S. Shapiro Mon, 28 Apr 2014 17:48:26 -0700

On Mon, Apr 28, 2014 at 4:37 PM, Matt Oliveri <[email protected]> wrote:


> Come to think of it, why _are_ you thinking about lexing so hard now?
> You're working on a bootstrap compiler, right? So wouldn't any old
> slow lexer with broken Unicode cut it for now? Wouldn't it be more fun
> to implement a serious lexer in BitC itself?


It's a fair question with a couple of different answers. In all likelihood
none of them are terribly compelling.

I already have a hand-crafted bootstrap lexer in C++ that does a perfectly
fine job. I actually started in to this because I need a parser generator.
And yes, I already have a YACC-based parser, and it has been a source of
problems. I had been thinking to extract all the keywords from the parser
description automatically. The catch is that (a) it's hard to take the
resulting list and usefully drop it into a hand-crafted parser, and (b) I
didn't realize going in how far PCRE has diverged from honest finite
automata. Around the time I started thinking about all of this I happened
to be reading Russ Cox's notes on RE2. Can't use that directly, because it
doesn't implement a notion of token IDs, so I figured I'd kick the can a
bit and give it a try.

The result hasn't been immediately useful for BitC, but it's been
horrifically educational for me. :-) At least now I know what the BitC
implementation is going to look like, and actually, the effort revealed
motivation for a couple of idioms that are still difficult to express in
BitC.

The sheer complexity of the PCRE specification is astonishing. That said,
it's darned useful to have a backtracking analyzer for certain kinds of
things.

The thing I didn't realize going in is that there simply *isn't* a
non-recursive implementation even for simple PCRE patterns. Sure, you can
simplify down and find cases for which recursion isn't required, but it's a
lot of bother to do that.


Jonathan

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode RegExp Hell

Reply via email to