"Dmitry Olshansky" <[email protected]> wrote in message news:[email protected]... > On 29.09.2011 1:20, Nick Sabalausky wrote: >> >> Boy, I gotta say I'm really tempted to tackle this. I don't know if I >> *should* dedicate my already-tight time, but it's very tempting. And I >> have >> already written a generalized lexer generator in D ( >> www.semitwist.com/goldie ), so I have that experience (and codebase) to >> draw >> upon. >> > > Interesting and I almost forgot that we have lexer generator... What that > "generalized" bit applies to? Does it tackle CFG? Then that would have > been parser in my vocabulary ;) > Judging by first pages I see LALR(1) so definitely a parser. > I'm more into LL+something or PEGs. I'm liking the way e.g. ANTLR does > this, a very nice hybrid approach. >

It's both. It can do lexing and parsing, or just one or the other by itself. The parsing is LALR(1); the lexing is a DFA compiled from regular expressions (the syntax isn't traditional PCRE, but it's basically regex).

>> Only big question is whether it would be best to try to make Phobos's
>> existing regex engine flexible enough that it could be used by the lexer
>> (since a generalized lexer is essentially a regex engine with multiple
>> accept states, and optionally some customizable hooks). I've posted some
>> questions to that end in another branch of this thread.
>>
>
> To that end all what needs to be done is to restrict some wild stuff like
> backreferences & lookaround (how the hell thought that was good idea?!).
> Then use existing parser to get IR code for regex per each alternative,
> then fuse them via thompson construction (keeping note of terminal
> states).
> Taking Unicode into account I'd rather not go for table driven DFA. I'd
> better craft some switch statements and let the compiler sweat :)
>

I'm using a span-based table of code units instead of a plain array of code units. It seems to work fine on Unicode (much better than a flat list of code units). But yeah, you could probably do better by generating switch statements or something similar to be mixed in. I was thinking of doing that, but haven't gotten around to it.
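
To make the span-table idea concrete, here's a rough sketch of a DFA whose transitions are stored as sorted (lo, hi, next_state) spans, with more than one accept state. It's in Python rather than D just to keep it short, and it keys on code points rather than code units for simplicity. The states, ranges, and names (`TRANSITIONS`, `step`, `longest_match`, `tokenize`) are all made up for illustration; this is not Goldie's actual table layout or API.

```python
from bisect import bisect_right

# Hypothetical toy DFA. State 0 is the start state.
# Each state holds a sorted list of non-overlapping (lo, hi, next_state) spans.
LETTERS = [
    (0x41, 0x5A),    # A-Z
    (0x5F, 0x5F),    # _
    (0x61, 0x7A),    # a-z
    (0x370, 0x3FF),  # Greek block
    (0x400, 0x4FF),  # Cyrillic block
]
DIGITS = (0x30, 0x39)

TRANSITIONS = {
    0: [(*DIGITS, 2)] + [(lo, hi, 1) for lo, hi in LETTERS],
    1: [(*DIGITS, 1)] + [(lo, hi, 1) for lo, hi in LETTERS],
    2: [(*DIGITS, 2)],
}

# Multiple accept states, each tagged with the token kind it recognizes.
ACCEPT = {1: "Ident", 2: "Number"}

# Span start points per state, precomputed for binary search.
STARTS = {s: [lo for lo, _, _ in spans] for s, spans in TRANSITIONS.items()}


def step(state, ch):
    """Return the next state for ch, or None if no span covers it."""
    cp = ord(ch)
    i = bisect_right(STARTS[state], cp) - 1  # last span with lo <= cp
    if i >= 0:
        _, hi, nxt = TRANSITIONS[state][i]
        if cp <= hi:
            return nxt
    return None


def longest_match(text, pos=0):
    """Run the DFA from pos; return (kind, end) of the longest match, or None."""
    state, last, i = 0, None, pos
    while i < len(text):
        state = step(state, text[i])
        if state is None:
            break
        i += 1
        if state in ACCEPT:
            last = (ACCEPT[state], i)
    return last


def tokenize(text):
    """Split text into (kind, lexeme) pairs, skipping whitespace."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = longest_match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind, end = match
        out.append((kind, text[pos:end]))
        pos = end
    return out


print(tokenize("foo 42 привет x1"))
```

The point is that a whole Unicode block costs one table entry instead of hundreds, and lookup is a binary search over the spans. A generated switch would encode the same ranges as case ranges and leave the dispatch to the compiler.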
