On Tue, Apr 22, 2014 at 12:26 AM, Matt Rice <[email protected]> wrote:
> On Mon, Apr 21, 2014 at 10:51 AM, Jonathan S. Shapiro <[email protected]> > wrote: > > > "[_a-zA-Z][_a-zA-Z0-9]*"). At some point I'll probably extend that to > > case-insensitive keyword matching somehow, but for now I don't need that. > > I'm not entirely sure what to make of the particularly awkward > instances of unicode case conversion from a lexers perspective, but it > seems unlikely to have keywords in particular that admit those > specific case conversions. The Unicode case conversion rules actually aren't that bad, but I was thinking about something else entirely. The mechanism for optimizing keywords is actually pretty simple. Basically, you build the NFA and then run the resulting interpreter on every keyword. You confirm that the interpreter fails because it has arrived at two token solutions: one for an identifier class and the other for the specific keyword. You now move the keyword into an exception table and remove the RE that matched that keyword from the input set. Do this until all possible keywords have been moved into the exception table. Now re-generate the NFA from the reduced set of REs (i.e. the one in which the keywords have been removed) and then generate a DFA from that. The complication from case-insensitive keywords is that you have to run all variations, and I didn't want to deal with that in the first pass. But actually I think that a simpler, one-pass approach is doable: 1. At introduction, mark all REs that might be identifier classes so that we know which ones they are. 2. At introduction, mark every keyword RE so we know which they are. Those go to an exception table and are never added to the NFA in the first place. There are ways that this approach can go wrong. Notably: it'a possible for conventional REs to cause a keyword RE to become unreachable. But it's a lot simpler. shap
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
