I've been building a regexp engine. It's an interesting case study.

Before you ask:

1. Yes, I know there are several really *good* regexp libraries out there,
the best (for my immediate purposes) probably being Russ Cox's RE2 engine.
But I haven't found one that supports multiple matching states. They'll
tell me "it matched", but not "what matched". Which you need for a lexical
analyzer. Though that could easily be bolted on to RE2 with modest changes.
I want to be able to say something along the lines of:

    RECompiler rec;
    rec.add("[A-Za-z][A-Za-z0-9]*", /* priority */ 2, /* token */ 1);  // "ident"
    rec.add("do", /* priority */ 1, /* token */ 2);                    // "do" keyword
    rec.MakeMatcher();


where the priorities are used to disambiguate keywords from identifiers.
The point is that I haven't seen any RE generator (unless it's hiding
inside Antlr) that has this notion of multiple possible results and
prioritized disambiguation.

2. Yes, I actually *do* know that a good lexical analyzer is something
better done by hand. Or at least that's the traditional view. I actually
don't believe it anymore, but that's another discussion. I certainly
wouldn't try to do a table-driven solution in the presence of UNICODE, but
a code-generated approach...

As I go, I'm thinking about how I will re-code this in BitC, and it turns
out to be a moderately interesting problem. Coding it in C++ is *also* a
moderately interesting problem.

I should explain: it's been a long time since I've written any C++ code
that needed to *free* memory in any interesting way. The Coyotos and EROS
kernels (and the KeyKOS kernel) are [de]allocation-free. Coyotos server
objects work very hard to be memoryless, so they tend to have a strict
stack discipline for allocations and can do bulk discard at the top of the
event loop. But of course *general* programs don't get to rely on those
simplifications.

It turns out that constructing NFA's is really messy. The NFA concept
itself is clear enough, but the actual conversion from regexp-as-string to
NFA is a real nuisance for a bunch of reasons:

   1. If there is a syntax error in the RE, you may have to discard part
   (but not all) of your NFA.
   2. Ignoring named REs and back-references, REs turn out to be
   hierarchical structures. There are cases where you need to clone
   (deep-copy) an existing sub-RE. NFAs are *not* hierarchical structures
   (because adding epsilon arcs has a way of connecting them in
   non-hierarchical ways). Cloning a sub-NFA is a non-trivial business, and
   especially so if you need to keep track of the allocations for
   free-on-error purposes.
   3. There isn't a clean "pool" discipline for allocating NFAs. When two
   NFAs get combined to form a third, you'd like to re-use substructure, but
   that creates inherent ambiguity about which NFA owns which nodes. Since the
   graph is cyclic, you can't use C++ boost::shared_ptr (or pure
   reference-counted GC). Doing lots of deep-copies is feasible but expensive
   (because copying cyclic graphs is a pain in the butt).

It turns out that *all* of these issues are fixable if you *don't* do the
classic RE construction. If you have reason to know that the RE is actually
valid, then you can wait until you are in MakeMatcher() to allocate all of
the NFANodes and just keep them in a local pool (because you'll be done
with them on return). So the "fix" is to first convert the regexp-as-string
into a regexp-as-acyclic-recursive-structure. Why? Because those are easy
to stick into a dictionary, easy to clone, and easy to destroy. You can
still have an error, but once again a memory pool mechanism will help.

But now we get into interfaces, patterns, and idioms.

The RE construction problem *can* be handled with interfaces. The reason is
that it is very purely modular - each RE node needs to know how to produce
an NFA graph, but it does so without caring how any other node does that.
There is no point at which any sort of a downcast is required in order to
implement this; existential encapsulation is sufficient. In this case the
set of RE-tree node types is closed, so we don't even really need
encapsulation.

But I begin to suspect that, in the absence of downcast, there are other
graph algorithms that are more difficult. Can anybody identify a case to
think about where (a) the ability to downcast really appears to be
essential, and (b) the set of underlying types is not closed? If the set of
underlying types is closed then I know how to handle downcast within
BitC-current. It's nasty, but I know how to do it. If the set is open
(think extensibility here), then RTTI becomes essential and downcast has to
be done as a language primitive (because it needs to guarantee safety of
the underlying runtime primitive).

Can anybody come up with a good example to think about as a case study here?


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev