Re: [bitc-dev] Unicode RegExp Hell

Jonathan S. Shapiro Thu, 17 Apr 2014 11:30:40 -0700

OK. One last point occurred to me about bytecodes for DFAs. While we want a
DFA to execute as a single thread, we need to get the right result state
(i.e. token ID) in the presence of longest match. For this reason, we need
a notion that says "match the input as token T provided the input pointer
does not advance further".


I've also decided that, for the moment, I'm going to go with

CHAR 'c' <pc on match>


rather than introduce an ONMATCH instruction. The main reason for this is
that it simplifies the implementation of the NFA engine. The reason for "pc
on match" rather than "pc on fail" is that a dense sequence of CHAR
operations can be sorted and then emitted efficiently.

So here's where that leaves us for a minimal set of bytecode instructions:

SPLIT *pc1, pc2   *Initiates non-deterministic parallel execution.
                 If present, this is an NFA.
JMP *pc*           Redirects execution to some other instruction
BYTE *b* *pc*        If the input pointer points to the byte value *b*,
                 advance to the given *pc*, else go the the next
                 instruction (this replaces CHAR).
TOKEN *prio*, *id*   Indicates that we currently have a match for a
                 token *id* with priority *prio.* The current input
                 position corresponds to the end of the match.
                 If a higher priority token result already exists,
                 this opcode has no effect
STOP             Halts the current thread.
STOPALL          Halts ALL threads

I think that covers what we need. Note that STOPALL actually subsumes the
priority feature of TOKEN. This is intentional. The intent is to start with
a priority-enabled NFA and simplify that to an NFA that does not make use
of priorities.

It is an error for two threads to execute TOKEN instructions at the same
priority with distinct IDs.

Note that the NFA execution machine only needs to maintain a single
(priority, id, end pointer) tuple. That state does not need to be
maintained on a per-thread basis.

One thing that is *not* supported by this opcode set is trailing context,
e.g. match "ab" but only when followed by ':'. There are ways to do that,
but they require adding per-thread state to remember what the end pointer
was. Things like ^ $ and \W can be done here because they are
non-consuming. The problem arises when we need trailing context that
requires more than single byte lookahead.

Ah! This can be done by tracking the current input position and the
trailing match position separately. BYTE advances both pointers. PEEK (a
new instruction) only advances the input position.

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode RegExp Hell

Reply via email to