I have a confession to make: I'm an idiot. If you've been on this list for
any length of time you already know this, but I thought it might be a good
idea to warn any relative newcomers. Today, I am an idiot on the subject of
regular expression processing for Unicode input.

It's curious. I know (specifically) how I went wrong, but I'm still puzzled
about how I got there. It seems to me, in hindsight, that I followed good
design principles to a bad end, and I'd be interested to hear any
suggestions that might debug this rather strange meta-level outcome.

The context here was that I diverted to build a regular expression engine
for Unicode regular expressions. Several of you pointed me at existing
engines, notably RE2, but I wanted this for use in a lexical analyzer, and
that introduced two specific requirements that don't seem to be supported
well by "off the shelf" RE engines:

   1. The ability to match multiple REs simultaneously (in parallel) and
   return a preset ID indicating which one matched. In effect: to return a
   token ID in addition to the string matched. This is a "generic" capability
   (see the interface sketch after this list).
   2. The ability to "collapse" matches, so that (e.g.) we might recognize
   that the keyword "do" is matched by the generic identifier pattern, and
   that the right way (empirically) to handle this is to first match generic
   identifiers and then use table lookup to filter for keywords. This is an
   application-specific capability, but one that is well motivated for any
   number of applications, and especially so in a language that admits
   lexically scoped modifications to syntax tables (a.k.a. mixfix).
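To make the first requirement concrete, here is roughly the interface shape
I have in mind, with a trivial hand-written matcher standing in for the real
RE engine (all names here are hypothetical; the point is only that the
caller gets back a token ID and a match length rather than just a string):

#include <ctype.h>
#include <stdio.h>

enum { TOK_NONE = -1, TOK_IDENT = 1, TOK_NUMBER = 2 };

typedef struct {
    int    token_id;   /* which pattern matched (the preset ID), or TOK_NONE */
    size_t length;     /* length of the matched prefix, in bytes */
} LexMatch;

/* Stand-in for "match all REs in parallel, longest match wins". */
static LexMatch lex_match(const char *s)
{
    LexMatch m = { TOK_NONE, 0 };
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        while (isalnum((unsigned char)s[n]) || s[n] == '_') n++;
        m.token_id = TOK_IDENT;
        m.length = n;
    } else if (isdigit((unsigned char)s[0])) {
        while (isdigit((unsigned char)s[n])) n++;
        m.token_id = TOK_NUMBER;
        m.length = n;
    }
    return m;
}

int main(void)
{
    LexMatch m = lex_match("do42 + 7");
    printf("token %d, length %zu\n", m.token_id, m.length); /* token 1, length 4 */
    return 0;
}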

At first glance, the second one is an optimization. Given a fixed set of
alphanumeric and syntactic keywords that are known at static compile time,
it's simple (though inefficient) to build an NFA/DFA that correctly picks
the keywords out of the crowd. But in a mixfix language you have lexically
scoped quasi-keywords. As a practical matter, you end up with a design
where you have a generic alphanumeric identifier matcher and a generic
syntactic identifier matcher and then you have a set of exceptional
quasi-keywords that are drawn from those spaces. This leads to an
implementation approach in which you first match the generic identifiers
and then process a (stack of) keyword tables that are treated as
exceptions. The catch is that the quasi-keyword tables can be extended at
compile time, so you don't want an implementation that relies on building
knowledge of keywords into the NFA/DFA.
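To illustrate the shape of that approach, here is a minimal sketch (made-up
names, not my actual implementation) of the keyword filter: after the
generic identifier has been matched, we simply walk a stack of lexically
scoped keyword tables, innermost first, and fall back to the generic token
ID when nothing matches:

#include <stdio.h>
#include <string.h>

enum { TOK_IDENT = 1, TOK_DO = 100, TOK_WHILE = 101 };

typedef struct {
    const char *lexeme;
    int         token_id;
} Keyword;

typedef struct KeywordScope {
    const Keyword       *entries;
    size_t               count;
    struct KeywordScope *outer;    /* enclosing lexical scope, or NULL */
} KeywordScope;

/* Look the already-matched identifier up in the scope stack, innermost
   scope first, so locally introduced quasi-keywords shadow outer ones. */
static int classify(const KeywordScope *scope, const char *lexeme)
{
    for (; scope != NULL; scope = scope->outer)
        for (size_t i = 0; i < scope->count; i++)
            if (strcmp(scope->entries[i].lexeme, lexeme) == 0)
                return scope->entries[i].token_id;
    return TOK_IDENT;   /* not a keyword in any visible scope */
}

int main(void)
{
    static const Keyword global_kw[] = { { "do", TOK_DO }, { "while", TOK_WHILE } };
    KeywordScope global = { global_kw, 2, NULL };

    printf("%d\n", classify(&global, "do"));     /* 100 */
    printf("%d\n", classify(&global, "dough"));  /* 1   */
    return 0;
}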

So I went off and read all of Russ Cox's writings on RE2, and decided that
(a) it was too complicated to re-implement in full, and (b) I didn't need
some of what it did. In the process, I misread Russ's description. He
states at the front that RE2 matches Unicode code points. I took this as
correct, and it misled me. In fact it's completely wrong. RE2 matches UTF-8
encoded strings. Let me start my confession by explaining why this matters.

*Code Points vs UTF-8 Strings*

If you match code points, you soon conclude that you need a way to define
sparse sets of transitions. The underlying problem is that most of the
Unicode character classes are not densely encoded. A naively encoded set of
the form { (codepoint, state *) ... } requires prohibitive space. So the
first thing you do is try to come up with a clever notion of a UCS range
set and a UCS range map that maintains dense encodings where possible. That
isn't as simple as it sounds. In fact, it's fraught with opportunities to
go wrong.
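To give a sense of what such a range map ends up looking like (a sketch
only, with hypothetical names, not the code I actually wrote): a sorted
array of disjoint code point ranges, each mapping to a successor state,
searched by binary search.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t lo, hi;   /* inclusive code point range */
    int      target;   /* successor state */
} RangeEntry;

/* Return the target state for code point 'cp', or -1 if unmapped. */
static int range_map_lookup(const RangeEntry *map, size_t n, uint32_t cp)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < map[mid].lo)       hi = mid;
        else if (cp > map[mid].hi)  lo = mid + 1;
        else                        return map[mid].target;
    }
    return -1;
}

int main(void)
{
    /* A toy slice of \p{Greek}: two of its many ranges. */
    static const RangeEntry greek[] = {
        { 0x0370, 0x0373, 1 },
        { 0x0391, 0x03A1, 1 },
    };
    printf("%d\n", range_map_lookup(greek, 2, 0x0391));  /* 1  (GREEK CAPITAL ALPHA) */
    printf("%d\n", range_map_lookup(greek, 2, 0x0041));  /* -1 (LATIN 'A' is not Greek) */
    return 0;
}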

Conversely, if you match against UTF-8 encoded strings, then you are really
doing byte-level matching and you can revert to (relatively) dense bit-sets
to encode byte values. There are two complications with this:

   1. Architectural: this feels like a premature commitment to a particular
   representation. UTF-8 is not Unicode, and it seems like a premature
   optimization to build the RE implementation to assume that the input will
   be UTF-8. As one example, this extends poorly to Java or C#, where
   characters are represented as UCS2 code units.
   2. Negation: moving to bytes complicates character class negation.

The character class negation problem is as follows. Given an *ad hoc* character
class to be matched of the form "[a\p{Greek}]", note that some elements are
one byte long (the 'a' is ASCII) and others are multibyte. So then given
"[-a\p{Greek}]" how do you implement the negation of the character class?
At first this seems quite nasty, because in general an NFA epsilon
transitions, so negation is a pain in the butt. These may encode loops (REs
involving * and +) so negation is even more nasty than it seems at first
glance. Only today did I realize that the RE input
syntax <https://code.google.com/p/re2/wiki/Syntax> restricts negation in
a useful way. If a negation appears in well-formed
input, the subject of the negation necessarily consists of a character
class. A character class, by syntactic construction, consists of a
(recursive) union of matches. That is: it does not contain epsilon
transitions and therefore can be negated with only modest care.
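Concretely, once you know that the thing being negated is just a union of
code point ranges, negation reduces to complementing a sorted, disjoint
range set against [0, 0x10FFFF], and only then compiling the result down to
byte-level matching. A sketch of the complement step (hypothetical types;
it assumes the input ranges are already sorted and disjoint):

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t lo, hi; } Range;   /* inclusive */

#define MAX_CODEPOINT 0x10FFFFu

/* Write the complement of in[0..n_in) into out; return the number of
   ranges produced. 'out' must have room for n_in + 1 entries. */
static size_t complement_ranges(const Range *in, size_t n_in, Range *out)
{
    size_t n_out = 0;
    uint32_t next = 0;   /* first code point not yet accounted for */

    for (size_t i = 0; i < n_in; i++) {
        if (in[i].lo > next) {
            out[n_out].lo = next;
            out[n_out].hi = in[i].lo - 1;
            n_out++;
        }
        next = in[i].hi + 1;
    }
    if (next <= MAX_CODEPOINT) {
        out[n_out].lo = next;
        out[n_out].hi = MAX_CODEPOINT;
        n_out++;
    }
    return n_out;
}

int main(void)
{
    /* Complement of [a\p{Greek}], using a toy one-range stand-in for Greek. */
    static const Range cls[] = { { 'a', 'a' }, { 0x0391, 0x03A1 } };
    Range neg[3];
    size_t n = complement_ranges(cls, 2, neg);
    for (size_t i = 0; i < n; i++)
        printf("U+%04X..U+%04X\n", (unsigned)neg[i].lo, (unsigned)neg[i].hi);
    return 0;
}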

The architectural problem is mainly an issue of complexity: you end up
building a largely useless UCS range map that, if you had only gone for
UTF-8 encoding in the first place, would have been completely unnecessary.
At the moment, I've got the wasted lines of code to prove it, but don't
worry, they are going away.

*NFA vs DFA vs Bytecode*

Russ does a very nice job of describing a bytecode virtual machine that he
characterizes as a "dual" of a conventional NFA. As Russ puts it: "We can
view the bytecode as an encoding of the NFA graphs into machine
instructions, or we can view the NFA graphs as the control-flow graph of
the bytecodes." From my very first reading his bytecode
machine <http://swtch.com/~rsc/regexp/regexp2.html> bothered me, but it
took me until this morning to pin down
*why* that was so. The answer is: it doesn't encode NFAs!

Formally, the distinction between an NFA and a DFA is that (1) an NFA
may have epsilon transitions where a DFA may not, and (2) an NFA may have
multiple transitions for a given input where a DFA may not. In practice,
case [2] is usually avoided by construction. When the distinction is
expressed this way, it is fairly immediate to see that any NFA in which (1)
there are no epsilon transitions and (2) every input character at a given
state has a unique outbound transition is in fact a DFA.  That is: every
DFA is an NFA.

The problem with Russ's bytecode machine lies in the SPLIT instruction,
which is (in essence) an encoding of epsilon transitions, and in the CHAR
instruction, which does not have a condition code result. So if you want to
implement (a|b) you are obliged to use SPLIT to implement parallel
execution threads and then rely on CHAR to cause one side or the other to
fail and silently die. This captures the intuition of greedy vs. eager
matching nicely, but it *doesn't* capture the notion of deterministic
choice. Given those byte codes, there is no way to implement a state S {
'a' -> S1, 'b' -> S2} *without* relying on the SPLIT instruction. In
consequence, there is no way to efficiently encode a DFA in this bytecode
language.

I recognized this limitation right from the beginning. Because I was
focused on arriving at a *deterministic* machine, I rejected the bytecode
approach rather than considering how it might be improved. The irony is
that expressing the problem as an NFA problem and then converting to DFA
form really is the best way to proceed. But if the
bytecode language doesn't admit this you are led away from the bytecode
approach. In consequence, you are led away from an incremental
implementation approach in which an initial, simple, slow, backtracking NFA
implementation can later be replaced by a more efficient DFA
implementation. From a development perspective this is expensive. If you
are as dumb as I was, it costs about two months in developer time. :-)

*Redesigning the Bytecode VM*

A modest change to Russ's VM bytecodes would allow it to encode DFAs.
Change the CHAR opcode from

char C: If input character is not 'C' stop current thread, else advance to
next instruction


to

char C pc: If input character is C advance to /pc/, else advance to next
instruction

fail: halt the current matching thread
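
To see why this small change matters, here is a quick sketch of the revised
machine running as a single deterministic thread (this is a hypothetical C
encoding; I have borrowed MATCH from Russ's machine so that a program can
report success). The example program encodes exactly the state
S { 'a' -> S1, 'b' -> S2 } that I claimed above cannot be expressed without
SPLIT:

#include <stdio.h>

enum op { OP_CHAR, OP_FAIL, OP_MATCH };

typedef struct {
    enum op op;
    char    c;    /* for OP_CHAR: character to test            */
    int     pc;   /* for OP_CHAR: target if the test succeeds  */
} Insn;

/* Returns 1 if the program accepts a prefix of 'input', else 0. */
static int run(const Insn *prog, const char *input)
{
    int pc = 0;
    for (;;) {
        switch (prog[pc].op) {
        case OP_CHAR:
            if (*input == prog[pc].c) { input++; pc = prog[pc].pc; }
            else                      { pc++; }
            break;
        case OP_FAIL:
            return 0;
        case OP_MATCH:
            return 1;
        }
    }
}

int main(void)
{
    static const Insn prog[] = {
        { OP_CHAR,  'a', 3 },   /* 0: S  -- on 'a' go to S1   */
        { OP_CHAR,  'b', 4 },   /* 1:       on 'b' go to S2   */
        { OP_FAIL,  0,   0 },   /* 2:       otherwise fail    */
        { OP_MATCH, 0,   0 },   /* 3: S1                      */
        { OP_MATCH, 0,   0 },   /* 4: S2                      */
    };
    printf("%d %d %d\n", run(prog, "a"), run(prog, "b"), run(prog, "c"));  /* 1 1 0 */
    return 0;
}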


This would work, but as a practical matter there are two input patterns
that tend to dominate the matching problem: sequences and sets. It's
therefore tempting to change the bytecode machine in a different way:

CLEAR: Clear the "have a match" bit
CHAR c: Set the match bit if (and only if) the character 'c' appears in the
input
ONMATCH pc: Transition to 'pc' if the match status bit is set, else advance
the PC.

FAIL:  terminate the current thread


The two approaches are functionally equivalent. The second admits a denser
encoding, but requires more context information when converting from NFA to
DFA. On the plus side, it admits the possibility of a richer set of
matching instructions for set matching, notably RANGE, ODD, and EVEN, which
encode some peculiarities of Unicode character classes efficiently (ODD and
EVEN are aimed at blocks where upper- and lower-case letters alternate
between even and odd code points).

My other metacomment about Russ's machine is that PC addresses should be
encoded using relative indices, so that subprograms can be copied without
relocation.

The advantage of the second encoding is that it is inherently denser. The
disadvantage, as noted above, is that DFA conversion becomes harder. Given:

SPLIT L1, L2
L1: CHAR 'a' L3
    FAIL
L2: CHAR 'a' L4
    FAIL


it is obvious that the two matches of 'a' cannot be coalesced. This is
harder to see if the code is:

CLEAR
SPLIT L1, L2
L1: CHAR 'a'
    ONMATCH L3
    FAIL
L2: CHAR 'a'
    ONMATCH L4
    FAIL


It gets a bit easier to process if we define the bytecode system in such a
way that every ONMATCH sequence must be terminated by either a FAIL or a
JMP.

*Summary*

Because I didn't process the lessons of Russ's work properly, I'm about to
throw away two months of hard work. This is more than a little annoying,
but at least I feel like I know where I went wrong. :-)


So that's my idiocy of the day. Share and enjoy. If anybody has a different
approach to suggest for a revised bytecode scheme, please step forward. I'm
not terribly interested in improvements to CHAR. There are *lots* of those.
The part that's mainly of interest at the moment is improvements to SPLIT
and ONMATCH that facilitate conversion from bytecode-encoded NFA to
bytecode-encoded DFA.


shap