On Mon, Apr 21, 2014 at 7:51 PM, William ML Leslie <
[email protected]> wrote:

> This doesn't compile down to C or equivalent yet (I also didn't bother
> with bytecode - I don't think tokenising needs to take into account
> mixfix, I would rather handle it in the lexer)
>

Use of bytecode for this has nothing to do with mixfix.

The generator (not the tokenizer) only needs to take mixfix into account in
a *very* limited way: it needs to record which REs correspond to identifier
classes from which mixfix tokens are drawn. Also, it helps if the generated
lexer has a way to push and pop keyword tables.

The only impact on the NFA/DFA is something you want to do anyway: notice
that keywords are drawn from identifier spaces and implement an exception
table.

So the NFA/DFA construction is unchanged, but it gets run twice to optimize
keyword recognition.
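
To make that concrete, here is a rough Python sketch of the shape I have in
mind; the names (Lexer, classify, push_keywords) are invented for
illustration and are not the actual generator code. The DFA only knows the
identifier REs; keyword recognition is an exception-table lookup on the
matched lexeme against whatever keyword table is currently on top of the
stack:

    # Illustrative sketch only; all names here are invented.
    class Lexer:
        def __init__(self, dfa):
            self.dfa = dfa                 # recognizes identifier REs, among others
            self.keyword_tables = [{}]     # stack of { lexeme : token kind }

        # The generated lexer exposes push/pop so the parser can install
        # keyword (e.g. mixfix) tables for the current scope.
        def push_keywords(self, table):
            self.keyword_tables.append(dict(table))

        def pop_keywords(self):
            self.keyword_tables.pop()

        def classify(self, lexeme, re_kind):
            # Exception table: the DFA reports "identifier"; if the lexeme
            # is a keyword in an active table, reclassify it after the match.
            if re_kind == "identifier":
                for table in reversed(self.keyword_tables):
                    if lexeme in table:
                        return table[lexeme]
            return re_kind

The point is that the keywords never appear in the DFA at all; they are just
entries in the table that an identifier match gets checked against.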


> > Code Points vs UTF-8 Strings
> >
> > If you match code points, you soon conclude that you need a way to define
> > sparse sets of transitions. The underlying problem is that most of the
> > Unicode character classes are not densely encoded.
>
> I think I still disagree - a precomputed byte-array that can represent
> eight properties that you might test is only about 12 kB, and once you
> have a codepoint, you can simply index and mask.


I think your math is wrong.

The Unicode codepoint space needs a 21-bit index (codepoints run up to
U+10FFFF). A simple bitmap for a *single* property therefore occupies 2^18
bytes, or 256 kilobytes. This can be optimized some: you don't need to
store the trailing zeros in the bitmap.
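
Spelling that arithmetic out: 2^21 bits / 8 bits per byte = 2^18 = 262,144
bytes for one property. Even packing eight one-bit properties into a single
byte per codepoint costs 2^21 bytes, about 2 MB (and still over a megabyte
if you size the array to the actual 0x110000 codepoints), which is nowhere
near 12 kB.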

Trust me: a table of ranges is *much* smaller than a precomputed byte array
for this. I need to get the code up where people can see it.
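
For concreteness, here is a small Python sketch of the range-table
representation (the ranges below are a made-up fragment, not the real \p{L}
data): the class is stored as sorted, disjoint, inclusive ranges and
membership is a binary search.

    import bisect

    # Made-up fragment of letter ranges; the real \p{L} table is a few
    # hundred entries, each an inclusive (lo, hi) codepoint pair, sorted.
    LETTER_RANGES = [
        (0x0041, 0x005A),   # A-Z
        (0x0061, 0x007A),   # a-z
        (0x00C0, 0x00D6),   # a slice of Latin-1 letters
        (0x0391, 0x03A9),   # Greek capitals
    ]
    _LOS = [lo for lo, _ in LETTER_RANGES]

    def in_class(cp, ranges=LETTER_RANGES, los=_LOS):
        # Find the last range whose lower bound is <= cp, then check its
        # upper bound.
        i = bisect.bisect_right(los, cp) - 1
        return i >= 0 and cp <= ranges[i][1]

A few hundred ranges at a few bytes each comes to a handful of kilobytes,
versus 256 kilobytes per property for the bitmap, and the same ranges feed
directly into the NFA transitions.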


>  There is one other
> thing you need to keep track of when attempting to match several regex
> at once:  You need to check existing DFA branches because those that
> match a specific character and those that match something like a class
> can overlap.  But you already needed to do most of that work anyway.
>

In an NFA this works. In a DFA it doesn't, because a DFA requires that each
character transition to exactly one successor state. So the process of
converting to a DFA causes the character classes to get expanded.

To see why this is necessary, consider the following set of tokens to be
matched:

"ab"      -> id1
"\p{L}"  -> id2


The L class is the class of Unicode letter characters. The transition on
'a' really needs to go to a different state than the transition on \p{L}.
This means that the DFA converter actually needs to be able to unpack
Unicode classes. At the moment, I'm not even preserving them in the NFA. I
go ahead and expand them into their constituent subranges. This makes
character class negation easier in any case.
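
To show the expansion step concretely, here is a hedged Python sketch
(split_ranges is my name for it, not code from the actual converter): given
the possibly-overlapping codepoint ranges on the outgoing NFA transitions,
cut them into disjoint subranges so that each DFA transition covers a piece
that lies entirely inside or entirely outside every original class.

    # Sketch only; split_ranges is an invented name for the expansion step.
    def split_ranges(ranges):
        # ranges: inclusive (lo, hi) codepoint pairs that may overlap.
        # Returns disjoint subranges cut at every input boundary, so each
        # original range is exactly a union of the returned pieces.
        cuts = set()
        for lo, hi in ranges:
            cuts.add(lo)
            cuts.add(hi + 1)            # cut just past the inclusive end
        cuts = sorted(cuts)
        pieces = []
        for a, b in zip(cuts, cuts[1:]):
            if any(lo <= a and b - 1 <= hi for lo, hi in ranges):
                pieces.append((a, b - 1))
        return pieces

    # For the 'a' vs. \p{L} case, using just the a-z slice of \p{L}:
    #   split_ranges([(0x61, 0x61), (0x61, 0x7A)])
    #   == [(0x61, 0x61), (0x62, 0x7A)]
    # so 'a' gets its own transition (leading toward both id1 and id2)
    # while the rest of the class leads only toward id2.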


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
