Re: [bitc-dev] Unicode RegExp Hell

William ML Leslie Tue, 22 Apr 2014 22:40:18 -0700

On 22 April 2014 23:55, Jonathan S. Shapiro <[email protected]> wrote:
> On Mon, Apr 21, 2014 at 7:51 PM, William ML Leslie
> <[email protected]> wrote:
>>
>> This doesn't compile down to C or equivalent yet (I also didn't bother
>> with bytecode - I don't think tokenising needs to take into account
>> mixfix, I would rather handle it in the lexer)
>
>
> Use of bytecode for this has nothing to do with mixfix.


Ok.

>> > Code Points vs UTF-8 Strings
>> >
>> > If you match code points, you soon conclude that you need a way to
>> > define
>> > sparse sets of transitions. The underlying problem is that most of the
>> > Unicode character classes are not densely encoded.
>>
>> I think I still disagree - a precomputed byte-array that can represent
>> eight properties that you might test is only about 12 kB, and once you
>> have a codepoint, you can simply index and mask.
>
>
> I think your math is wrong.
>
> The Unicode codepoint space has 2^21 elements. A simple bitmap for a single
> property therefore occupies 2^18 bytes, or 256 kilobytes. This can be
> optimized some: you don't need to implement trailing zeros in the bitmap.

Oh, indeed. 12kb doesn't even encode a single plane.  I don't remember
what I was doing when I came up with that figure.

> Trust me: a table of ranges is much smaller than a precomputed byte array
> for this. I need to get the code up where people can see it.

Or maybe a BDD?

>>  There is one other
>> thing you need to keep track of when attempting to match several regex
>> at once:  You need to check existing DFA branches because those that
>> match a specific character and those that match something like a class
>> can overlap.  But you already needed to do most of that work anyway.
>
>
> In an NFA this works. In a DFA it doesn't, because a DFA requires that every
> character transition uniquely. So the process of converting to a DFA causes
> the character classes to get expanded.
>
> To see why this is necessary, consider the following set of tokens to be
> matched:
>
> "ab"      -> id1
> "\p{L}"  -> id2
>
>
> The L class is the class of lowercase unicode characters. The transition on
> 'a' really needs to go to a different state than the transition on \p{L}.
> This means that the DFA converter actually needs to be able to unpack
> unicode classes. At the moment, I'm not even preserving them in the NFA. I
> go ahead and expand them into their constituent subranges. This makes
> character class negation easier in any case.

I had been expanding them only for the ascii subset, as the only
matches I was using outside that range were always in the form of a
class.  Otherwise, given N possibly overlapping matches, I built the
continuation once for each subset of the target states.

-- 
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered
under copyright law.  You absolutely may reproduce any part of it in
accordance with the copyright law of the nation you are reading this
in.  Any attempt to deny you those rights would be illegal without
prior contractual agreement.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode RegExp Hell

Reply via email to