Re: [bitc-dev] Unicode RegExp Hell

William ML Leslie Mon, 21 Apr 2014 19:53:18 -0700

On 17 April 2014 15:56, Jonathan S. Shapiro <[email protected]> wrote:
> I have a confession to make: I'm an idiot.


There is way too much clever going around, especially in language and UI design.

> The ability to match multiple REs simultaneously (in parallel) and return a
> preset ID indicating which one matched. In effect: to return a token ID in
> addition to the string matched. This is a "generic" capability.

I have been thinking about this thing, too:

https://bitbucket.org/william_ml_leslie/match/src/tip/aurochs/test/test_tokeniser.py?at=default

I figured I could just maintain the order that the NFA nodes were
visited.  For this tokeniser, this is definition order - so if you
define "do" before "[a-zA-Z_][a-zA-Z0-9_]*", it will match the "do"
token first, so that is the token "id" it emits.  When desugaring from
a grammar in BNF form, I could arrange for the tokeniser to consider
the left- or right-recursive definitions and attempt to work along
with the lexer.

This doesn't compile down to C or equivalent yet (I also didn't bother
with bytecode - I don't think tokenising needs to take into account
mixfix, I would rather handle it in the lexer) but I just wanted to
show you how I thought this problem might be tackled.

> Code Points vs UTF-8 Strings
>
> If you match code points, you soon conclude that you need a way to define
> sparse sets of transitions. The underlying problem is that most of the
> Unicode character classes are not densely encoded. A naively encoded set of
> the form { (codepoint, state *) ... } requires prohibitive space. So the
> first thing you do is try to come up with a clever notion of a UCS range set
> and a UCS range map that maintains dense encodings where possible. That
> isn't as simple as it sounds. In fact, it's frought with opportunities to go
> wrong.

I think I still disagree - a precomputed byte-array that can represent
eight properties that you might test is only about 12 kB, and once you
have a codepoint, you can simply index and mask.  There is one other
thing you need to keep track of when attempting to match several regex
at once:  You need to check existing DFA branches because those that
match a specific character and those that match something like a class
can overlap.  But you already needed to do most of that work anyway.

I guess that's something to try some other day, once I have this
compiling down to C.

-- 
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered
under copyright law.  You absolutely may reproduce any part of it in
accordance with the copyright law of the nation you are reading this
in.  Any attempt to deny you those rights would be illegal without
prior contractual agreement.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode RegExp Hell

Reply via email to