Re: [bitc-dev] Unicode RegExp Hell

William ML Leslie Wed, 30 Apr 2014 22:14:27 -0700

On 29 April 2014 03:25, Jonathan S. Shapiro <[email protected]> wrote:
>
> You can do this with character class bytecodes, but only if they are
> non-consuming. If you have non-consuming match bytecodes, then you can
> encode something like
>
> if (peek() IN \p{Ll}) AND (peek() IN \p{Greek} THEN MoveToState s
>
>
> and believe it or not, this actually makes sense as a practical
> implementation in a real lexer. If you were going to code this by hand, you
> certainly wouldn't go and expand out the character classes looking for an
> optimization (though you might do special cases).
>
> And notice that shifting the burden of consumption to the MoveToState
> operation actually can be viewed as making sense in a DFA. the act of
> matching can be viewed as happening in the node, or it can be viewed as
> happening in the outbound transition. Since the two happen in lockstep,
> either one is a perfectly good way of handling things.
>
> But this tends to point us in an awkward direction: the bytecode machine for
> the DFA now needs non-consuming matching operators. That either means I need
> to go back and rethink the bytecode machine for NFAs or I need two different
> bytecode machines for the two different classes of machine.


This is already true in several ways, some nasty, some obvious.

The simplest observation we can have about this is that sometimes you have, say

abcd ::= "a[bc]*d"
abb ::= "ab*"

While attempting to match the string "abcbcbcq", you need to read
until the q to notice that abcd does not match.  However, whatever
means you're using to search for abb needs not to go past the first c.
 So, although the engine somehow read up to the q, tokenisation must
continue from before the c.  So features like forking the stream
actually already need to exist.

>From a practical perspective, I have to handle cases like "foo(",
where the ( needs to be emitted as its own token.  I've defined an
outbound action on the identifier definition - effectively saying, if
the next state is one of the valid token breaks, we should emit the
identifier token.  However, I do not want the paren within that token.

I have been building a very nice and descriptive example of how this
can be described and handled, but I've got some failing tests that I
need to fix first.  Once I hammer down the details of the problem,
I'll let you know what they were.

> * Duff's Device
>
> Ivan Goddard recently reminded me in private email that high-performance
> lexical analyzer generators tend to process the input in words rather than
> bytes, using a variant of Duff's Device to copy the input bytes. For
> languages that use ASCII input. In effect, the search process is "unrolled"
> and word comparisons are used rather than byte comparisons. The current
> engine that I have doesn't even try to do this, but it isn't hard to see how
> it might be done.

Interesting.  I can easily see speculation done with SIMD or so, but
maybe people are doing even smarter things than that?

-- 
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered
under copyright law.  You absolutely may reproduce any part of it in
accordance with the copyright law of the nation you are reading this
in.  Any attempt to deny you those rights would be illegal without
prior contractual agreement.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode RegExp Hell

Reply via email to