[REBOL] Regular Expressions Re:(2)

ole_f Sun, 9 Jan 2000 13:29:08 -0800
Hi Eric, 10-Jan-2000 you wrote:

>The hardest thing for me was implementing quantifiers. What I'm doing
>with search-text.r is "compiling" an expression like

>   [bound "t" 3 7 alpha "ing" bound]

>to a form like:

>   [bound "t" [4 6 3 7] make bitset! #{...} [3] "ing" bound]

>in which the blocks are nodes which tell:

>one or two elements to jump to
>the minimum number of times the first jump must be made
>the maximum number of times the first jump can be made

>So we have to go through the cycle of elements 3/4/5

>             -------------------------------
>            |                               |
>            V                               |
>"t"(2) ->  node(3) -> bitset(4) -> node(5) -
>            |
>             -------------------------------------> "ing"(6)

>three times before we even think about jumping to the "ing" element.
>What I'm doing is saving the number of times node(3) has been passed
>in another block in a position tied to the "ing" element. When there
>is a failure to match somewhere in the cycle, or when the cycle has
>been done the maximum number of times, it's important to erase the
>information on how many times node(3) has been passed, in case one loop
>is nested inside another loop and there's a chance of coming back to
>it.

>Now with the NFA engine, saving a state means you only have to push
>onto a stack the index of a jump you could have taken, plus the current
>index of the text string. But with a DFA engine, you wouldn't need the
>text string index (because once you deal with a character in the text
>stream it's done once and for all) - is that right? On the other hand,

That's right.

>every time you come to a place where two jumps are possible you have
>to start another thread so that both possible pattern indexes can be
>compared against the next character in the text stream.

You mean, with DFA's? In that case, no. Though, it's something like that you
need to do when you construct the DFA from the NFA.

>What seems especially complicated here is that you have to save the
>information on which node has been passed how many times with each
>thread. Complicated yes, but once a thread dies, it should be easier
>to delete all the information associated with it.

Once you have a DFA, you don't need to store such information. What you have
is one big system with states (nodes) and transitions (arrows between the
states, each associated with a character). You only need to remember which
state you are in.

Let's take a simple example: A DFA recognizing all binary numbers that are a
multiplum of 3 (11 binary ;-) ). State 1 is an accepting state:

      0
     |--|
     V  |
 -->State 1 (accepting)
     | ^
    1| |1
     V |
    State 2
     | ^
    0| |0
     V |
    State 3
     | ^
     |-|
      1

I hope it's clear what it says:
>From state 1: "0" moves to state 1, "1" moves to state 2.
>From state 2: "0" moves to state 3, "1" moves to state 1.
>From state 3: "0" moves to state 2, "1" moves to state 3.

We start in state 1 (that's why there's the arrow coming from the left), and
when we have a string, for example "10101" (which is 21 in decimal), you just
eat the characters from left to right, following the transitions. Since it is
a DFA (the D stands for deterministic, you know ;-) ), we will never have
more than one possible move (when we have no move for a given input
character, we reject the whole input string immediately). When we are done
with the whole string, we see whether or not we are in an accepting state. If
so, our DFA has recognized the input string. If not, our DFA has rejected the
input string.

Simple as that. As you can see, once we have built a DFA for a corresponding
regular expression, matching strings is _very_ fast and requires no
recursion, no stack, or whatever.

In fact, the only differences between a DFA are these:

1: An NFA may have more than one possible transition for certain input
characters in each state.
2: An NFA may have "the empty string" associated with a transition.

But the beauty is that given any NFA, you can construct a corresponding DFA.

>Well, that's the idea anyway. If you would take a look at the code and
>tell me any ideas you have, or if you could suggest a reference on
>DFA machines, I'd really appreciate it.

I'll try to find some electronic material on DFA's tomorrow (when I have free
bandwith from University ;-) ).

Kind regards,
--
Ole Friis <[EMAIL PROTECTED]>

"Ignorance is bliss"
(Cypher, The Matrix)
[REBOL] Regular Expressions Re:(2)

Reply via email to