[REBOL] Regular Expressions Re:

KGD03011 Tue, 4 Jan 2000 20:19:52 -0800

Hi Keith,

Thanks for looking at search-text.r!

You wrote:

>About the backtracking. I'm not sure how essential backtracking is, because
>I think we can simulate a bunch of what backtracking does with things like
>"any" and "some" rules for 'parse. I don't know all that much about
>backtracking though, so maybe it's a lot more essential than I think. The
>thing is that I don't think Perl itself had backtracking until version 5,
>and I think there have been regex libraries which have at least been useful
>which haven't had full support for backtracking.

I think backtracking is an essential part of regular expressions. To quote
from Mastering Regular Expressions by Jeffrey Friedl (O'Reilly, p. 102),

    The essence of an NFA engine [the regex matching engine of the kind
    that Perl uses] is this: it considers each subexpression or component
    in turn, and whenever it needs to decide between two equally viable
    options, it selects one and remembers the other to return to later if
    need be. ....

Suppose you want to find a word ending in "ing". In Perl you could write:

   /\b[a-z_]+ing\b/i

Backtracking is crucial here, because "[a-z_]+" first matches to the end of
the word, and if you didn't backtrack, "ing" would never match. With a parse
rule, you have to do something like this:

    pp: copy []
    alpha: make bitset!  [#"a" - #"z" #"A" - #"Z" #"_"]
    non-alpha: complement alpha

    parse/all string [ some [
        p: some [ "ing" [non-alpha | end ] (append pp p) | alpha ] |
        some non-alpha
    ]]

This will leave you with the block PP holding pointers to the beginning of
all the words ending in "ing". It isn't quite equivalent to the Perl regex,
but it's close enough. Here backtracking is accomplished in the following
way: Suppose you were scanning through the word "ringing". The string "ing"
would match "ringing" first, and then attempt to match a non-alphabet
              ~~~
character or the string end. That wouldn't work, but the following | says
that we should try to jump back to the second letter of "ringing" and just
match any alphabet character, and keep going through the word. This is the
kind of thing that regexes do for you automatically.

Now for doing things like parsing an HTML file, backtracking is a nuisance.
If you want to match a quoted string, which may have escaped quote characters
in the middle, it's very hard to write regexes that won't find some way of
jumping from one quoted string to another and matching too much. Regexes want
to match as much as possible, so you have to fight with them sometimes to
keep them from matching too much. With parse rules, you can just look up the
grammar of the type of file you're trying to parse, and build up your parse
rules from the simplest units to the most complex structures. This can be
done recursively, which makes matching nested parens and such possible. So
that's definitely the way to go.

But if you want to search for a pattern in text, such as {a word begining
with "t" and ending with "gh"}, or {the words "straw" and "camel" not
separated by any punctuation}, and especially if you want to do that
interactively without stopping to think how you're going to program it, the
parse rules that would be needed are much too complex. This is the problem
I've been trying to tackle with search-text.r.

I really like the grammar of parse rules, so I've tried to keep as much of it
as possible. It's not as compact, but it would be trivial to rewrite the
script to provide abbreviations such as * for ANY. It would be pretty easy
even to provide automatic translation of character classes such as [a-z] into
bitsets.

I'm still thinking about what the best form of the return value should be.
For the time being I'm copying AWK, and returning a block with the equivalent
of RSTART - the index of the beginning of the match - and RLENGTH - the
length of the match. Then I decided just for the hell of it to throw in a
copy of the match to make it easier to see what was going on.

The big problem is speed. When processing the expression given, an optimized
function is created that tries to quickly identify potential matches.
Unfortunately for an expression that begins with ANY there's no simple way to
do this, so the main matching function is called for every character in the
string. Needless to say that is useless for most purposes, but in interactive
searching I don't think most people would use such an expression. Replacing
text is another matter, but then that won't be much use until I get text
capture of portions of the expression implemented.

See you around,
Eric
[REBOL] Regular Expressions Re:

Reply via email to