[REBOL] Regular Expressions Re:

KGD03011 Thu, 6 Jan 2000 18:12:21 -0800

Hi Ole, Petr,

Thank you very much for your comments.

Ole, you suggested this to find a word ending in "ing":

>> a: [skip a | "ing"]
== [skip a | "ing"]
>> parse "ringin" a
== false
>> parse "ringing" a
== true

That's fine, except it won't detect a word ending in "ing" in the
middle of the string:

>> parse "ringing bells" a
== false

Actually, I think your suggestion is equivalent to the rule:

    b: [to "ing"]

Also, recursion is not so good for very long strings:

>> str: "ringing"
== "ringing"
>> parse str a
== true                             ; OK
>> insert str "- - - - -"
== "ringing"
>> parse str a
== true                             ; still OK
>> insert/dup str "- - - - -" 10
== "- - - - -ringing"
>> parse str a
== true                             ; still OK
>> insert/dup str "- - - - -" 100
== {- - - - -- - - - -- - - - ...
>> parse str a
== false                            ; stops working

I expected a stack overflow error, but it just stopped working. Bug?

Anyway, it's best to use SOME. The way I suggested before, gets
pointers to the beginning of all the words:

pp: copy []
alpha: make bitset!  [#"a" - #"z" #"A" - #"Z" #"_"]
non-alpha: complement alpha

    pp: copy []
    alpha: make bitset!  [#"a" - #"z" #"A" - #"Z" #"_"]
    non-alpha: complement alpha

    parse/all string [ some [
        p: some [ "ing" [non-alpha | end ] (append pp p) | alpha ] |
        some non-alpha
    ]]

This could have been done with a faster rule:

>> str: "I'm singing in the rain"
== "I'm singing in the rain"
>> rule: [some [ thru "ing" p: [[non-alpha | end ] (append pp p) | none ]]]
== [some [thru "ing" p: [[non-alpha | end] (append pp p) | none]]]
>> parse/all str rule
== false
>> pp
== [" in the rain"]

But it's still not very simple. (Note - PARSE doesn't have to return
TRUE for it to do what you want - and if the location of spaces is
relevant to what you're looking for remember to use the /ALL
refinement - keep forgetting that!)

Here's a way to find "straw" and "camel" not separated by punctuation:

    rule: [some [
        to "straw" p: "straw" some ["camel" to end (print p) | non-punc]
    ]]

>> parse "This is the straw that broke the camel's back" rule
straw that broke the camel's back
== true
>> parse "I need some straw, and a camel" rule
== false

This is also very complex with PARSE, but trivial with regular
expressions.

Ole wrote:

>Converting "(string)*" and "(string)+", for example, into this:
>
>some-string: ["string" some-string | "string"]
>any-string: ["string" any-string | none]
>
>should work, shouldn't it? But yes, I must admit, too, that 'any and 'some
>ought to do some backtracking - in fact, I believed they did, so please
>ignore the mail I sent a couple of days ago on how to easily build Parse
>blocks from regular expressions ;-)
>
>Rebol Tech., isn't the lack of backtracking in 'any and 'some a bug?

I'm not sure what the advantage of your suggestion would be over
converting "(string)*" to [any "string"] and "(string)+" to [some
"string"], but at any rate I'm sure the lack of backtracking is not a
bug. (I've corresponded with Bo on this, and he's taught me a lot.)
The trouble is, backtracking can be very hard to control. For
example, if you want to parse out a quoted string, in which the " can
be escaped, but the escape character \ can itself be escaped, it's
hard to write a regex that won't find some way of breaking through
the final " and go right on to the next quoted string. And once you
do get an expression that should work, depending on the type of
engine used it can take a long time to match.

This is relatively straightforward with PARSE:

>> escape: [{\} skip]
== ["\" skip]
>> not-quote: complement make bitset! {"}
== make bitset! #{
FFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
}
>> quoted-str: [{"} any [escape | not-quote] {"}]
== [{"} any [escape | not-quote] {"}]
>> find-quote: [some [copy p quoted-str (print p) to end | skip]]
== [some [copy p quoted-str (print p) to end | skip]]

>> parse/all {this is a "string with a quote"!} find-quote
"string with a quote"
== true
>> parse/all {this is a "string with an \" escaped quote"!} find-quote
"string with an \" escaped quote"
== true
>> parse/all {this is a "string with an \" escaped quote" and
{    "yet another quote"!} find-quote
"string with an \" escaped quote"
== true

The rule QUOTED-STR needs to be adjusted of course, depending on what
you want to do, and there are always problems with how to deal with
illegal strings (should you allow newlines in the middle or not etc).
Still, parse rules make this kind of work easy, and backtracking can
only be a complicating factor.

I've suggested to feedback that there be a backtracking option for
PARSE, or some kind of alternative, but obviously they've got more
important things to do right now. So, to keep myself amused, and
because I do things like searching for patterns of words in long
texts a lot, I wrote a regex simulator in REBOL (search-text.r).

>BTW, by building a DFA for a regular expression, you can get rid of
>the backtracking. However, I don't know how fast a program can build
>a DFA, it might not be that fast... but if you try to match one
>regular expression to a lot of strings, it might be worth a go. Keep
>in mind, though, that then you'll only be able to get a "matched" or
>"didn't match" answer from your expression matching, it will probably
>be very difficult to implement copying of the text inside
>subexpressions etc.

I've never read anything on how DFA's actually work, but it sounds real
complicated. Search-text.r implements an NFA, of course, but even then I
haven't figured out to copy text matching portions of the expression. If
anyone would give me some hints, I'd be grateful.

Cheers,
Eric
[REBOL] Regular Expressions Re:

Reply via email to