>Hi Petr,
>
>I'm also interested in a good challenge - and I happen to be looking for an
>opportunity to practice parse rules.

Hi Elan,

I wonder if I could butt in here. I responded to Petr's message, but I think
the example I used was so complicated that everyone must have hit the delete
button. Here's the same challenge in a very simple form: Find a word (in the
conversational sense, not in the REBOL sense) beginning with "t" and ending
with "t" in the string:

  "this is a testament to be searched"

with a single call to PARSE. This is trivial to do with a Perl regular
expression:

  "this is a testament to be searched" =~ /\bt[a-z]*t\b/i;
  print $&;

prints out "testament". It's also very easy if you call PARSE twice, but even
then it's not so simple to get an index into the original string. It's pretty
hard to do with a single PARSE rule, and it would be even harder to generate
a PARSE rule given a simple expression structurally equivalent to the Perl
regex (but in an easier-to-read REBOListic form). I had dreams of doing this
- one of my first posts to this list was about this topic - but I decided it
was too difficult for me - especially if I wanted to generate equivalents of
even more complex regexes.
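
For comparison, here is a minimal sketch of the index-and-length bookkeeping
that a regex engine gives you for free. It uses Python rather than Perl or
REBOL (my substitution, not part of the original exchange), and the names
`text`, `match`, `index` and `length` are only illustrative. The index is
reported 1-based so that it lines up with REBOL string positions.

```python
import re

text = "this is a testament to be searched"

# Same pattern as the Perl example: word boundary, "t", letters, "t", word boundary.
match = re.search(r"\bt[a-z]*t\b", text, re.IGNORECASE)

# match.start() is 0-based; add 1 to get a REBOL-style 1-based index.
index = match.start() + 1
length = match.end() - match.start()

print(match.group())    # testament
print([index, length])  # [11, 9]
```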


So, what I'd really like to have is a REBOL native that would work this way:

   >> a: "this is a testament to be searched"
   >> alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
   >> search a [ word-boundary "t" some alpha "t" word-boundary ]
   == [11 9]                         ; simulated output

returning the index and length of the portion of A that matched. I've got a
function that almost does that, only I don't have 'word-boundary implemented.

>> search a [ "t" any alpha "t" ]
== [11 9]                            ; actual output

This will match even when the "t"s are not at word boundaries, so I really
have to do this:

>> non-alpha: complement alpha
== make bitset! #{
FFFFFFFFFFFFFFFF010000F8010000F8FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
}
>> search a [ [head | non-alpha] "t" any alpha "t" [head | non-alpha] ]
== [10 11]

and here you get problems of extraneous non-alpha characters being included
in the match.
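
The difference between a boundary that consumes a character and one that only
tests its neighbours can be sketched as follows. This is again Python (an
assumption on my part, since SEARCH itself is not shown here), and the helper
`span` is hypothetical. The first pattern imitates the non-alpha workaround in
spirit, and the second uses zero-width lookarounds, which is roughly what a
real 'word-boundary would have to do.

```python
import re

text = "this is a testament to be searched"

def span(match):
    """Return [index, length] with a 1-based index, like the SEARCH output."""
    return [match.start() + 1, len(match.group())]

# Boundaries that consume a character: the neighbouring non-letters
# become part of the match.
consuming = re.search(r"(?:^|[^a-z])t[a-z]*t(?:$|[^a-z])", text, re.IGNORECASE)

# Zero-width boundaries: lookarounds check the neighbours without consuming them.
zero_width = re.search(r"(?<![a-z])t[a-z]*t(?![a-z])", text, re.IGNORECASE)

print(span(consuming))   # [10, 11]
print(span(zero_width))  # [11, 9]
```

The consuming version reproduces the [10 11] result above, because the spaces
on either side of "testament" are included in the match.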


Anyway, if you're in for a real challenge, you could try to make sense of
the program I wrote, refine it, and optimize it, since it's very slow
unless the first few characters of the match are fixed. It's got a half-assed
implementation of a finite-state machine, loosely based on Chapter 20 of
Algorithms in C++ by Robert Sedgewick.

Eric


PS - If I might add a couple of impressionistic comments on PARSE versus
regular expressions: when I was developing SEARCH I reread Mastering Regular
Expressions by Jeffrey Friedl. I was amazed at how simply many of the
contortions he describes can be done with PARSE. PARSE is wonderful for,
well, parsing data that obeys strict rules about the meaning of characters.
What PARSE is weak at, though, is finding patterns in natural language data.
Ironically, this is what regular expressions are good at.

Well, I believe that the current strengths of PARSE are crucial for REBOL's
immediate survival and success, but I'm hoping that someday we'll have the
best of both worlds.
