Re: [R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Wacek Kusnierczyk Sat, 13 Dec 2008 16:29:34 -0800

Greg Snow wrote:
> Wacek,
>
>  I am curious as to why Brian and I (and possibly other responders) are held 
> to a higher standard than the original poster.
>   
(we have just had an offline communication, it should be fine to keep it
that way)


> My first question was a real question.  There are 2 main ways to do regular 
> expression matching (possibly others as well), you describe one method with 
> the pointer moving through the string to be matched, the other way moves the 
> pointer through the pattern looking for all possible matches in the string 
> that match so far (one method is DFA the other NFA, but I don't remember 
> which is which and my copy of Friedl is at work and I'm not).  Perl and PCRE 
> use the method that you describe, but the other method may be more efficient 
> for finding all overlapping matches in some cases, but as far as I know (and 
> there are plenty of things I don't know, including all the changes/new 
> programs since I last read about this) the only programs that use the other 
> type of matching don't return the actual matches, just a yes/no on at least 
> one match.  So if the original poster has formed his opinion based on another 
> program that uses that type of matching, I would be interested to know about 
> i!
 t.  It would teach me something and also make it easier for all of us to 
better help the original poster in the future if we understand where he is 
coming from.

that's right, there are the dfa and the nfa approaches (and more), and
the latter is the regex-driven approach of perl and many other
implementations.  they work differently, and also differ in what you can
do with them.

i was talking about updating the string pointer to the position where
the next match in a global (iterative) matching process can start. 
while the nfa approach moves, in general, back and forth through the
string, and the dfa approach steps through the string linearly keeping a
list of possible matches, the end result is the same:  if there is a
match, the next match will not start earlier than after the end of the
current match.  what a dfa and an nfa engine will actually match given a
text and a pattern may differ:

perl -e 'print "ab" =~ /a|ab/g'
# a

echo ab | egrep -o 'a|ab'
# ab

but none of them will report overlapping matches:

perl -e 'print "aaaaa" =~ /aa/g
# 2 matches

echo aaaaa | egrep -o aa
# 2 matches

(afaik, egrep uses a posix-compliant dfa engine) 

to achieve the effect of overlapping matches, you either need to
manually move the pointer while the global match proceeds, or
sequentially perform single matches on successive substrings of the
input string (which can give you the same match more than once,
though).  it appears that my earlier suggestion was flawed, the
following is a bit cleaner:

$string = # some string
$pattern = # some pattern

@matches = ();
while ($string =~ /$pattern/g) { push @matches, [$-[0], $&];
pos($string) -= (length($&) - 1) }

after each successful match, it moves to the position right after the
start of the successful match.


the following will capture all possible matches for all alternatives in
the regex (or so it seems):

$string =~/(?:$pattern)(??{ push @matches, [$-[0], $&] })/g

so that "aabb" =~ /(?:a|abb?)(??{ push @matches, [$-[0], $&] })/g will
give 4 matches.

again, not sure if this can be done within r.

vQ

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Reply via email to