On Wed, 17 Nov 2010, Dickison, Daniel wrote:

I downloaded and compiled the standalone TRE agrep command line program,
and I think I have a slightly better idea of what's going on.  Basically
R's agrep, like the command line tool, is matching all strings that
*contain* the pattern.  So, essentially, insertions before and after the
pattern is "free".

As far as I can tell, there isn't an option to require full-string matches
using the TRE library.  It should be possible to not use REG_LITERAL and
surround the pattern with ^ and $, but that would require escaping all
special characters in the original pattern.

Is this something worth pursuing?  (For my immediate needs I'll probably
create a separate function that passes the regex directly to TRE without
REG_LITERAL).


I am joining this thread late, but I wonder if reversing agrep's 'pattern' and 'x' args serves the OP's need.

viz.

sapply( c("x","xy","xyz","xyza"),
+ function(y) any( agrep( y, "x", max=list(all=1))))
    x    xy   xyz  xyza
 TRUE  TRUE FALSE FALSE


HTH,

Chuck


Daniel

On 11/17/10 11:47 AM, "Joris Meys" <jorism...@gmail.com> wrote:

It might have to do something with spaces and the interpretation of
insertions, as far as I understand the following examples :

agrep("x",c("x","xy","xyz","xyza"),max=list(all=1))
[1] 1 2 3 4
agrep("x    ",c("x    ","xy  ","xyz ","xyza"),max=list(all=1))
[1] 1
agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1)))
[1] 1 2 3 4
agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1)))
[1] 1 2 3 4
agrep("xx   ",c("xx   ","xyx  ","xyzx ","xyzax",max=list(all=2)))
[1] 1
agrep("xx   ",c("xx   ","xyx  ","xyzx ","xyzax",max=list(all=3)))
[1] 1

If the sequences are made the same length in spaces, this function
gives the expected result in the second example, but it definitely
doesn't do that any more when you start playing around with
insertions. If not a bug, it definitely behaves pretty weird...

Cheers
Joris

On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel
<ddicki...@carnegielearning.com> wrote:
I posted this yesterday to r-help and Ben Bolker suggested reposting it
here...

Dickison, Daniel <ddickison <at> carnegielearning.com> writes:


The documentation for agrep says it uses the Levenshtein edit distance,
but it seems to get this wrong in certain cases when there is a
combination of deletions and substitutions.  For example:

agrep("abcd", "abcxyz", max.distance=1)
[1] 1

That should've been a no-match.  The edit distance between those
strings
is 3 (1 substitution, 2 deletions), but agrep matches with max.distance
=
1.

I didn't find anything in the bug database, so I was wondering if
somehow
I'm misinterpreting how agrep works.  If not, should I file this in
Bugzilla?


 Could you re-post this on r-devel?  It definitely sounds like
this is worth following up.  Based on a little bit of playing around,
it's quite clear that I don't understand what's going on.  The examples
show things like

agrep("lasy","lazy",max=list(sub=0))

 which makes sense, but

agrep("lasy","lazybc",max=1)
agrep("lasy","lazybc",max=0.001)
agrep("lasy","layt",max=list(all=1))

and

agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))

 all give "1 2 3 4" ??

 this makes it clear that I really don't understand what's going on
based on the documentation.  I tried to trace into the C code
(which calls functions from the TRE regexp library) but that didn't
help much ...



Daniel  Dickison
Research Programmer
ddicki...@carnegielearning.com
Toll Free: (888) 851-7094 x103
FAX: (412) 690-2444

Revolutionary Math Curricula. Revolutionary Results.

Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA
15219
www.carnegielearning.com

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Charles C. Berry                            Dept of Family/Preventive Medicine
cbe...@tajo.ucsd.edu                        UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to