I downloaded and compiled the standalone TRE agrep command line program, and I think I have a slightly better idea of what's going on. Basically R's agrep, like the command line tool, is matching all strings that *contain* the pattern. So, essentially, insertions before and after the pattern is "free".
As far as I can tell, there isn't an option to require full-string matches using the TRE library. It should be possible to not use REG_LITERAL and surround the pattern with ^ and $, but that would require escaping all special characters in the original pattern. Is this something worth pursuing? (For my immediate needs I'll probably create a separate function that passes the regex directly to TRE without REG_LITERAL). Daniel On 11/17/10 11:47 AM, "Joris Meys" <jorism...@gmail.com> wrote: >It might have to do something with spaces and the interpretation of >insertions, as far as I understand the following examples : > >> agrep("x",c("x","xy","xyz","xyza"),max=list(all=1)) >[1] 1 2 3 4 >> agrep("x ",c("x ","xy ","xyz ","xyza"),max=list(all=1)) >[1] 1 >> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1))) >[1] 1 2 3 4 >> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1))) >[1] 1 2 3 4 >> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=2))) >[1] 1 >> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=3))) >[1] 1 > >If the sequences are made the same length in spaces, this function >gives the expected result in the second example, but it definitely >doesn't do that any more when you start playing around with >insertions. If not a bug, it definitely behaves pretty weird... > >Cheers >Joris > >On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel ><ddicki...@carnegielearning.com> wrote: >> I posted this yesterday to r-help and Ben Bolker suggested reposting it >> here... >> >> Dickison, Daniel <ddickison <at> carnegielearning.com> writes: >> >>> >>> The documentation for agrep says it uses the Levenshtein edit distance, >>> but it seems to get this wrong in certain cases when there is a >>> combination of deletions and substitutions. For example: >>> >>> > agrep("abcd", "abcxyz", max.distance=1) >>> [1] 1 >>> >>> That should've been a no-match. The edit distance between those >>>strings >>> is 3 (1 substitution, 2 deletions), but agrep matches with max.distance >>>>= >>> 1. >>> >>> I didn't find anything in the bug database, so I was wondering if >>>somehow >>> I'm misinterpreting how agrep works. If not, should I file this in >>> Bugzilla? >>> >> >> Could you re-post this on r-devel? It definitely sounds like >> this is worth following up. Based on a little bit of playing around, >> it's quite clear that I don't understand what's going on. The examples >> show things like >> >> agrep("lasy","lazy",max=list(sub=0)) >> >> which makes sense, but >> >> agrep("lasy","lazybc",max=1) >> agrep("lasy","lazybc",max=0.001) >> agrep("lasy","layt",max=list(all=1)) >> >> and >> >> agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2)) >> agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2)) >> agrep("x",c("x","xy","xyz","xyza"),max=list(all=2)) >> >> all give "1 2 3 4" ?? >> >> this makes it clear that I really don't understand what's going on >> based on the documentation. I tried to trace into the C code >> (which calls functions from the TRE regexp library) but that didn't >> help much ... >> >> >> >> Daniel Dickison >> Research Programmer >> ddicki...@carnegielearning.com >> Toll Free: (888) 851-7094 x103 >> FAX: (412) 690-2444 >> >> Revolutionary Math Curricula. Revolutionary Results. >> >> Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA >>15219 >> www.carnegielearning.com >> >> ______________________________________________ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > > >-- >Joris Meys >Statistical consultant > >Ghent University >Faculty of Bioscience Engineering >Department of Applied mathematics, biometrics and process control > >tel : +32 9 264 59 87 >joris.m...@ugent.be >------------------------------- >Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel