I posted this yesterday to r-help and Ben Bolker suggested reposting it here...

Dickison, Daniel <ddickison <at> carnegielearning.com> writes:

> The documentation for agrep says it uses the Levenshtein edit distance,
> but it seems to get this wrong in certain cases when there is a
> combination of deletions and substitutions. For example:
>
> > agrep("abcd", "abcxyz", max.distance=1)
> [1] 1
>
> That should've been a no-match. The edit distance between those strings
> is 3 (1 substitution, 2 deletions), but agrep matches with
> max.distance >= 1.
>
> I didn't find anything in the bug database, so I was wondering if somehow
> I'm misinterpreting how agrep works. If not, should I file this in
> Bugzilla?

Could you re-post this on r-devel? It definitely sounds like this is
worth following up.

Based on a little bit of playing around, it's quite clear that I don't
understand what's going on. The examples show things like

  agrep("lasy","lazy",max=list(sub=0))

which makes sense, but

  agrep("lasy","lazybc",max=1)
  agrep("lasy","lazybc",max=0.001)
  agrep("lasy","layt",max=list(all=1))

and

  agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
  agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
  agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))

all give "1 2 3 4" ??

This makes it clear that I really don't understand what's going on
based on the documentation. I tried to trace into the C code (which
calls functions from the TRE regexp library) but that didn't help
much ...

Daniel Dickison
Research Programmer
ddicki...@carnegielearning.com
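
As a sanity check on the distance quoted above, here is a minimal sketch of the textbook Levenshtein computation in plain R. The helper name lev_dist is made up for illustration and is not part of R or of agrep. It compares the two strings as wholes, which is the interpretation assumed in the original question; it says nothing about what agrep does internally.

```r
# Hypothetical helper (not part of base R): whole-string Levenshtein
# distance via the standard dynamic-programming recurrence.
lev_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # drop x[i]
        d[i + 1, j] + 1L,  # add y[j]
        d[i, j] + cost     # match or substitute
      )
    }
  }
  d[n + 1, m + 1]
}

lev_dist("abcd", "abcxyz")  # 3
lev_dist("lasy", "lazy")    # 1
lev_dist("lasy", "lazybc")  # 3
```

Under this whole-string definition, only the "lasy"/"lazy" pair falls within a distance of 1, which is why the agrep results above look surprising.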