On Tue, 10 Jan 2006, McGehee, Robert wrote:

> I would throw a tolower() around s1 and s2 so that 'canada' matches with
> 'CANADA', and perhaps consider using a Levenshtein distance rather than
> the longest common subsequence.
>
> An algorithm for Levenshtein distance can be found here (courtesy of
> Stephen Upton)
> https://stat.ethz.ch/pipermail/r-help/2005-January/062254.html

Or even ?agrep - it uses Levenshtein edit distance and has an argument for
ignoring case. First hit in RSiteSearch("fuzzy match"), by the way.

>
> Robert
>
> -----Original Message-----
> From: Werner Wernersen [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 10, 2006 2:00 PM
> To: Gabor Grothendieck
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] matching country name tables from different sources
>
> Thanks for the nice code, Gabor!
>
> Unfortunately, it does not seem to work for my purpose: it confuses lots
> of countries when I compare two lists of over 150 countries each.
> Do you have any other suggestions?
>
> Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> If they were the same you could use merge. To figure out
> the correspondence automatically or semiautomatically, try this:
>
> x <- c("Canada", "US", "Mexico")
> y <- c("Kanada", "United States", "Mehico")
> result <- outer(x, y, function(x,y) mapply(lcs2, x, y))
> result[] <- sapply(result, nchar)
> # try both which.max and which.min and if you are lucky
> # one of them will give unique values and that is the one to use
> # In this case which.max does.
> apply(result, 1, which.max) # 1 2 3
>
> # calculate longest common subsequence between 2 strings
> lcs2 <- function(s1,s2) {
>   longest <- function(x,y) if (nchar(x) > nchar(y)) x else y
>   # Make sure args are strings
>   a <- as.character(s1); an <- nchar(a)+1
>   b <- as.character(s2); bn <- nchar(b)+1
>
>   # If either arg is an empty string, the common subsequence is empty
>   if (nchar(a)==0 || nchar(b)==0) return("")
>
>   # Initialize matrix for calculations
>   m <- matrix("", nrow=an, ncol=bn)
>
>   for (i in 2:an)
>     for (j in 2:bn)
>       m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
>         paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "")
>       else
>         longest(m[i-1,j], m[i,j-1])
>
>   # Returns the longest common subsequence as a string
>   m[an,bn]
> }
>
> On 1/10/06, Werner Wernersen wrote:
> > Hi,
> >
> > Before I reinvent the wheel I wanted to kindly ask you for your
> > opinion if there is a simple way to do it.
> >
> > I want to merge a larger number of tables from different data sources
> > in R, and the matching criterion is the country name. The tables are of
> > different sizes and sometimes the country names differ slightly.
> >
> > Has anyone done this, or is there any recommendation on what commands I
> > should look at to automate this task as much as possible?
> >
> > Thanks a lot for your effort in advance.
> >
> > All the best,
> > Werner
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: [EMAIL PROTECTED]
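
A minimal sketch of the edit-distance suggestions made in this thread, assuming
a version of R that provides adist() in the utils package (it was added after
this thread was written). It reuses the x and y vectors from the quoted example.
The cutoff of 2 edits and the names d, best and matches are illustrative
assumptions, not something proposed in the thread.

```r
x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")

# Pairwise generalized Levenshtein distances; ignore.case = TRUE makes
# 'canada' and 'CANADA' compare as equal.
d <- adist(x, y, ignore.case = TRUE)
dimnames(d) <- list(x, y)

# Closest candidate in y for each name in x (ties go to the first candidate)
best <- apply(d, 1, which.min)
matches <- data.frame(
  x        = x,
  y        = y[best],
  distance = d[cbind(seq_along(x), best)]
)

# Hypothetical cutoff: anything further than 2 edits is flagged for manual review
matches$review <- matches$distance > 2
matches

# agrep() approximately matches a single pattern against a vector;
# the default max.distance is used here.
agrep("Kanada", x, ignore.case = TRUE, value = TRUE)
```

Edit distance handles spelling variants such as "Kanada" or "Mehico", but an
abbreviation such as "US" is many edits away from "United States", so such
rows are better flagged for manual review than matched automatically.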