I would throw a tolower() around s1 and s2 so that 'canada' matches with 'CANADA', and perhaps consider using a Levenshtein distance rather than the longest common subsequence.
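
A minimal sketch of that suggestion, assuming a current R installation where the utils package provides adist() (generalized Levenshtein distance). The vectors x and y are the example names from Gabor's message quoted below; d, best and matches are hypothetical names used only for illustration.

```r
# Example names taken from the thread
x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")

# Lower-case both sides so that 'canada' and 'CANADA' count as identical,
# then compute the matrix of pairwise edit distances (rows = x, columns = y)
d <- adist(tolower(x), tolower(y))
dimnames(d) <- list(x, y)

# For each name in x, the index of the closest name in y
# (which.min returns the first index when there is a tie)
best <- apply(d, 1, which.min)

# Candidate matches with their distances, for manual review
matches <- data.frame(x = x,
                      match = y[best],
                      distance = d[cbind(seq_along(x), best)])
print(matches)
```

Spelling variants such as Canada/Kanada and Mexico/Mehico come out at distance 1, but plain edit distance does not handle abbreviations: "US" is 11 edits from "United States" and only 6 from "Kanada", so pairs like that would still need a hand-made lookup table.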
An algorithm for Levenshtein distance can be found here (courtesy of Stephen Upton):
https://stat.ethz.ch/pipermail/r-help/2005-January/062254.html

Robert

-----Original Message-----
From: Werner Wernersen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 10, 2006 2:00 PM
To: Gabor Grothendieck
Cc: [email protected]
Subject: Re: [R] matching country name tables from different sources

Thanks for the nice code, Gabor!

Unfortunately, it does not seem to work for my purpose: it confuses lots of countries when I compare two lists of over 150 countries each.

Do you have any other suggestions?

Gabor Grothendieck <[EMAIL PROTECTED]> wrote:

If they were the same you could use merge. To figure out the correspondence automatically or semiautomatically, try this:

# calculate longest common subsequence between 2 strings
# (defined first so that the example runs from top to bottom)
lcs2 <- function(s1, s2) {
  longest <- function(x, y) if (nchar(x) > nchar(y)) x else y
  # Make sure args are strings
  a <- as.character(s1); an <- nchar(a) + 1
  b <- as.character(s2); bn <- nchar(b) + 1
  # If either arg is an empty string, the common subsequence is empty
  if (nchar(a) == 0 || nchar(b) == 0) return("")
  # Initialize matrix for calculations
  m <- matrix("", nrow = an, ncol = bn)
  for (i in 2:an) {
    for (j in 2:bn) {
      m[i, j] <- if (substr(a, i - 1, i - 1) == substr(b, j - 1, j - 1)) {
        paste(m[i - 1, j - 1], substr(a, i - 1, i - 1), sep = "")
      } else {
        longest(m[i - 1, j], m[i, j - 1])
      }
    }
  }
  # Returns the longest common subsequence (a string, not a distance)
  m[an, bn]
}

x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")
result <- outer(x, y, function(x, y) mapply(lcs2, x, y))
result[] <- sapply(result, nchar)
# try both which.max and which.min and if you are lucky
# one of them will give unique values and that is the one to use
# In this case which.max does.
apply(result, 1, which.max) # 1 2 3

On 1/10/06, Werner Wernersen wrote:
> Hi,
>
> Before I reinvent the wheel I wanted to kindly ask you for your opinion if there is a simple way to do it.
>
> I want to merge a large number of tables from different data sources in R, and the matching criterion is the country name. The tables are of different sizes, and sometimes the country names differ slightly.
>
> Has anyone done this, or does anyone have a recommendation on which commands I should look at to automate this task as much as possible?
>
> Thanks a lot for your effort in advance.
>
> All the best,
> Werner
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
