I was aware of that which is why I mentioned that it is usually used for matching last names rather than countries and noted possible need to modify the algorithm slightly. soundex is a relatively simple algorithm so its not too hard. For example, one could just code the first letter too.
On 1/11/06, SAULEAU Erik-André <[EMAIL PROTECTED]> wrote: > > > dear all, > > yes but the problem with soundex for example is that it does not work when > an error occur in the first place (Canada vs Kanada) as it keeps the fist > character. It seems that you have to look after an approximate string > matching algorithm (for example, a very good one if from Porter-Jaro and > Winkler at the US Census bureau or have o look to the book of Navarro about > classification of algorithm). > > HTH and an happy new year, erik. > > > -----Message d'origine----- > De: Gabor Grothendieck > A: Werner Wernersen > Cc: [email protected] > Date: 10/01/2006 21:16 > Objet: Re: [R] matching country name tables from different sources > > > One other thing to try could be soundex. ITs normally used for > last names but it might work here too. Google to find the > soundex encoding rules. Reviewing the country names might > suggest minor modifications to the soundex algorithm to > improve it for your case. > > On 1/10/06, Gabor Grothendieck <[EMAIL PROTECTED]> wrote: > > You can improve it somewhat by first accepting all the largest > > matches and removing the rows and columns for those and > > repeatedly doing that with what is left. > > > > On 1/10/06, Werner Wernersen <[EMAIL PROTECTED]> wrote: > > > Thanks for the nice code, Gabor! > > > > > > Unfortunately, it seems not to work for my purpose, confuses lots of > > > countries when I compare two lists of over 150 countries each. > > > Do you have any other suggestions? > > > > > > > > > > > > Gabor Grothendieck <[EMAIL PROTECTED]> schrieb: > > > If they were the same you could use merge. To figure out > > > the correspondence automatically or semiautomatically, try this: > > > > > > x <- c("Canada", "US", "Mexico") > > > y <- c("Kanada", "United States", "Mehico") > > > result <- outer(x, y, function(x,y) mapply(lcs2, x, y)) > > > result[] <- sapply(result, nchar) > > > # try both which.max and which.min and if you are lucky > > > # one of them will give unique values and that is the one to use > > > # In this case which.max does. > > > apply(result, 1, which.max) # 1 2 3 > > > > > > # calculate longest common subsequence between 2 strings > > > lcs2 <- function(s1,s2) { > > > longest <- function(x,y) if (nchar(x) > nchar(y)) x else y > > > # Make sure args are strings > > > a <- as.character(s1); an <- nchar(s1)+1 > > > b <- as.character(s2); bn <- nchar(s2)+1 > > > > > > > > > # If one arg is an empty string, returns the length of the other > > > if (nchar(a)==0) return(nchar(b)) > > > if (nchar(b)==0) return(nchar(a)) > > > > > > > > > # Initialize matrix for calculations > > > m <- matrix("", nrow=an, ncol=bn) > > > > > > for (i in 2:an) > > > for (j in 2:bn) > > > m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1)) > > > paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "") > > > else > > > longest(m[i-1,j], m[i,j-1]) > > > > > > # Returns the distance > > > m[an,bn] > > > } > > > > > > > > > > > > On 1/10/06, Werner Wernersen wrote: > > > > Hi, > > > > > > > > Before I reinvent the wheel I wanted to kindly ask you for your > opinion if > > > there is a simple way to do it. > > > > > > > > I want to merge a larger number of tables from different data > sources in R > > > and the matching criterium are country names. The tables are of > different > > > size and sometimes the country names do differ slightly. > > > > > > > > Has anyone done this or any recommendation on what commands I > should look > > > at to automize this task as much as possible? > > > > > > > > Thanks a lot for your effort in advance. > > > > > > > > All the best, > > > > Werner > > > > > > > > > > > > > > > > --------------------------------- > > > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu > PC! > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > ______________________________________________ > > > > [email protected] mailing list > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > PLEASE do read the posting guide! > > > http://www.R-project.org/posting-guide.html > > > > > > > > > > > > > > > > ________________________________ > > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu > PC! > > > Jetzt Yahoo! Messenger installieren! > > > > > > > > > > ______________________________________________ > [email protected] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > > ************************************************************************ > ********** > Afin d'eviter toute propagation de virus informatique, et en complement > des dispositifs en place, ce message (et ses pieces jointes s'il y en a) > > a ete automatiquement analyse par un antivirus de messagerie. > ************************************************************************ > ********** > > ********************************************************************************** > Afin d'eviter toute propagation de virus informatique, et en complement > des dispositifs en place, ce message (et ses pieces jointes s'il y en a) > a ete automatiquement analyse par un antivirus de messagerie. > ********************************************************************************** > ______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
