I used R in a somewhat « unusual » way for a rather common problem facing programmers confronted with heterogeneous databases (who may or may not be statisticians):

How do you link database B with database A when both databases' data are totally or partially un-standardized text?

At first glance, R is NOT the ideal tool for that. However:


potent string tools are included in base R (grep, agrep);

R is multi-platform (Linux, Mac, Windows);

with R it is easy to read and write dbf (Dbase) files, and to go from Dbase to Excel, Access and so on.
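The dbf round trip mentioned above can be sketched in a few lines with the foreign package (the data here are invented, and the temporary file stands in for a real Dbase file):

```r
# Round trip: write a data.frame to a Dbase (.dbf) file and read it back.
# The 'foreign' package ships with R; the data below are just an example.
library(foreign)

df <- data.frame(NIP = 1:3,
                 NOM = c("SMITH", "MARTIN", "FONTAINE"),
                 CP  = c("25000", "25000", "39100"))

f <- tempfile(fileext = ".dbf")
write.dbf(df, f)      # export: readable by Dbase, Excel, Access...
back <- read.dbf(f)   # import back into R

stopifnot(identical(as.character(back$NOM), as.character(df$NOM)))
```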


So I propose here a quite primitive tool achieving this aim (where some tricks may be useful). Of course, programmers will adapt it to their specific aims. I hope it may be useful to some of you. Some functions were unexpectedly difficult (how to manage empty results, how to import list data into a database row) and I hope my solutions will help.


The first program links a Dbase file with structured and standardized fields (NIP = index number, NOM = name, PRENOM = surname, CP = postal code) with a csv file of unstructured text (identity, address, postal code) from the yellow pages.


I use the grep (or agrep) function to find a pattern in a text (agrep finds approximate matches).
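The difference between the two can be seen on a toy vector (the names are invented):

```r
# grep() finds exact (regular-expression) matches;
# agrep() also allows approximate matches within an edit distance.
x <- c("SMITH John", "SMYTH John", "MARTIN Paul")

grep("SMITH", x)                     # -> 1      (exact match only)
agrep("SMITH", x, max.distance = 1)  # -> 1 2    (one substitution allowed)
```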

I use three successive passes to find the links: first for the name, then for the surname (within the previous results), and then for the postal code. If this last list of results is not empty, I write it to a Dbase data file.


With the grep function you can return either the index number of the « hit » or its value (with the option value=TRUE). The problem is that it returns only the field that matches. So you look for « smith » and get back « smith ». That is not very useful.

The trick is to use the scan function to read the csv file with the end-of-line character as sep (separator). You then get back the entire line (with ; virtually separating fields), including the index number and so on.

However, the result is a vector of strings. Importing the data from such a result into a database is quite tedious (see below).
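The trick can be sketched on a self-contained example (the field layout and names are hypothetical; I use sep = '\n' here, while the program below uses '\r' for its particular file):

```r
# Reading each csv line as ONE string: grep then returns the whole
# line (index number included), not just the matching field.
csv <- c("1;SMITH;John;25000", "2;MARTIN;Paul;39100")
tf <- tempfile(fileext = ".csv")
writeLines(csv, tf)

# sep = '\n' makes each full line one element of the vector
lines <- scan(tf, what = "character", sep = "\n", quiet = TRUE)

grep("SMITH", lines, value = TRUE)   # -> "1;SMITH;John;25000"
```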

library (foreign)

# Structured Dbase file: NIP, NOM1, PRENOM, CP
TT <- read.dbf ('C:\\PJ\\TFC.dbf')

# Unstructured csv: each full line becomes one string
# (the vector is not called T, which would mask TRUE)
TXT <- scan ('C:\\PJ\\TT.CSV', what = 'character', sep = '\r')

REP <- c()   # collected result lines
K <- 1       # next free index in REP
Sp <- ';'    # field separator

LTT <- length (TT$NOM1)

for (J in 1:LTT) {

  C1 <- c()
  C2 <- c()
  C3 <- c()

  CC <- data.frame (TT [J, ])

  # Description of the current record, prefixed to every hit
  GLOBAL <- paste (Sp, CC$NIP, Sp, CC$NOM1, Sp, CC$PRENOM, Sp, CC$CP, Sp, '0')

  # Pass 1: name; pass 2: surname (within pass-1 hits); pass 3: postal code
  C1 <- grep (CC$NOM1, TXT, value = TRUE)

  if (length (C1) > 0) {
    C2 <- grep (CC$PRENOM, C1, value = TRUE)
  }

  if (length (C2) > 0) {
    C3 <- grep (CC$CP, C2, value = TRUE)
  }

  L <- length (C3)
  if (L > 0) {
    for (G in 1:L) {
      REP [K] <- paste (GLOBAL, Sp, G, Sp, C3 [G], Sp, 'X')
      K <- K + 1
    }
  }
}


RP <- as.vector (REP)


L <- length (RP)

EXC <- c ()

NIP <- c()

NOM <- c()

PRENOM <- c ()

CP <- c()

CTRL <- c()

NB <- c()

NUM <- c ()

NOMX <-c ()

NOM2 <- c()

FONCT <- c()

ADR1 <- c ()

ADR2 <- c ()

CPVILLE <- c ()

TEL <- c ()

FAX <- c()

NUMRUB <- c()

RUB <- c()

SIRET <- c()

for (G in 1:L) {

  TX <- as.character (RP [G])

  # Count the ';' separators character by character (nchar and substr
  # work on characters; length() would just return 1 for a single string)
  LTX <- nchar (TX)
  NX <- 0
  for (G2 in 1:LTX) {
    if (substr (TX, G2, G2) == ';') NX <- NX + 1
  }

  # Pad short lines so that strsplit always yields 19 fields
  if (NX == 18) TX <- paste (TX, ';XX')
  TX <- paste (TX, ';XX')

  LL <- strsplit (TX, ';')

  EXC [G]     <- LL [[1]] [1]
  NIP [G]     <- LL [[1]] [2]
  NOM [G]     <- LL [[1]] [3]
  PRENOM [G]  <- LL [[1]] [4]
  CP [G]      <- LL [[1]] [5]
  CTRL [G]    <- LL [[1]] [6]
  NB [G]      <- LL [[1]] [7]
  NUM [G]     <- LL [[1]] [8]
  NOMX [G]    <- LL [[1]] [9]
  NOM2 [G]    <- LL [[1]] [10]
  FONCT [G]   <- LL [[1]] [11]
  ADR1 [G]    <- LL [[1]] [12]
  ADR2 [G]    <- LL [[1]] [13]
  CPVILLE [G] <- LL [[1]] [14]
  TEL [G]     <- LL [[1]] [15]
  FAX [G]     <- LL [[1]] [16]
  NUMRUB [G]  <- LL [[1]] [17]
  RUB [G]     <- LL [[1]] [18]
  SIRET [G]   <- LL [[1]] [19]
}


DF <- data.frame (NIP, NOM, PRENOM, CP, CTRL, NB, NUM,NOMX,NOM2 , FONCT, ADR1, ADR2, CPVILLE, TEL, FAX, NUMRUB, RUB,SIRET)

write.dbf (DF, 'C:\\PJ\\RR3.dbf')




However, this program returns more data than needed. For example, our central town is named « BESANCON ». So for the 2 people in my database who were named « BESANCON », I found all the people with the same surname and postal code (as my grep was matching the pattern (here BESANCON) against the entire line, including the name, surname and... city fields).


So it is necessary to complete the first « extensive » search with a control pass.

As the table saved by the first program has defined fields, it is easy to open it and grep between the 2 name fields (the ones coming from each database) to check that the link is indeed between the 2 names and not between name1 and city. Is it?

Not really. Because grep is made to find a pattern in a vector (of strings), not to compare 2 strings. So it returns either the indexes of the hits (here at most one, as the field name2 is a one-string vector) or an empty result, integer(0). The problem is that there is no obvious direct test for an empty result (tests such as is.numeric answer TRUE whether the result is empty or not, and other approaches raise an exception).

However, sum gives 0 for an empty result and 1 for the result « 1 ». This may be used as a test.
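This emptiness test, and the more direct length test, can be checked on a one-line example:

```r
# grep() on a single string returns either the index 1 or integer(0)
hit  <- grep("a", "abc")   # -> 1
miss <- grep("z", "abc")   # -> integer(0)

sum(hit)           # -> 1
sum(miss)          # -> 0   : usable as a "no match" test
length(miss) == 0  # -> TRUE: the more direct test
```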

NB: the gsub function removes some confusing characters, including spaces.

library (foreign)

H <- read.dbf ("C:\\PJ\\RR3.dbf")
L <- length (H$NIP)

for (G in 1:L) {

  # Strip punctuation and spaces before comparing the two name fields
  # (the '-' is placed last in the class so it is literal, not a range:
  # "'-_" would be a character range covering digits and capital letters)
  TT <- gsub ("[/?,'_ -]", "", H$NOM [G])
  TX <- gsub ("[/?,'_ -]", "", H$NOMX [G])

  # sum(grep(...)) is 0 when TT is not found in TX: flag a problem
  if (sum (grep (TT, TX)) == 0) H$MAD [G] <- 1
}

write.dbf (H, 'C:\\PJ\\RRX.dbf')


It is useful to have a marker of potential problems, as the procedure may give « false positives »: for example, FONTAINE marc and FONTEINE marc are flagged as errors. A manual check is useful.



One extension is to program a bigram procedure (two strings are compared by taking 2-character substrings of the first and computing the rate of hits in the second). Numeric values are given a *3 weight.
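The procedure described above could also be packaged as a small standalone function. This is a sketch, not the program below: the *3 weight on digit bigrams follows the text, but the function name, the normalization by total weight, and the fixed = TRUE option (so that bigrams are matched literally, not as regular expressions) are my own choices:

```r
# Bigram similarity: slide a 2-character window over 'a', count how
# many of those bigrams also occur in 'b'; digit bigrams count triple.
bigram_score <- function(a, b) {
  n <- nchar(a)
  if (n < 2) return(0)
  hits <- 0
  total <- 0
  for (i in 1:(n - 1)) {
    bg <- substr(a, i, i + 1)
    w  <- if (grepl("[0-9]", bg)) 3 else 1   # numeric bigrams weigh *3
    total <- total + w
    # fixed = TRUE: treat the bigram literally, not as a regex
    if (grepl(bg, b, fixed = TRUE)) hits <- hits + w
  }
  100 * hits / total   # percentage of (weighted) bigram hits
}

bigram_score("12 RUE DES LILAS", "12 R DES LILAS")
```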




library (foreign)

H <- read.dbf ('C:\\PJ\\PJAD.DBF')
L <- length (H$NIP)

for (G in 1:L) {

  # Strip punctuation and spaces from both addresses
  TT <- gsub ("[/?,' ]", "", H$ADRPJ [G])
  TX <- gsub ("[/?,' ]", "", H$ADRF [G])

  LTT <- nchar (TT)
  LL <- LTT      # denominator, inflated below for numeric bigrams
  CTTS <- 0      # weighted count of bigram hits

  for (GG in 1:(LTT - 2)) {

    # Current bigram of the first address
    CT <- substr (TT, GG, GG + 1)

    # 1 if the bigram occurs in the second address, 0 otherwise
    # (fixed = TRUE so regex metacharacters in the data cannot break grep)
    CTT <- sum (grep (CT, TX, fixed = TRUE))

    # Numeric bigrams (street numbers...) weigh *3
    if (sum (grep ("[0-9]", CT))) {
      CTT <- 3 * CTT
      LL <- LL + 3
    }

    CTTS <- CTTS + CTT
  }

  # Similarity score as a percentage
  CTTS1 <- (CTTS / LL) * 100
  H$MAD [G] <- CTTS1
}

write.dbf (H, 'C:\\PJ\\PJAD2.dbf')




------------------------------------------------------------------------

_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-teaching

