I used R in a somewhat « unusual » way for a rather common problem facing programmers confronted with heterogeneous databases (who may or may not be statisticians):

How do you link database B with database A when both databases' data are totally or partially un-standardized text?

At first glance, R is NOT the ideal tool for that. However:


potent string tools are included in base R (grep, agrep);

R is multi-platform (Linux, Mac, Windows);

with R it is easy to read and write dbf (Dbase) files, and to go from Dbase to Excel, Access and so on.
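The dbf round trip mentioned above can be sketched in a few lines with the foreign package (the data here are invented, and the temporary file stands in for a real Dbase file):

```r
# Round trip: write a data.frame to a Dbase (.dbf) file and read it back.
# The 'foreign' package ships with R; the data below are just an example.
library(foreign)

df <- data.frame(NIP = 1:3,
                 NOM = c("SMITH", "MARTIN", "FONTAINE"),
                 CP  = c("25000", "25000", "39100"))

f <- tempfile(fileext = ".dbf")
write.dbf(df, f)      # export: readable by Dbase, Excel, Access...
back <- read.dbf(f)   # import back into R

stopifnot(identical(as.character(back$NOM), as.character(df$NOM)))
```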


So I propose here a quite primitive tool achieving this aim (where some tricks may be useful). Of course, programmers will adapt it to their specific aims. I hope it may be useful to some of you. Some functions were unexpectedly difficult (how to manage empty results, how to import list data into a database row) and I hope my solutions will help.


The first program links a Dbase file with structured and standardized fields (NIP = index number, NOM = name, PRENOM = surname, CP = postal code) with a csv file of unstructured text (identity, address, postal code) from the yellow pages.


I use the grep (or agrep) function to find a pattern in a text (agrep finds approximate matches).
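The difference between the two can be seen on a toy vector (the names are invented):

```r
# grep() finds exact (regular-expression) matches;
# agrep() also allows approximate matches within an edit distance.
x <- c("SMITH John", "SMYTH John", "MARTIN Paul")

grep("SMITH", x)                     # -> 1      (exact match only)
agrep("SMITH", x, max.distance = 1)  # -> 1 2    (one substitution allowed)
```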

I use three successive passes to find the links: first for the name, then for the surname (within the previous results), and then for the postal code. If this last list of results is not empty, I write it to a Dbase data file.


With the grep function you can return either the index number of the « hit » or its value (with the option value=TRUE). The problem is that it returns only the field that matches. So you look for « smith » and get back « smith ». That is not very useful.

The trick is to use the scan function to read the csv file with the end-of-line character as sep (separator). You then get back the entire line (with ; virtually separating fields), including the index number and so on.

However, the result is a vector of strings. Importing the data from such a result into a database is quite tedious (see below).
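The trick can be sketched on a self-contained example (the field layout and names are hypothetical; I use sep = '\n' here, while the program below uses '\r' for its particular file):

```r
# Reading each csv line as ONE string: grep then returns the whole
# line (index number included), not just the matching field.
csv <- c("1;SMITH;John;25000", "2;MARTIN;Paul;39100")
tf <- tempfile(fileext = ".csv")
writeLines(csv, tf)

# sep = '\n' makes each full line one element of the vector
lines <- scan(tf, what = "character", sep = "\n", quiet = TRUE)

grep("SMITH", lines, value = TRUE)   # -> "1;SMITH;John;25000"
```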

library (foreign)

# Structured Dbase file: NIP, NOM1, PRENOM, CP
TT <- read.dbf ('C:\\PJ\\TFC.dbf')

# Unstructured csv: each full line becomes one string
# (the vector is not called T, which would mask TRUE)
TXT <- scan ('C:\\PJ\\TT.CSV', what = 'character', sep = '\r')

REP <- c()   # collected result lines
K <- 1       # next free index in REP
Sp <- ';'    # field separator

LTT <- length (TT$NOM1)

for (J in 1:LTT) {

  C1 <- c()
  C2 <- c()
  C3 <- c()

  CC <- data.frame (TT [J, ])

  # Description of the current record, prefixed to every hit
  GLOBAL <- paste (Sp, CC$NIP, Sp, CC$NOM1, Sp, CC$PRENOM, Sp, CC$CP, Sp, '0')

  # Pass 1: name; pass 2: surname (within pass-1 hits); pass 3: postal code
  C1 <- grep (CC$NOM1, TXT, value = TRUE)

  if (length (C1) > 0) {
    C2 <- grep (CC$PRENOM, C1, value = TRUE)
  }

  if (length (C2) > 0) {
    C3 <- grep (CC$CP, C2, value = TRUE)
  }

  L <- length (C3)
  if (L > 0) {
    for (G in 1:L) {
      REP [K] <- paste (GLOBAL, Sp, G, Sp, C3 [G], Sp, 'X')
      K <- K + 1
    }
  }
}


RP <- as.vector (REP)


L <- length (RP)

EXC <- c ()

NIP <- c()

NOM <- c()

PRENOM <- c ()

CP <- c()

CTRL <- c()

NB <- c()

NUM <- c ()

NOMX <-c ()

NOM2 <- c()

FONCT <- c()

ADR1 <- c ()

ADR2 <- c ()

CPVILLE <- c ()

TEL <- c ()

FAX <- c()

NUMRUB <- c()

RUB <- c()

SIRET <- c()

for (G in 1:L) {

  TX <- as.character (RP [G])

  # Count the ';' separators character by character (nchar and substr
  # work on characters; length() would just return 1 for a single string)
  LTX <- nchar (TX)
  NX <- 0
  for (G2 in 1:LTX) {
    if (substr (TX, G2, G2) == ';') NX <- NX + 1
  }

  # Pad short lines so that strsplit always yields 19 fields
  if (NX == 18) TX <- paste (TX, ';XX')
  TX <- paste (TX, ';XX')

  LL <- strsplit (TX, ';')

  EXC [G]     <- LL [[1]] [1]
  NIP [G]     <- LL [[1]] [2]
  NOM [G]     <- LL [[1]] [3]
  PRENOM [G]  <- LL [[1]] [4]
  CP [G]      <- LL [[1]] [5]
  CTRL [G]    <- LL [[1]] [6]
  NB [G]      <- LL [[1]] [7]
  NUM [G]     <- LL [[1]] [8]
  NOMX [G]    <- LL [[1]] [9]
  NOM2 [G]    <- LL [[1]] [10]
  FONCT [G]   <- LL [[1]] [11]
  ADR1 [G]    <- LL [[1]] [12]
  ADR2 [G]    <- LL [[1]] [13]
  CPVILLE [G] <- LL [[1]] [14]
  TEL [G]     <- LL [[1]] [15]
  FAX [G]     <- LL [[1]] [16]
  NUMRUB [G]  <- LL [[1]] [17]
  RUB [G]     <- LL [[1]] [18]
  SIRET [G]   <- LL [[1]] [19]
}


DF <- data.frame (NIP, NOM, PRENOM, CP, CTRL, NB, NUM,NOMX,NOM2 , FONCT, ADR1, ADR2, CPVILLE, TEL, FAX, NUMRUB, RUB,SIRET)

write.dbf (DF, 'C:\\PJ\\RR3.dbf')




However, this program returns more data than needed. For example, our central town is named « BESANCON ». So for the 2 people in my database who were named « BESANCON », I found all the people with the same surname and postal code (as my grep was matching the pattern (here BESANCON) against the entire line, including the name, surname and... city fields).


So it is necessary to complete the first « extensive » search with a control pass.

As the table saved by the first program has defined fields, it is easy to open it and grep between the 2 name fields (the ones coming from each database) to check that the link is indeed between the 2 names and not between name1 and city. Is it?

Not really. Because grep is made to find a pattern in a vector (of strings), not to compare 2 strings. So it returns either the indexes of the hits (here at most one, as the field name2 is a one-string vector) or an empty result, integer(0). The problem is that there is no obvious direct test for an empty result (tests such as is.numeric answer TRUE whether the result is empty or not, and other approaches raise an exception).

However, sum gives 0 for an empty result and 1 for the result « 1 ». This may be used as a test.
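This emptiness test, and the more direct length test, can be checked on a one-line example:

```r
# grep() on a single string returns either the index 1 or integer(0)
hit  <- grep("a", "abc")   # -> 1
miss <- grep("z", "abc")   # -> integer(0)

sum(hit)           # -> 1
sum(miss)          # -> 0   : usable as a "no match" test
length(miss) == 0  # -> TRUE: the more direct test
```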

NB: the gsub function removes some confusing characters, including spaces.

library (foreign)

H <- read.dbf ("C:\\PJ\\RR3.dbf")
L <- length (H$NIP)

for (G in 1:L) {

  # Strip punctuation and spaces before comparing the two name fields
  # (the '-' is placed last in the class so it is literal, not a range:
  # "'-_" would be a character range covering digits and capital letters)
  TT <- gsub ("[/?,'_ -]", "", H$NOM [G])
  TX <- gsub ("[/?,'_ -]", "", H$NOMX [G])

  # sum(grep(...)) is 0 when TT is not found in TX: flag a problem
  if (sum (grep (TT, TX)) == 0) H$MAD [G] <- 1
}

write.dbf (H, 'C:\\PJ\\RRX.dbf')


It is useful to have a marker of potential problems, as the procedure may give « false positives »: for example, FONTAINE marc and FONTEINE marc are flagged as errors. A manual check is useful.



One extension is to program a bigram procedure (two strings are compared by taking 2-character substrings of the first and computing the rate of hits in the second). Numeric values are given a *3 weight.
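The procedure described above could also be packaged as a small standalone function. This is a sketch, not the program below: the *3 weight on digit bigrams follows the text, but the function name, the normalization by total weight, and the fixed = TRUE option (so that bigrams are matched literally, not as regular expressions) are my own choices:

```r
# Bigram similarity: slide a 2-character window over 'a', count how
# many of those bigrams also occur in 'b'; digit bigrams count triple.
bigram_score <- function(a, b) {
  n <- nchar(a)
  if (n < 2) return(0)
  hits <- 0
  total <- 0
  for (i in 1:(n - 1)) {
    bg <- substr(a, i, i + 1)
    w  <- if (grepl("[0-9]", bg)) 3 else 1   # numeric bigrams weigh *3
    total <- total + w
    # fixed = TRUE: treat the bigram literally, not as a regex
    if (grepl(bg, b, fixed = TRUE)) hits <- hits + w
  }
  100 * hits / total   # percentage of (weighted) bigram hits
}

bigram_score("12 RUE DES LILAS", "12 R DES LILAS")
```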




library (foreign)

H <- read.dbf ('C:\\PJ\\PJAD.DBF')
L <- length (H$NIP)

for (G in 1:L) {

  # Strip punctuation and spaces from both addresses
  TT <- gsub ("[/?,' ]", "", H$ADRPJ [G])
  TX <- gsub ("[/?,' ]", "", H$ADRF [G])

  LTT <- nchar (TT)
  LL <- LTT      # denominator, inflated below for numeric bigrams
  CTTS <- 0      # weighted count of bigram hits

  for (GG in 1:(LTT - 2)) {

    # Current bigram of the first address
    CT <- substr (TT, GG, GG + 1)

    # 1 if the bigram occurs in the second address, 0 otherwise
    # (fixed = TRUE so regex metacharacters in the data cannot break grep)
    CTT <- sum (grep (CT, TX, fixed = TRUE))

    # Numeric bigrams (street numbers...) weigh *3
    if (sum (grep ("[0-9]", CT))) {
      CTT <- 3 * CTT
      LL <- LL + 3
    }

    CTTS <- CTTS + CTT
  }

  # Similarity score as a percentage
  CTTS1 <- (CTTS / LL) * 100
  H$MAD [G] <- CTTS1
}

write.dbf (H, 'C:\\PJ\\PJAD2.dbf')




------------------------------------------------------------------------

_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-teaching

