set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB", "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="")
genoT = as.data.frame(genoT)
dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB", "BB"))-1))
##
##    user  system elapsed
## 119.288   0.004 119.339
(for all 240K)

best,
b

ps: note that "out" is a list.

On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:

> Hi,
>
>> -----Original Message-----
>> From: Benilton Carvalho [mailto:[EMAIL PROTECTED]
>> Sent: Friday, July 20, 2007 12:25 AM
>> To: Latchezar Dimitrov
>> Cc: r-help@stat.math.ethz.ch
>> Subject: Re: [R] Dataframe of factors transform speed?
>>
>> it looks like that whatever method you used to genotype the
>> 1002 samples on the STY array gave you a transposed matrix of
>> genotype calls. :-)
>
> It only looks like :-)
>
> Otherwise it is correctly created dataframe of 1002 samples X (big
> number) of columns (SNP genotypes). It worked perfectly until I decided
> to put together two cohorts independently processed in R already. I got
> stuck with my lack of foreseeing. Otherwise I would have put 3 dummy
> lines w/ AA, AB, and BB on each one to make sure all 3 genotypes are
> present and that's it! Lesson for the future :-)
>
> Maybe I am not using columns and rows appropriately here but the
> dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-) - as
> str says 1002 observ. of (big number) vars.
>
>>
>> i'd use:
>>
>> genoT = read.table(yourFile, stringsAsFactors = FALSE)
>>
>> as a starting point... but I don't think that would be
>> efficient (as you'd need to fix one column at a time - lapply).
>
> No it was not efficient at all. 'matter of fact nothing is more
> efficient than loading already read data, alas :-(
>
>>
>> i'd preprocess yourFile before trying to load it:
>>
>> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
>>
>> and, now, in R:
>>
>> genoT = read.table(outFile, header=TRUE)
>
> ... Too late ;-) As it must be clear now I have two dataframes I want to
> put together with rbind(geno1,geno2).
> The issue again is
> "uniformization" of factor variables w/ missing factors - they ended up
> like levels AA,BB on one of them and levels AB,BB on the other which
> means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on the
> second - complete mess. That's why I tried to make both uniform, i.e.
> levels "AA","AB", and "BB" for every SNP and then rbind works.
>
> In any case my 1st question remains: "What's wrong with me?" :-)
>
> Thanks,
> Latchezar
>
>>
>> b
>>
>> On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
>>
>>> Hello,
>>>
>>> This is a speed question. I have a dataframe genoT:
>>>
>>> > dim(genoT)
>>> [1]   1002 238304
>>>
>>> > str(genoT)
>>> 'data.frame': 1002 obs. of 238304 variables:
>>>  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
>>>  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
>>>  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
>>>  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
>>>  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
>>>  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
>>>  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
>>>  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
>>>  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
>>>  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
>>>  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
>>>  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
>>>  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
>>>  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
>>>  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
>>>  $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
>>>  $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
>>>  $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
>>>  $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
>>>
>>> Its columns are factors with different number of levels (from 1 to 3 -
>>> that's what I got from read.table, i.e., it dropped missing levels). I
>>> want to convert it to uniform factors with 3 levels. The 1st 10 rows
>>> above show already converted columns and the rest are not yet
>>> converted.
>>> Here's my attempt which is a complete failure as speed:
>>>
>>> > system.time(
>>> + for(j in 1:(10 )){   #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
>>> +   gt<-genoT[[j]]     #-- this is to avoid 2D indices
>>> +   for(l in 1:length([EMAIL PROTECTED])){
>>> +     levels(gt)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2")   #-- convert levels to "0","1", or "2"
>>> +     genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
>>> +   }
>>> + }
>>> + )
>>> [1] 785.085   4.358 789.454   0.000   0.000
>>>
>>> 789s for 10 columns only!
>>>
>>> To me it seems like replacing 10 x 3 levels and then making a factor of
>>> 1002 element vector x 10 is a "negligible" amount of operations needed.
>>>
>>> So, what's wrong with me?
>>> Any idea how to accelerate significantly the
>>> transformation or (to go to the very beginning) to make read.table use
>>> a fixed set of levels ("AA","AB", and "BB") and not to drop any
>>> (missing) level?
>>>
>>> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
>>>
>>> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
>>> it.
>>>
>>> Thank you very much for the help,
>>>
>>> Latchezar Dimitrov,
>>> Analyst/Programmer IV,
>>> Wake Forest University School of Medicine, Winston-Salem, North
>>> Carolina, USA
>>>
>>> ______________________________________________
>>> R-help@stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
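
For reference, the "uniformization" step discussed in this thread (give every SNP column the same three levels, then rbind the two cohorts) can be sketched roughly as follows. This is only a minimal sketch on made-up toy data: geno1 and geno2 are the names used in the thread, but their contents here, the helper uniform() and the vector calls are hypothetical, and nothing below was timed on the real 1002 x 238304 data.

```r
# Hypothetical toy cohorts: each column lost a different level on import,
# as happens when read.table only sees the genotypes actually present.
geno1 <- data.frame(snp1 = factor(c("AA", "BB", "AA")),
                    snp2 = factor(c("AB", "AB", "BB")))
geno2 <- data.frame(snp1 = factor(c("AB", "BB")),
                    snp2 = factor(c("AA", NA)))

calls <- c("AA", "AB", "BB")   # the fixed set of levels wanted everywhere

# Re-declare every column with the same three levels (hypothetical helper).
# factor() matches on labels, so absent genotypes just become unused levels
# and NA stays NA; df[] <- keeps the result a data.frame.
uniform <- function(df) {
  df[] <- lapply(df, function(x) factor(as.character(x), levels = calls))
  df
}

# With identical levels in both cohorts, rbind no longer mixes up the codes.
both <- rbind(uniform(geno1), uniform(geno2))

# 0/1/2 coding via match(), as in the reply at the top; the result is a list.
coded <- lapply(both, function(x) match(x, calls) - 1L)
```

Each column is handled in one vectorized call, so there is no loop over levels and no repeated assignment back into the data frame.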