On Thu, 19 Jul 2007, Latchezar Dimitrov wrote:

> Hello,
>
> This is a speed question. I have a dataframe genoT:
>
>> dim(genoT)
> [1] 1002 238304

It looks like these are all numeric originally. Handling these as a vector or
matrix will speed things up a bit. You can then stitch together a data.frame:

# simulate:
# genoT.names <- scan('data.file', what='a', nlines=1, <etc> )
# genoT <- scan('data.file',skip=1)
#
>
> genoT <- sample(0:2, 240000*1002, repl=T)
> t1 <- proc.time()
> genoT <- factor(genoT,0:2,c("AA","AB","BB"))
> dim(genoT) <- c(1002,240000)
> genoT.list <- lapply(1:240000, function(x) genoT[,x])
> # simulate: names(genoT.list) <- genoT.names :
> names(genoT.list) <- make.names(1:240000)
> class(genoT.list) <- "data.frame"
> row.names(genoT.list) <- 1:1002
> proc.time()-t1
   user  system elapsed
 20.978   2.036  49.714
>

Most of the _elapsed_ time is due to lags in copy-and-paste-ing in the commands.

HTH,

Chuck

>
>> str(genoT)
> 'data.frame': 1002 obs. of 238304 variables:
> $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
>
> Its columns are factors with different numbers of levels (from 1 to 3 -
> that's what I got from read.table, i.e., it dropped missing levels). I
> want to convert it to uniform factors with 3 levels. The 1st 10 rows
> above show already converted columns and the rest are not yet converted.
> Here's my attempt, which is a complete failure in terms of speed:
>
>> system.time(
> + for(j in 1:(10 )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
>
> +   gt<-genoT[[j]] #-- this is to avoid 2D indices
> +   for(l in 1:length([EMAIL PROTECTED])){
> +     levels(gt)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> +     genoT[[j]]<-factor(gt,levels=0:2) #-- make a 3-level factor and put it back
> +   }
> + }
> + )
> [1] 785.085 4.358 789.454 0.000 0.000
>
> 789s for 10 columns only!
>
> To me it seems like replacing 10 x 3 levels and then making a factor of
> 1002 element vector x 10 is a "negligible" amount of operations needed.
>
> So, what's wrong with me?
> Any idea how to accelerate significantly the
> transformation or (to go to the very beginning) to make read.table use a
> fixed set of levels ("AA","AB", and "BB") and not to drop any (missing)
> level?
>
> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
>
> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
> it.
>
> Thank you very much for the help,
>
> Latchezar Dimitrov,
> Analyst/Programmer IV,
> Wake Forest University School of Medicine,
> Winston-Salem, North Carolina, USA
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]                  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
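
For the case where the data frame has already been read in with per-column level sets, here is a minimal sketch of a column-wise conversion that avoids the inner loop over levels. It is not taken from the thread: the tiny data frame and the names SNP_1 to SNP_3 and all.levels are made up for illustration, and it assumes every column holds only "AA", "AB", "BB" or NA. The quoted loop assigns into genoT once per level, which probably copies the very wide data frame each time; rebuilding all columns in a single lapply() pass sidesteps that.

```r
# Hypothetical stand-in for genoT (not the poster's data): factor columns
# whose level sets differ because unused levels were dropped on input.
genoT <- data.frame(
  SNP_1 = factor(c("AA", "AB", "BB", "AA")),
  SNP_2 = factor(c("AA", "AA", "AB", NA)),
  SNP_3 = factor(c("BB", "BB", "BB", "BB"))
)

# The full level set every column should share.
all.levels <- c("AA", "AB", "BB")

# Rebuild each column against the full level set. Going through
# as.character() matches on the labels, so each value keeps its genotype
# and levels absent from a column are added as empty levels.
# Assigning to genoT[] keeps the result a data.frame.
genoT[] <- lapply(genoT, function(col) {
  factor(as.character(col), levels = all.levels)
})

str(genoT)
```

If the "0", "1", "2" coding from the quoted loop is wanted instead, passing labels = c("0", "1", "2") to factor() in the same call should give it.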