Hi,

Thanks for the help. My 1st question is still unanswered though :-) Please see below.

> -----Original Message-----
> From: Benilton Carvalho [mailto:[EMAIL PROTECTED]
> Sent: Friday, July 20, 2007 3:30 AM
> To: Latchezar Dimitrov
> Cc: [email protected]
> Subject: Re: [R] Dataframe of factors transform speed?
>
> set.seed(123)
> genoT = lapply(1:240000, function(i) factor(sample(c("AA",
> "AB", "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
> names(genoT) = paste("snp", 1:240000, sep="")
> genoT = as.data.frame(genoT)

Now this _is_ the problem. Everything before converting to data.frame worked almost instantaneously; however, as.data.frame runs forever. Obviously there is some scalability/memory management issue. When I tried my own method, but creating a new result dataframe (instead of modifying the old one), it worked like a charm for the 1st 100 cols: ~0.3s. I figured 300,000 cols should take ~1000s. Nope! It ran for about 50,000(!)s to finish only about 42,000 cols.

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
geno <- data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),row.names=c(rownames(geno.GASP),rownames(geno.JAG)))

#-- main code. I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000, i.e., adding 1000 cols to geno each time
system.time(
# for(j in 1:(ncol(geno.GASP ))){
  for(j in 3001:(4000 )){
    gt.GASP<-geno.GASP[[j]]
    for(l in 1:length([EMAIL PROTECTED])){
      levels(gt.GASP)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2")
    }
    gt.JAG <-geno.JAG [[j]]
#   for(l in 1:length(gt.JAG @levels)){
#     levels(gt.JAG )[l] <- switch(gt.JAG @levels[l],AA="0",AB="1",BB="2")
#   }
    geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###               factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
                       ,as.numeric(factor(gt.JAG, levels=0:2))-1
                       )
                     ,levels=0:2
                     )
  }
)

Times (each one is for 1000 cols!):

[1] 26.673 0.032 26.705 0.000 0.000
[1] 77.186 0.037 77.225 0.000 0.000
[1] 128.165 0.042 128.209 0.000 0.000
[1] 180.940 0.047 180.989 0.000 0.000

See the big diff and the scaling I mentioned above?

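A minimal sketch of the comparison described above, on made-up toy data. The sizes and the names `cols`, `df`, `out`, `t_df`, and `t_list` are illustrative and are not objects from this thread. It grows a data.frame one column at a time and, separately, fills a plain list and converts it once at the end. The assumption being probed is that each `[[<-` on a data.frame may carry overhead that grows with the number of columns already present, while list assignment does not; actual timings depend on the R version and are deliberately not shown.

```r
# Toy sizes (assumption); the real data has 1002 rows and ~238,000 columns.
n_rows <- 100
n_cols <- 2000
genotypes <- c("AA", "AB", "BB")

set.seed(1)
cols <- lapply(seq_len(n_cols), function(i)
  factor(sample(genotypes, n_rows, replace = TRUE), levels = genotypes))
names(cols) <- paste0("snp", seq_len(n_cols))

# (a) Grow a data.frame column by column, as in the loop above.
t_df <- system.time({
  df <- data.frame(cols[[1]])
  names(df) <- names(cols)[1]
  for (j in 2:n_cols) df[[names(cols)[j]]] <- cols[[j]]
})

# (b) Fill a preallocated list, then convert once at the end.
t_list <- system.time({
  out <- vector("list", n_cols)
  for (j in seq_len(n_cols)) out[[j]] <- cols[[j]]
  names(out) <- names(cols)
  df2 <- as.data.frame(out)
})

print(rbind(data.frame = t_df, list = t_list))
```

Rerunning with larger `n_cols` and comparing `t_df` against `t_list` should show whether the per-column cost grows on a given installation.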
Furthermore, I removed the geno[[j]] assignment while leaving the operation itself in place, i.e., replaced it with the ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have to somehow get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"

Here is my final solution:

> system.time({
+   geno.GASP.L<-lapply(geno.GASP
+     ,function(x){
+       for(l in 1:length([EMAIL PROTECTED])){levels(x)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2")}
+       factor(x,levels=0:2)
+     }
+   )
+   geno.JAG.L <-lapply(geno.JAG
+     ,function(x){
+ #     for(l in 1:length([EMAIL PROTECTED])){levels(x)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2")}
+       factor(x,levels=0:2)
+     }
+   )
+ })
[1] 192.800 1.566 194.413 0.000 0.000

!!!!!!!!! :-)))))

> system.time({
+   class (geno.GASP.L)<-"data.frame"
+   row.names(geno.GASP.L)<-row.names(geno.GASP)
+   class (geno.JAG.L )<-"data.frame"
+   row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156 0.001 12.155 0.000 0.000

> system.time({
+   geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340 9.072 2066.310 0.000 0.000

I logged my notes here as I was trying various things. Part of the reason is my two questions, "What was wrong with me?" and "What the heck?!" (remember above? :-))) which still remain unanswered :-(

I would have had a lot of fun if I had not had to have this done by ... Yesterday :-))

Thanks a lot for the help

Latchezar

> dim(genoT)
> class(genoT)
> system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB",
> "BB"))-1))
> ##
> ##
> user system elapsed
> 119.288 0.004 119.339
>
> (for all 240K)
>
> best,
> b
>
> ps: note that "out" is a list.
>
> On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:
>
> > Hi,
> >
> >> -----Original Message-----
> >> From: Benilton Carvalho [mailto:[EMAIL PROTECTED]
> >> Sent: Friday, July 20, 2007 12:25 AM
> >> To: Latchezar Dimitrov
> >> Cc: [email protected]
> >> Subject: Re: [R] Dataframe of factors transform speed?
> >>
> >> it looks like that whatever method you used to genotype the
> >> 1002 samples on the STY array gave you a transposed matrix of
> >> genotype calls. :-)
> >
> > It only looks like :-)
> >
> > Otherwise it is a correctly created dataframe of 1002 samples X (big
> > number) of columns (SNP genotypes). It worked perfectly until I
> > decided to put together two cohorts independently processed in R
> > already. I got stuck with my lack of foresight. Otherwise I would
> > have put 3 dummy lines w/ AA, AB, and BB on each one to make sure all 3
> > genotypes are present and that's it! Lesson for the future :-)
> >
> > Maybe I am not using columns and rows appropriately here but the
> > dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-)
> > - as str says 1002 observ. of (big number) vars.
> >
> >>
> >> i'd use:
> >>
> >> genoT = read.table(yourFile, stringsAsFactors = FALSE)
> >>
> >> as a starting point... but I don't think that would be efficient (as
> >> you'd need to fix one column at a time - lapply).
> >
> > No, it was not efficient at all. As a matter of fact nothing is more
> > efficient than loading already read data, alas :-(
> >
> >>
> >> i'd preprocess yourFile before trying to load it:
> >>
> >> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
> >>
> >> and, now, in R:
> >>
> >> genoT = read.table(outFile, header=TRUE)
> >
> > ... Too late ;-) As it must be clear now I have two dataframes I want
> > to put together with rbind(geno1,geno2).
> > The issue again is
> > "uniformization" of factor variables w/ missing levels - they ended
> > up like levels AA,BB on one of them and levels AB,BB on the other, which
> > means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on
> > the second - complete mess. That's why I tried to make both uniform,
> > i.e. levels "AA","AB", and "BB" for every SNP, and then rbind works.
> >
> > In any case my 1st question remains: "What's wrong with me?" :-)
> >
> > Thanks,
> > Latchezar
> >
> >>
> >> b
> >>
> >> On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
> >>
> >>> Hello,
> >>>
> >>> This is a speed question. I have a dataframe genoT:
> >>>
> >>>> dim(genoT)
> >>> [1] 1002 238304
> >>>
> >>>> str(genoT)
> >>> 'data.frame': 1002 obs. of 238304 variables:
> >>> $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> >>> $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> >>> $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> >>> $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> >>> $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> >>> $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> >>> $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> >>> $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> >>> $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> >>> $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> >>> $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> >>> $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> >>> $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> >>> $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> >>> $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> >>> $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> >>> $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> >>> $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> >>> $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> >>> $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> >>> $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> >>> $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> >>> $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> >>> $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
> >>>
> >>> Its columns are factors with different numbers of levels (from 1 to 3 -
> >>> that's what I got from read.table, i.e., it dropped missing levels). I
> >>> want to convert it to uniform factors with 3 levels. The 1st 10 rows
> >>> above show already converted columns and the rest are not yet
> >>> converted.
> >>> Here's my attempt, which is a complete failure in terms of speed:
> >>>
> >>>> system.time(
> >>> + for(j in 1:(10 )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
> >>> +   gt<-genoT[[j]] #-- this is to avoid 2D indices
> >>> +   for(l in 1:length([EMAIL PROTECTED])){
> >>> +     levels(gt)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> >>> +     genoT[[j]]<-factor(gt,levels=0:2) #-- make a 3-level factor and put it back
> >>> +   }
> >>> + }
> >>> + )
> >>> [1] 785.085 4.358 789.454 0.000 0.000
> >>>
> >>> 789s for 10 columns only!
> >>>
> >>> To me it seems like replacing 10 x 3 levels and then making a factor
> >>> of 1002 element vector x 10 is a "negligible" amount of operations
> >>> needed.
> >>>
> >>> So, what's wrong with me? Any idea how to accelerate significantly the
> >>> transformation or (to go to the very beginning) to make read.table use
> >>> a fixed set of levels ("AA","AB", and "BB") and not to drop any
> >>> (missing) level?
> >>>
> >>> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
> >>>
> >>> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
> >>> it.
> >>>
> >>> Thank you very much for the help,
> >>>
> >>> Latchezar Dimitrov,
> >>> Analyst/Programmer IV,
> >>> Wake Forest University School of Medicine, Winston-Salem, North
> >>> Carolina, USA
> >>>
> >>> ______________________________________________
> >>> [email protected] mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>
>
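A minimal sketch of the "uniform levels" idea discussed in this thread, on tiny made-up data. `geno1`, `geno2`, `txt`, and the helper `relevel_all` are hypothetical names, not objects from the original posts. It re-levels every column by label (so a level missing in one cohort cannot shift the integer codes), binds the two cohorts, and also shows one possible way to keep read.table from dropping unseen levels: read the columns as character and build the factors afterwards. This is a sketch under those assumptions and has not been timed on data of the size described here.

```r
genotypes <- c("AA", "AB", "BB")

# Two toy cohorts whose columns observed different subsets of the genotypes.
geno1 <- data.frame(snp1 = factor(c("AA", "BB", "AA")),
                    snp2 = factor(c("AB", "AB", "BB")))
geno2 <- data.frame(snp1 = factor(c("AB", "BB")),
                    snp2 = factor(c("AA", "AA")))

# Hypothetical helper: give every column the same three levels, matching
# on the labels rather than on the existing integer codes.
relevel_all <- function(df) {
  df[] <- lapply(df, function(x) factor(as.character(x), levels = genotypes))
  df
}

geno <- rbind(relevel_all(geno1), relevel_all(geno2))

# 0/1/2 coding derived from the now-uniform levels.
coded <- as.integer(geno$snp1) - 1L

# Reading as character first keeps read.table from building per-column levels.
txt <- "snp1 snp2\nAA AB\nAA AB\n"
raw <- read.table(text = txt, header = TRUE, colClasses = "character")
fixed <- relevel_all(raw)
str(fixed)
```

For very wide data, the single `df[] <-` assignment could be replaced by keeping the `lapply` result as a list, in line with the list-based approach reported earlier in this thread.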
