Jim,

No, this is _not the problem. If you go back to my 1st mail, I have a monster (at least it was when I purchased it) with 32GB (sic :-) of RAM and 4 dual-core AMD64 285s (the fastest at that time and still pretty fast now :-).
The machine starts paging when I run 2 copies of R working on two things like that :-). If you look at my last e-mail, I found a solution but still have no clue why the heck x<-as.data.frame(y), where y is a list of the same columns, takes really forever, and this is the thing that killed me before.

Thanks,
Latchezar

> -----Original Message-----
> From: jim holtman [mailto:[EMAIL PROTECTED]
> Sent: Saturday, July 21, 2007 5:33 PM
> To: Latchezar Dimitrov
> Cc: Benilton Carvalho; [email protected]
> Subject: Re: [R] Dataframe of factors transform speed?
>
> One of the problems is that you are probably paging on your
> system with an object that size (240000 x 1000). This is
> about 1GB for a single object:
>
> > set.seed(123)
> > n <- 240000
> > system.time({
> + genoT <- lapply(1:n, function(i) factor(sample(c("AA", "AB", "BB"),
> +     1000, prob=c(1000, 1, 1), rep=T)))
> + })
>    user  system elapsed
>   95.00    0.61  104.71
> > names(genoT) = paste("snp", 1:n, sep="")
> >
> > object.size(genoT)
> [1] 1045258752
> >
>
> I can create it on my 2GB machine as a list, but have
> problems converting it to a dataframe because I don't have
> enough memory.
>
> So unless you have at least 4GB on your system, it might take
> a long time. Look at your performance measurements on your
> system and see if you have run out of physical memory and are paging.
>
> On 7/21/07, Latchezar Dimitrov <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Thanks for the help. My 1st question is still unanswered though :-)
> > Please see below
> >
> > > -----Original Message-----
> > > From: Benilton Carvalho [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, July 20, 2007 3:30 AM
> > > To: Latchezar Dimitrov
> > > Cc: [email protected]
> > > Subject: Re: [R] Dataframe of factors transform speed?
> > >
> > > set.seed(123)
> > > genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB",
> > > "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
> > > names(genoT) = paste("snp", 1:240000, sep="")
> > > genoT = as.data.frame(genoT)
> >
> > Now this _is the problem. Everything before converting to data.frame
> > worked almost instantaneously; however, as.data.frame runs forever.
> > Obviously there is some scalability/memory management issue. When I
> > tried my own method, but creating a new result dataframe (instead of
> > modifying the old one), it worked like a charm for the 1st 100 cols ~ .3s.
> > I figured 300,000 cols should be ~1000s. Nope! It ran for about
> > 50,000(!)s to finish about 42,000 cols only.
> >
> > BTW, what ver. of R is yours?
> >
> > Now here's what I "discovered" further.
> >
> > #-- create a 1-col frame:
> > geno <- data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),row.names=c(rownames(geno.GASP),rownames(geno.JAG)))
> >
> > #-- main code I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000,
> > #-- i.e., adding a 1000 of cols to geno each time
> >
> > system.time(
> > # for(j in 1:(ncol(geno.GASP ))){
> >   for(j in 3001:(4000 )){
> >     gt.GASP<-geno.GASP[[j]]
> >     for(l in 1:length(gt.GASP@levels)){
> >       levels(gt.GASP)[l] <- switch(gt.GASP@levels[l],AA="0",AB="1",BB="2")
> >     }
> >     gt.JAG <-geno.JAG [[j]]
> > #   for(l in 1:length(gt.JAG @levels)){
> > #     levels(gt.JAG )[l] <- switch(gt.JAG @levels[l],AA="0",AB="1",BB="2")
> > #   }
> >     geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
> > ###            factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
> >                        ,as.numeric(factor(gt.JAG, levels=0:2))-1
> >                        )
> >                      ,levels=0:2
> >                      )
> >   }
> > )
> >
> > Times (each one is for a 1000 cols!):
> > [1]  26.673   0.032  26.705   0.000   0.000
> > [1]  77.186   0.037  77.225   0.000   0.000
> > [1] 128.165   0.042 128.209   0.000   0.000
> > [1] 180.940   0.047 180.989   0.000   0.000
> >
> > See the big diff and the scaling I mentioned above?
> >
> > Furthermore, I removed the geno[[j]] assignment while leaving the
> > operation itself, i.e., replaced it with the ### line above. Times:
> >
> > [1] 0.857 0.008 0.865 0.000 0.000
> >
> > Huh!? What the heck! That's my second question :-) Any ideas?
> >
> > I still believe my method is near optimal. Of course I have to somehow
> > get rid of the assignment bottleneck.
> >
> > For now the lesson is: "God bless lists"
> >
> > Here is my final solution:
> >
> > > system.time({
> > + geno.GASP.L<-lapply(geno.GASP
> > +                    ,function(x){
> > +                       for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
> > +                       factor(x,levels=0:2)
> > +                     }
> > +                    )
> > + geno.JAG.L <-lapply(geno.JAG
> > +                    ,function(x){
> > + #                     for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
> > +                       factor(x,levels=0:2)
> > +                     }
> > +                    )
> > + })
> > [1] 192.800   1.566 194.413   0.000   0.000 !!!!!!!!! :-)))))
> > > system.time({
> > + class    (geno.GASP.L)<-"data.frame"
> > + row.names(geno.GASP.L)<-row.names(geno.GASP)
> > + class    (geno.JAG.L )<-"data.frame"
> > + row.names(geno.JAG.L )<-row.names(geno.JAG )
> > + })
> > [1] 12.156  0.001 12.155  0.000  0.000
> > > system.time({
> > + geno<-rbind(geno.GASP.L,geno.JAG.L)
> > + })
> > [1] 1542.340    9.072 2066.310    0.000    0.000
> >
> > I logged my notes here as I was trying various things. Partly the
> > reason is my two questions:
> >
> > "What was wrong with me?" and
> > "What the heck?!" remember above? :-)))
> >
> > which still remain unanswered :-(
> >
> > I would have had a lot of fun if I had not had to have this done by ...
> > Yesterday :-))
> >
> > Thanks a lot for the help
> >
> > Latchezar
> >
> > > dim(genoT)
> > > class(genoT)
> > > system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB", "BB"))-1))
> > > ##
> > > ##
> > >    user  system elapsed
> > > 119.288   0.004 119.339
> > >
> > > (for all 240K)
> > >
> > > best,
> > > b
> > >
> > > ps: note that "out" is a list.
> > >
> > > On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:
> > >
> > > > Hi,
> > > >
> > > >> -----Original Message-----
> > > >> From: Benilton Carvalho [mailto:[EMAIL PROTECTED]
> > > >> Sent: Friday, July 20, 2007 12:25 AM
> > > >> To: Latchezar Dimitrov
> > > >> Cc: [email protected]
> > > >> Subject: Re: [R] Dataframe of factors transform speed?
> > > >>
> > > >> it looks like that whatever method you used to genotype the
> > > >> 1002 samples on the STY array gave you a transposed matrix of
> > > >> genotype calls. :-)
> > > >
> > > > It only looks like :-)
> > > >
> > > > Otherwise it is a correctly created dataframe of 1002 samples X (big
> > > > number) of columns (SNP genotypes). It worked perfectly until I
> > > > decided to put together two cohorts independently processed in R
> > > > already. I got stuck with my lack of foreseeing. Otherwise I would
> > > > have put 3 dummy lines w/ AA, AB, and BB on each one to make sure all 3
> > > > genotypes are present and that's it! Lesson for the future :-)
> > > >
> > > > Maybe I am not using columns and rows appropriately here but the
> > > > dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-)
> > > > - as str says 1002 observ. of (big number) vars.
> > > >
> > > >>
> > > >> i'd use:
> > > >>
> > > >> genoT = read.table(yourFile, stringsAsFactors = FALSE)
> > > >>
> > > >> as a starting point... but I don't think that would be efficient (as
> > > >> you'd need to fix one column at a time - lapply).
> > > >
> > > > No it was not efficient at all.
> > > > 'matter of fact nothing is more
> > > > efficient than loading already read data, alas :-(
> > > >
> > > >>
> > > >> i'd preprocess yourFile before trying to load it:
> > > >>
> > > >> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
> > > >>
> > > >> and, now, in R:
> > > >>
> > > >> genoT = read.table(outFile, header=TRUE)
> > > >
> > > > ... Too late ;-) As it must be clear now I have two dataframes I want
> > > > to put together with rbind(geno1,geno2). The issue again is
> > > > "uniformization" of factor variables w/ missing factors - they ended
> > > > up like levels AA,BB on one of them and levels AB,BB on the other, which
> > > > means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1
> > > > on the second - complete mess. That's why I tried to make both uniform,
> > > > i.e. levels "AA","AB", and "BB" for every SNP and then rbind works.
> > > >
> > > > In any case my 1st question remains: "What's wrong with me?" :-)
> > > >
> > > > Thanks,
> > > > Latchezar
> > > >
> > > >>
> > > >> b
> > > >>
> > > >> On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
> > > >>
> > > >>> Hello,
> > > >>>
> > > >>> This is a speed question. I have a dataframe genoT:
> > > >>>
> > > >>>> dim(genoT)
> > > >>> [1]   1002 238304
> > > >>>
> > > >>>> str(genoT)
> > > >>> 'data.frame':   1002 obs. of  238304 variables:
> > > >>>  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> > > >>>  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> > > >>>  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> > > >>>  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> > > >>>  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> > > >>>  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> > > >>>  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> > > >>>  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> > > >>>  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> > > >>>  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> > > >>>  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> > > >>>  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> > > >>>  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> > > >>>  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> > > >>>  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> > > >>>  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> > > >>>  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> > > >>>  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> > > >>>  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> > > >>>  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> > > >>>  $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> > > >>>  $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> > > >>>  $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> > > >>>  $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
> > > >>>
> > > >>> Its columns are factors with different numbers of levels (from 1 to 3 -
> > > >>> that's what I got from read.table, i.e., it dropped missing levels). I
> > > >>> want to convert it to uniform factors with 3 levels. The 1st 10 rows
> > > >>> above show already converted columns and the rest are not yet
> > > >>> converted.
> > > >>> Here's my attempt, which is a complete failure as far as speed goes:
> > > >>>
> > > >>>> system.time(
> > > >>> + for(j in 1:(10 )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
> > > >>> +   gt<-genoT[[j]] #-- this is to avoid 2D indices
> > > >>> +   for(l in 1:length(gt@levels)){
> > > >>> +     levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> > > >>> +     genoT[[j]]<-factor(gt,levels=0:2) #-- make a 3-level factor and put it back
> > > >>> +   }
> > > >>> + }
> > > >>> + )
> > > >>> [1] 785.085   4.358 789.454   0.000   0.000
> > > >>>
> > > >>> 789s for 10 columns only!
> > > >>>
> > > >>> To me it seems like replacing 10 x 3 levels and then making a factor
> > > >>> of 1002 element vector x 10 is a "negligible" amount of operations
> > > >>> needed.
> > > >>>
> > > >>> So, what's wrong with me? Any idea how to accelerate significantly the
> > > >>> transformation or (to go to the very beginning) to make read.table use
> > > >>> a fixed set of levels ("AA","AB", and "BB") and not to drop any
> > > >>> (missing) level?
> > > >>>
> > > >>> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
> > > >>>
> > > >>> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
> > > >>> it.
> > > >>>
> > > >>> Thank you very much for the help,
> > > >>>
> > > >>> Latchezar Dimitrov,
> > > >>> Analyst/Programmer IV,
> > > >>> Wake Forest University School of Medicine, Winston-Salem, North
> > > >>> Carolina, USA
> > > >>>
> > > >>> ______________________________________________
> > > >>> [email protected] mailing list
> > > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > >>> and provide commented, minimal, self-contained, reproducible code.
> > > >>
> > > >
> >
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
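
For the archive, a minimal sketch of the list-first approach this thread converged on, run on toy data. It is not the code that was actually run on the 240K-column data: the helper names make_cohort and recode and the toy sizes are made up for illustration, and geno1/geno2 merely stand in for geno.GASP and geno.JAG. The idea, as far as the timings above suggest, is to keep every per-column step on plain lists (lapply/Map), declare the full level set explicitly so that absent genotypes do not shift the codes, and set the data.frame attributes only once at the very end instead of assigning columns into a wide data.frame.

```r
# Hypothetical toy cohorts; real data would be ~1000 rows x ~240,000 SNP columns.
set.seed(123)
calls <- c("AA", "AB", "BB")

make_cohort <- function(n_rows, n_snps, present) {
  cols <- lapply(seq_len(n_snps), function(i)
    factor(sample(present, n_rows, replace = TRUE)))  # absent levels are dropped
  names(cols) <- paste("snp", seq_len(n_snps), sep = "")
  as.data.frame(cols)
}

geno1 <- make_cohort(6, 4, c("AA", "BB"))  # cohort without any "AB" calls
geno2 <- make_cohort(5, 4, c("AB", "BB"))  # cohort without any "AA" calls

# Recode by label, not by internal integer code: AA -> 0, AB -> 1, BB -> 2.
# All three levels are declared even when one is absent from a column.
recode <- function(x) factor(match(as.character(x), calls) - 1L, levels = 0:2)

# Stack the two cohorts column by column, still working on a plain list.
stacked <- Map(function(a, b) recode(c(as.character(a), as.character(b))),
               geno1, geno2)

# Turn the list into a data.frame once, by setting its attributes directly.
geno <- structure(stacked,
                  class = "data.frame",
                  row.names = c(NA_integer_, -length(stacked[[1]])))

str(geno)
```

Matching on the labels is what avoids the "complete mess" described above, where as.numeric gives 1 for AA in one cohort and 1 for AB in the other. Whether this is fast enough at full scale on a given R version is something to measure with system.time rather than assume.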
