Thanks, please find what I got:

> str(getProfileData(cgds,GeneList, "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
'data.frame': 48 obs. of 10 variables:
 $ ATM  : num NA NA NA NA NA NA NA NA NA NA ...
 $ ATR  : num NA NA NA NA NA NA NA NA NA NA ...
 $ DDR2 : num 0.714 0.857 0.549 0.669 0.587 ...
 $ HPGDS: num 0.505 0.722 0.528 0.411 0.497 ...
 $ MDC1 : num NA NA NA NA NA NA NA NA NA NA ...
 $ MLH1 : num NA NA NA NA NA NA NA NA NA NA ...
 $ MS4A2: num 0.83 0.853 0.835 0.716 0.481 ...
 $ MSH2 : num NA NA NA NA NA NA NA NA NA NA ...
 $ PARP1: num NA NA NA NA NA NA NA NA NA NA ...
 $ SSUH2: num 0.73 0.842 0.794 0.854 0.803 ...

> str(getProfileData(cgds,GeneList, "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
'data.frame': 338 obs. of 10 variables:
 $ ATM  : Factor w/ 338 levels "0.01060883","0.01065690",..: 256 182 170 101 53 302 183 236 298 334 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ ATR  : Factor w/ 338 levels "0.009422188",..: 271 265 165 215 222 304 176 170 228 277 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ DDR2 : Factor w/ 338 levels "0.38369598","0.42008010",..: 197 161 25 291 40 38 155 85 177 180 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ HPGDS: Factor w/ 338 levels "0.16077929","0.18867898",..: 85 56 208 281 116 67 132 119 152 49 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MDC1 : Factor w/ 338 levels "0.06105770","0.06532153",..: 162 267 185 180 253 220 108 230 239 271 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MLH1 : Factor w/ 338 levels "0.009031445",..: 299 194 160 45 198 224 115 167 287 165 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MS4A2: Factor w/ 338 levels "0.31286204","0.438797860",..: 266 210 329 111 40 49 21 68 134 331 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MSH2 : Factor w/ 338 levels "0.009568869",..: 260 270 179 114 215 137 263 78 300 283 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ PARP1: Factor w/ 338 levels "0.01110587","0.01208177",..: 249 260 65 191 219 204 32 132 130 225 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ SSUH2: Factor w/ 338 levels "0.17618607","0.184911562",..: 243 276 93 82 99 236 51 88 163 138 ...
  ..- attr(*, "names")= chr "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
>
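
For what it is worth, here is a minimal sketch of one way such factor columns could be turned back into numbers while leaving truly textual columns (gender, tumor stage, ...) alone and keeping row and column names. The helper name to_numeric_where_possible and the toy data are assumptions made up for illustration; they are not part of cgdsr, and this does not explain why the factors appear in the first place.

```r
# Hypothetical helper (not part of cgdsr): convert every column whose
# non-missing entries all parse as numbers; leave the other columns untouched.
to_numeric_where_possible <- function(df) {
  # df[] <- ... keeps the data.frame class, row names, and column names.
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) return(col)  # already numeric, nothing to do
    # Go through character first: as.numeric() on a factor returns the
    # integer level codes, not the numbers shown as level labels.
    txt <- as.character(col)
    num <- suppressWarnings(as.numeric(txt))
    # Entries that are legitimately missing in the input.
    is_missing <- is.na(txt) | txt %in% c("", "NA", "NaN")
    # Convert only if no real value would be lost.
    if (all(is_missing | !is.na(num))) num else col
  })
  df
}

# Toy data standing in for a getProfileData() result (values are made up).
toy <- data.frame(
  ATM    = factor(c("0.019", "0.017", "NaN")),
  gender = factor(c("male", "female", "male")),
  row.names = c("TCGA.BR.6452.01", "TCGA.BR.6453.01", "TCGA.BR.6454.01")
)
fixed <- to_numeric_where_possible(toy)
str(fixed)
```

In this sketch the ATM column becomes numeric, gender stays a factor, and the sample identifiers remain as row names.
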
  Ô__
 c/ /'_;~~~~kmezhoud
(*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
http://bioinformatics.tn/

On Wed, Dec 31, 2014 at 6:39 PM, William Dunlap <[email protected]> wrote:

> > But this heterogeneity appears even with a data.frame that is supposed
> > to be all numeric (gene expression). Here is an example:
> >
> > library(cgdsr)
> > GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
> > ,"ATR", "MDC1" ,"PARP1")
> > cgds<-CGDS("http://www.cbioportal.org/public-portal/")
> >
> > str(getProfileData(cgds,GeneList,
> > "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
> >
> > str(getProfileData(cgds,GeneList,
> > "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
> >
> > On my computer I do not get the same structure for both (numeric vs factor).
>
> Can you show us what you got? I am a bit surprised that you got any factors
> because putting a trace on read.table shows that getProfileData calls it
> with as.is=TRUE (meaning to not convert character columns to factors). I got
> all numeric columns:
>
> > trace(read.table)
> > str(getProfileData(cgds,GeneList,
> + "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
> trace: read.table(url, skip = 0, header = TRUE, as.is = TRUE, sep = "\t",
>     quote = "")
> 'data.frame': 48 obs. of 10 variables:
>  $ ATM  : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>  $ ATR  : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>  $ DDR2 : num 0.714 0.857 0.549 0.669 0.587 ...
>  $ HPGDS: num 0.505 0.722 0.528 0.411 0.497 ...
>  $ MDC1 : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>  $ MLH1 : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>  $ MS4A2: num 0.83 0.853 0.835 0.716 0.481 ...
>  $ MSH2 : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>  $ PARP1: num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>  $ SSUH2: num 0.73 0.842 0.794 0.854 0.803 ...
>
> > str(getProfileData(cgds,GeneList,
> + "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
> trace: read.table(url, skip = 0, header = TRUE, as.is = TRUE, sep = "\t",
>     quote = "")
> 'data.frame': 338 obs. of 10 variables:
>  $ ATM  : num 0.019 0.017 0.0168 0.015 0.014 ...
>  $ ATR  : num 0.0356 0.0346 0.0231 0.0275 0.0285 ...
>  $ DDR2 : num 0.81 0.786 0.596 0.861 0.646 ...
>  $ HPGDS: num 0.576 0.528 0.703 0.781 0.622 ...
>  $ MDC1 : num 0.189 0.265 0.201 0.199 0.249 ...
>  $ MLH1 : num 0.404 0.0192 0.017 0.0124 0.0197 ...
>  $ MS4A2: num 0.913 0.898 0.937 0.861 0.768 ...
>  $ MSH2 : num 0.018 0.0184 0.016 0.0145 0.0168 ...
>  $ PARP1: num 0.0191 0.0195 0.0146 0.0174 0.0181 ...
>  $ SSUH2: num 0.848 0.874 0.644 0.621 0.652 ...
>
> Perhaps some option or locale setting is causing input strings to be
> interpreted as non-numbers. (If you know all these columns should
> be numeric, you could add colClasses=rep("numeric", length(GeneList))
> to the call to read.table. See which entries show up as NA and reread
> with colClasses=rep("character",length(GeneList)) to see where they
> came from).
>
> It is almost always better to get the data input correctly rather than
> trying to fix it up later. If you must convert later, using apply(), which
> converts the data.frame to a matrix with a single class for all columns,
> often causes problems. sapply() may or may not convert its output to a
> matrix, depending on what FUN returns. Use lapply instead, with a function
> that uses the class of its input to decide what to do.
>     DataFrame[] <- lapply(DataFrame, FUN=function(col)...)
> will retain the class, row names, and column names of the data.frame.
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Dec 31, 2014 at 8:24 AM, Karim Mezhoud <[email protected]> wrote:
>
>> Concretely, I query cbioportal through the cgdsr package.
>> Depending on the Cases and Genetic profiles, I generally receive a
>> data.frame with heterogeneous structure. The bad case is when the
>> returned data.frame is composed of numeric and character columns; in this
>> case the numeric columns are treated as factors. This happens when I
>> explore/extract information from Clinical Data (age, gender, tumor
>> stage, ...). In this case I need to convert only the numeric columns and
>> not the character ones. I am using
>> grep("[0-9]*.[0-9]*",df[,i])!=0 {fun to convert}.
>>
>> But this heterogeneity appears even with a data.frame that is supposed
>> to be all numeric (gene expression). Here is an example:
>>
>>
>> library(cgdsr)
>> GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
>> ,"ATR", "MDC1" ,"PARP1")
>> cgds<-CGDS("http://www.cbioportal.org/public-portal/")
>>
>> str(getProfileData(cgds,GeneList,
>> "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
>>
>> str(getProfileData(cgds,GeneList,
>> "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
>>
>>
>> On my computer I do not get the same structure for both (numeric vs factor).
>>
>> Also I need to preserve row and column names ;)
>> So I am working to resolve these details depending on the data from
>> cbioportal...
>>
>> Thank you
>>
>>
>>   Ô__
>>  c/ /'_;~~~~kmezhoud
>> (*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
>> http://bioinformatics.tn/
>>
>>
>>
>> On Wed, Dec 31, 2014 at 4:37 PM, Karim Mezhoud <[email protected]>
>> wrote:
>>
>> > Many Many Many thanks!
>> > It is a demonstrative lesson. I need time to test all examples :)
>> > Thank you for your time and support.
>> > Happy and Healthy New Year
>> >
>> >   Ô__
>> >  c/ /'_;~~~~kmezhoud
>> > (*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
>> > http://bioinformatics.tn/
>> >
>> >
>> >
>> > On Wed, Dec 31, 2014 at 2:38 PM, Martin Morgan <[email protected]>
>> > wrote:
>> >
>> >> On 12/31/2014 12:22 AM, Karim Mezhoud wrote:
>> >>
>> >>> Thanks,
>> >>> It seems the for loop takes less time ;)
>> >>>
>> >>> with
>> >>> dim(DataFrame)
>> >>> [1] 338 70
>> >>>
>> >>> For loop has
>> >>>    user  system elapsed
>> >>>   0.012   0.000   0.012
>> >>>
>> >>> and apply has
>> >>>    user  system elapsed
>> >>>   0.020   0.000   0.021
>> >>>
>> >>
>> >> The timings are so short that the answer in terms of speed is 'it does
>> >> not matter'.
>> >>
>> >> Here is a selection of approaches
>> >>
>> >> f0 <- function(df) {
>> >>     for (i in seq_along(df))
>> >>         df[,i] <- as.numeric(df[,i])
>> >>     df
>> >> }
>> >>
>> >> f0a <- function(df) {
>> >>     ## data.frame is a list-of-equal-length vectors; access each
>> >>     ## column with "[["
>> >>     for (i in seq_along(df))
>> >>         df[[i]] <- as.numeric(df[[i]])
>> >>     df
>> >> }
>> >>
>> >> f0c <- compiler::cmpfun(f0) ## loops sometimes benefit from compilation
>> >>
>> >> f1 <- function(df)
>> >>     as.data.frame(apply(df, 2, as.numeric))
>> >>
>> >> f2 <- function(df) {
>> >>     ## replace all columns of df with list-of-vectors
>> >>     df[] <- lapply(df, as.numeric)
>> >>     df
>> >> }
>> >>
>> >> f3 <- function(df) {
>> >>     ## coerce to matrix to avoid the explicit loop, use mode<- to
>> >>     ## change storage of elements
>> >>     m <- as.matrix(df)
>> >>     mode(m) <- "numeric"
>> >>     as.data.frame(m)
>> >> }
>> >>
>> >> f4 <- function(df) {
>> >>     ## if it's a matrix, why are we returning a data.frame?
>> >>     m <- as.matrix(df)
>> >>     mode(m) <- "numeric"
>> >>     m
>> >> }
>> >>
>> >> f4a <- function(df)
>> >>     ## unlist to single vector, coerce, then format as matrix
>> >>     matrix(as.numeric(unlist(df, use.names=FALSE)), nrow(df),
>> >>            dimnames=dimnames(df))
>> >>
>> >> It's important to test that different methods return the same result
>> >> (perhaps allowing for differences in attributes such as row or column
>> >> names). The microbenchmark package repeats timings across multiple trials
>> >> (default 100 times).
>> >>
>> >> library(microbenchmark)
>> >> test <- function(df) {
>> >>     stopifnot(
>> >>         identical(f0(df), f0a(df)),
>> >>         identical(f0(df), f0c(df)),
>> >>         identical(f0(df), f1(df)),
>> >>         identical(f0(df), f2(df)),
>> >>         identical(f0(df), f3(df)),
>> >>         identical(as.matrix(f0(df)), f4(df)),
>> >>         all.equal(f4(df), f4a(df), check.attributes=FALSE))
>> >>     microbenchmark(f0(df), f0a(df), f0c(df), f1(df), f2(df), f3(df),
>> >>                    f4(df), f4a(df))
>> >> }
>> >>
>> >> Here are some data sets
>> >>
>> >> m <- matrix(rnorm(338 * 70), 338)
>> >> df <- as.data.frame(m)
>> >> dfc <- as.data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
>> >> dff <- as.data.frame(lapply(df, as.character))
>> >>
>> >> and results
>> >>
>> >> > test(df)
>> >> Unit: microseconds
>> >>     expr      min        lq      mean    median        uq      max neval
>> >>   f0(df) 6208.956 6270.5500 6367.4138 6306.7110 6362.2225 7731.281   100
>> >>  f0a(df) 2917.973 2975.2090 3024.8623 3002.3805 3036.5365 3951.618   100
>> >>  f0c(df) 6078.399 6150.1085 6264.0998 6188.3690 6244.5725 7684.116   100
>> >>   f1(df) 2698.074 2743.2905 2821.8453 2769.3655 2805.5345 4033.229   100
>> >>   f2(df) 1989.057 2041.0685 2066.1830 2055.0020 2083.8545 2267.732   100
>> >>   f3(df) 1532.435 1572.9810 1609.7378 1597.6245 1624.2305 2003.584   100
>> >>   f4(df)  808.593  828.5445  852.2626  847.5355  864.6665 1180.977   100
>> >>  f4a(df)  422.657  437.2705  458.9845  455.2470  465.5815  695.443   100
>> >> > test(dfc)
>> >> Unit: milliseconds
>> >>     expr       min        lq      mean    median        uq       max neval
>> >>   f0(df) 11.416532 11.647858 11.915287 11.767647 12.016276 14.239622   100
>> >>  f0a(df)  8.095709  8.211116  8.380638  8.289895  8.454948  9.529026   100
>> >>  f0c(df) 11.339293 11.577811 11.772087 11.702341 11.896729 12.674766   100
>> >>   f1(df)  8.227371  8.277147  8.422412  8.331403  8.490411  9.145499   100
>> >>   f2(df)  6.907888  7.010828  7.162529  7.147198  7.239048  7.763758   100
>> >>   f3(df)  6.608107  6.688232  6.845936  6.792066  6.892635  8.359274   100
>> >>   f4(df)  5.859482  5.939680  6.046976  5.993804  6.105388  6.968601   100
>> >>  f4a(df)  5.372214  5.460987  5.556687  5.521542  5.614482  6.107081   100
>> >> > test(dff)
>> >> Error: identical(f0(df), f1(df)) is not TRUE
>> >>
>> >> Except when dealing with factors, the use of explicit loops is the
>> >> slowest. With factors, matrix-based methods coerce the level labels to
>> >> numeric, whereas vector-based methods coerce the underlying codes (level
>> >> values) of the factor; obviously great care needs to be taken.
>> >>
>> >> > f0(dff)[1:5, 1:5]
>> >>    V1  V2  V3  V4  V5
>> >> 1 150 232 294  88  56
>> >> 2 159   8  89  59  10
>> >> 3 132 171  40 205 119
>> >> 4 214 273  26 262 216
>> >> 5 281  49 255  31 233
>> >> > f1(dff)[1:5, 1:5]
>> >>           V1          V2         V3         V4          V5
>> >> 1 -1.7092463  0.50234009  0.8492982 -0.5636901 -0.38545566
>> >> 2 -2.3020854 -0.05580931 -0.5963673 -0.3671748 -0.09408031
>> >> 3 -1.2915110 -2.46181533 -0.2470108  0.3301129 -1.06810225
>> >> 4  0.3065989  0.89263099 -0.1717432  0.7721411  0.35856334
>> >> 5  0.8795616 -0.43049898  0.4560515 -0.1722099  0.46125149
>> >>
>> >> In terms of 'best practice', I would represent my data in the appropriate
>> >> data structure in the first place (as a matrix of appropriate type, rather
>> >> than data.frame, so the entire coercion is irrelevant). If faced with a
>> >> data.frame with specific columns to coerce I would use the approach
>> >>
>> >> cidx <- sapply(df, is.character) # index of columns to coerce
>> >> df[cidx] <- lapply(df[cidx], as.numeric)
>> >>
>> >> which seems to be reasonably correct, expressive, compact, and speedy.
>> >>
>> >> Martin Morgan
>> >>
>> >>
>> >>
>> >>>   Ô__
>> >>>  c/ /'_;~~~~kmezhoud
>> >>> (*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
>> >>> http://bioinformatics.tn/
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Dec 31, 2014 at 8:54 AM, Berend Hasselman <[email protected]> wrote:
>> >>>
>> >>>
>> >>>> On 31-12-2014, at 08:40, Karim Mezhoud <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi All,
>> >>>>> I would like to choose between these two ways of converting a data
>> >>>>> frame. Which is faster?
>> >>>>>
>> >>>>> for(i in 1:ncol(DataFrame)){
>> >>>>>
>> >>>>> DataFrame[,i] <- as.numeric(DataFrame[,i])
>> >>>>> }
>> >>>>>
>> >>>>>
>> >>>>> OR
>> >>>>>
>> >>>>> DataFrame <- as.data.frame(apply(DataFrame,2 ,function(x)
>> >>>>> as.numeric(x)))
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>> Try it and use system.time.
>> >>>>
>> >>>> Berend
>> >>>>
>> >>>>> Thanks
>> >>>>> Karim
>> >>>>>   Ô__
>> >>>>>  c/ /'_;~~~~kmezhoud
>> >>>>> (*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
>> >>>>> http://bioinformatics.tn/
>> >>>>>
>> >>>>> [[alternative HTML version deleted]]
>> >>>>>
>> >>>>> ______________________________________________
>> >>>>> [email protected] mailing list -- To UNSUBSCRIBE and more, see
>> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>> PLEASE do read the posting guide
>> >>>>> http://www.R-project.org/posting-guide.html
>> >>>>> and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >> --
>> >> Computational Biology / Fred Hutchinson Cancer Research Center
>> >> 1100 Fairview Ave. N.
>> >> PO Box 19024 Seattle, WA 98109
>> >>
>> >> Location: Arnold Building M1 B861
>> >> Phone: (206) 667-2793

