2016-04-16 22:55 GMT+02:00 Martin Morgan <martin.mor...@roswellpark.org>:
>
> On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
>
>> Hello,
>>
>> I would like to ask you all for advice on the following issue.
>>
>> Last year I started working with data from The Cancer Genome Atlas.
>> During that work our team (https://github.com/orgs/RTCGA/people)
>> prepared some tools for downloading and integrating datasets from the
>> TCGA study and provided them in the R package called RTCGA
>> <https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which
>> is available on Bioconductor.
>>
>> Later on we were working on tools for visualizing and analyzing the most
>> popular datasets from TCGA, so we prepared data packages with those
>> datasets and submitted them to Bioconductor in 8 separate packages. You
>> can read more about them here: http://rtcga.github.io/RTCGA/
>>
>> *I have a question about updating those data packages.* TCGA releases
>> dataset snapshots over time. The RTCGA family of R packages provides
>> datasets from the release date 2015-11-01, but currently one can check
>> that there is a newer release, 2016-01-28:
>>
>> > tail(RTCGA::checkTCGA('Dates'))
>> [1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
>> "2016-01-28"
>>
>> I am wondering whether we should upload newer datasets to those data
>> packages. We have found that there are great differences in the results
>> of data analysis depending on which release date the datasets were taken
>> from.
>> More about this issue can be found here:
>> http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata
>>
>> The current state of the RTCGA family of R packages is listed below.
>>
>> RTCGA.clinical
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
>> - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>> - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0
>>
>> RTCGA.rnaseq
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
>> - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>> - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>
>> RTCGA.mutations
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
>> - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>> - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>
>> ---------------------------------------------------
>>
>> RTCGA.methylation
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
>> - BiocRelease: NOT YET AVAILABLE
>> - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.1
>>
>> RTCGA.CNV
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
>> - BiocRelease: NOT YET AVAILABLE
>> - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.5
>>
>> RTCGA.RPPA
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
>> - BiocRelease: NOT YET AVAILABLE
>> - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.6
>>
>> RTCGA.mRNA
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
>> - BiocRelease: NOT YET AVAILABLE
>> - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.3
>>
>> RTCGA.miRNASeq
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
>> - BiocRelease: NOT YET AVAILABLE
>> - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.4
>>
>> I think that having datasets from the newest snapshot date is vital for
>> data analysis, but I wouldn't like to create situations in which 2
>> separate analysts use RTCGA.clinical and get different results because
>> they used different data versions. That's why I have started versioning
>> data packages with the number that corresponds to the release date.
>>
>
> This isn't very helpful. There is only ever one version of
> 'RTCGA.clinical' available per Bioc version, so whether its version is
> 20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.
>
> Probably you want to include the TCGA release in the package _name_,
> 'RTCGA.clinical.20151101'. Probably you want to have multiple versions
> available at any one time.
>

Thanks for the comments. I hadn't considered making separate packages for
separate data releases.

>
> I don't think the experiment data archive is the best solution for
> distributing large collections of curated data. It places a burden on our
> mirrors to sync the repository and on the svn repository to store it. The
> packages are built twice weekly even though the data is very static and in
> your case based on unchanging base R data structures. The data are not very
> 'granular', even though you've done a good job of making the individual
> data sets accessible, so a user interested in ovarian cancers, say, would
> need to download all data anyway.
>
> Instead I think that these should be ExperimentHub resources. How to add
> resources is described in the vignette to the companion package
> ExperimentHubData
>
> http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html
>
> The data would be stored in Amazon S3 so globally accessible; it would not
> be under version control. The ExperimentHub / AnnotationHub cache would
> manage local versions, rather than R's package system.
>
> ExperimentHub will be back in active development, including addition of
> new resources, immediately after our next release, May 4, so the timing is
> fairly good.
>

Thanks for letting me know. I wasn't aware of such a solution. I'll take a
closer look at ExperimentHub.

>
> I think it is also worthwhile to discuss how you have chosen to represent
> each of the data types, for instance the RNAseq data as a samples x genes
> data.frame whereas the Bioconductor convention would store it primarily as
> a genes x sample matrix embedded in a SummarizedExperiment (or at least
> make it available to the user in that form; there are definitely advantages
> to keeping the serialized instance as simple as possible).
>

I've been informed about the Bioconductor structures. There is an additional
function, RTCGA::convertTCGA (in devel), that transposes expression data sets
(rnaseq, miRNASeq, mRNA, methylation, etc.) and embeds them in an
ExpressionSet:
https://github.com/RTCGA/RTCGA/blob/master/R/convertTCGA.R#L116-L122

Marcin Kosiński, RTCGA

> Martin Morgan
> Bioconductor
>
>> What do you think about such an issue? You can post advice here or on our
>> issue list: https://github.com/RTCGA/RTCGA/issues
>>
>> Thanks for comments,
>> Marcin
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel