On Fri, Apr 2, 2010 at 12:55 PM, Martin Morgan <[email protected]> wrote:
> On 04/02/2010 08:57 AM, Vincent Carey wrote:
> > my unfiltered reaction is to keep it in chipseq -- it would be nice for
> > GenomicRanges to become quite stable and highly generic. some subclassing
> > of GRanges will doubtless go on, but when the target use case is ChIP-seq
> > analysis, the fact that chipseq has some analysis tools should not prevent
> > it from being the incubator for more general structure designs that do not
> > address these specific analysis approaches.
> >
> > if we find that this inhibits reuse we can take some other approach. with
> > relatively mature focused resource importation facilities now available
> > there should be no inhibition.
>
> Not sure where to insert my 2 cents into this thread, but wanted to note
> that ExpressionSet doesn't really provide much guidance about what goes
> in to phenoData or featureData -- these are tabula rasa for the user to
> populate at will. This seems to have worked well enough; it is flexible
> and there has not been a proliferation of classes for the annotation of
> samples or features for the user or developer to master.
>

It is true that contents of these components are ad libitum, but their structures are suggestive guides. Getting pData(), varMetadata(), experimentData() to work and to perform useful tasks for expression archives has been accomplished, but the latter two are underutilized. The lesson to be drawn is perhaps clear enough. I believe something richer than a list will eventually be desired as a metadata container for the kinds of information we are discussing. Until we have a clear sense of what enrichments will pay off, it can stay as a list. Let's not forget that tight binding of sample-level data to assay data in the expression array domain has paid off. We may be taking too lax a view of this in sequencing because there are so few samples on hand in many applications.
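
To make the idea of tight binding concrete, here is a minimal base R sketch; it is not how Biobase implements eSet. It reuses the read counts from the tabulateReads() output quoted further down in this thread. The countSet constructor, the subsetSamples helper, and the group column (taken from the prefix of each sample name) are hypothetical names introduced only for illustration.

```r
# Read counts copied from the tabulateReads() output quoted later in this
# thread: rows are intervals, columns are samples (matrix fills by column).
counts <- matrix(
  c(3673, 2692,  3770, 2650,  1532, 1045,  1567, 1139,
    4304, 3052,  4627, 3381,  2841, 1693,  3477, 2197),
  nrow = 2,
  dimnames = list(c("intv1", "intv2"),
                  c("isowt.5", "isowt.6", "rlp.5", "rlp.6",
                    "ssr.1", "ssr.2", "xrn.1", "xrn.2"))
)

# Sample-level table, one row per sample, keyed by sample name.
# 'group' is just the sample-name prefix (an assumption made for illustration).
samples <- data.frame(
  group = sub("\\..*$", "", colnames(counts)),
  row.names = colnames(counts),
  stringsAsFactors = FALSE
)

# Hypothetical container: refuses to be built unless the columns of the
# assay data and the rows of the sample data agree in name and order.
countSet <- function(counts, samples) {
  stopifnot(identical(colnames(counts), rownames(samples)))
  structure(list(counts = counts, samples = samples), class = "countSet")
}

# Hypothetical helper: subsetting by sample applies the same index to both
# components, so they cannot drift apart.
subsetSamples <- function(x, keep) {
  countSet(x$counts[, keep, drop = FALSE],
           x$samples[keep, , drop = FALSE])
}

cs <- countSet(counts, samples)
wt <- subsetSamples(cs, cs$samples$group == "isowt")
```

The point of the sketch is only that the consistency check lives in the constructor, so every derived object is validated again. A bare list in a metadata slot gives no such guarantee.
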

Clearly the count data that I illustrated just a little while ago, in a GRanges container, could be managed in an eSet extension -- in fact DESeq has such an extension. This gives us a lot of mature functionality regarding metadata and parallel handling of metadata and assay data at the sample level, for free. We should consider how GRanges and eSet should interact, at least conceptually, if we want to be sure not to sacrifice investments in the eSet design.

> Martin
>
> >
> > On Fri, Apr 2, 2010 at 11:43 AM, Michael Lawrence <[email protected]> wrote:
> >
> >> I've recently taken over the maintenance/development of the chipseq
> >> package and have plans for a lot of refactoring, including some new
> >> formal classes for ChIP-seq data. I'm wondering though if 'chipseq' is
> >> the best place, given that it also includes some specific analytical
> >> methods. That's not a huge deal, but might GenomicRanges be the place
> >> for these high-level structures?
> >>
> >>
> >> On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <[email protected]> wrote:
> >>>
> >>>>
> >>>>
> >>>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:
> >>>>
> >>>>> To get a bit more concrete regarding these notions, the leeBamViews
> >>>>> package is in the experimental data archive, a VERY rudimentary
> >>>>> illustration of a workflow rooted in BAM archive files through region
> >>>>> specification and read counting.
> >>>>> For the very latest checkin, after running
> >>>>>
> >>>>> example(bs1)
> >>>>>
> >>>>> we have an ad hoc tabulation of read counts:
> >>>>>
> >>>>> bs1> tabulateReads(bs1, "+")
> >>>>>          intv1  intv2
> >>>>> start   861250 863000
> >>>>> end     862750 864000
> >>>>> isowt.5   3673   2692
> >>>>> isowt.6   3770   2650
> >>>>> rlp.5     1532   1045
> >>>>> rlp.6     1567   1139
> >>>>> ssr.1     4304   3052
> >>>>> ssr.2     4627   3381
> >>>>> xrn.1     2841   1693
> >>>>> xrn.2     3477   2197
> >>>>>
> >>>>> or, by setting as.GRanges, a GRanges-based representation
> >>>>>
> >>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
> >>>>> GRanges with 2 ranges and 9 elementMetadata values
> >>>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
> >>>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
> >>>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
> >>>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
> >>>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
> >>>>>     <integer> <integer> <integer> <integer> <integer> <integer>
> >>>>> [1]      1532      1567      4304      4627      2841      3477
> >>>>> [2]      1045      1139      3052      3381      1693      2197
> >>>>>
> >>>>> seqlengths
> >>>>>  Scchr13
> >>>>>       NA
> >>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
> >>>>> > metadata(OO)
> >>>>> list()
> >>>>>
> >>>>> It seems that we would want more structure in a metadata component to
> >>>>> get closer to the values of ExpressionSet discipline. We would also
> >>>>> want some accommodation of this kind of representation in the
> >>>>> downstream packages like edgeR, DEseq.
> >>>>>
> >>>>>
> >>>> The actual 'metadata' slot was meant to be general, in order to
> >>>> accommodate all needs. If a particular type of data requires a certain
> >>>> structure, then additional formal classes may be necessary. For example,
> >>>> gene expression RNA-seq may want a featureData equivalent annotating
> >>>> each transcript, whereas with ChIP-seq data, that sort of structure
> >>>> would make less sense, short of some additional assumptions.
> >>>>
> >>>
> >>> I agree completely. Our task is to think/experiment about how to
> >>> suitably specialize these structures for most effective downstream use.
> >>> Reuse by multiple downstream toolchains would be great.
> >>>
> >>>
> >>
> >>>> Michael
> >>>>
> >>>>> sessionInfo()
> >>>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
> >>>>> x86_64-apple-darwin10.2.0
> >>>>>
> >>>>> locale:
> >>>>> [1] C
> >>>>>
> >>>>> attached base packages:
> >>>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
> >>>>> [8] base
> >>>>>
> >>>>> other attached packages:
> >>>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
> >>>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
> >>>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
> >>>>> [10] digest_0.4.1
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
> >>>>>
> >>>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
> >>>>>>> On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Following a recent thread, I also have found convenient to store
> >>>>>>>> nextgen data as RangedData instead of ShortRead objects. They
> >>>>>>>> require far less memory and make feasible working with several
> >>>>>>>> samples at the same time (in my 8Gb RAM desktop I can load 2
> >>>>>>>> ShortRead objects at the most, with RangedData I haven't struck
> >>>>>>>> the upper limit yet).
> >>>>>>>>
> >>>>>>>> I am thinking about taking this idea a step forward:
> >>>>>>>> RangedDataList allows storing info from several samples (e.g. IP
> >>>>>>>> and control) in a single object. The only problem is
> >>>>>>>> RangedDataList does not store information about the samples, e.g.
> >>>>>>>> the phenoData we're used to in ExpressionSet objects.
> >>>>>>>> My idea is to define something like a "SequenceSet" class, which
> >>>>>>>> would contain a RangedDataList with the ranges, a phenoData with
> >>>>>>>> sample information, and possibly also information about the
> >>>>>>>> experiment (e.g. with the MIAME analog for sequencing, MIASEQE).
> >>>>>>>>
> >>>>>>>> The thing is I don't want to re-invent the wheel. I haven't seen
> >>>>>>>> that this is implemented yet, but is someone working on it? Any
> >>>>>>>> criticism/ideas?
> >>>>>>>>
> >>>>>>>>
> >>>>>>> RangedDataList already supports this. See the 'elementMetadata' and
> >>>>>>> 'metadata' slots in the Sequence class.
> >>>>>>
> >>>>>> Hi David et al.,
> >>>>>>
> >>>>>> I've also found the elementMetadata slot excellent for this purpose.
> >>>>>> The ShortRead data objects retain sequence and quality information,
> >>>>>> this information is often not needed after a certain point in the
> >>>>>> analysis.
> >>>>>>
> >>>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has
> >>>>>> a GRanges class that is more fastidious about strand information
> >>>>>> (maybe a plus?) and conforms more to an 'I am a rectangular data
> >>>>>> structure' world view. Also the GappedAlignments class for
> >>>>>> efficiently representing large numbers of reads.
> >>>>>>
> >>>>>> Martin
> >>>>>>
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> David Rossell, PhD
> >>>>>>>> Manager, Bioinformatics and Biostatistics unit
> >>>>>>>> IRB Barcelona
> >>>>>>>> Tel (+34) 93 402 0217
> >>>>>>>> Fax (+34) 93 402 0257
> >>>>>>>> http://www.irbbarcelona.org/bioinformatics
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Bioc-sig-sequencing mailing list
> >>>>>>>> [email protected]
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>>>>>
> >>>>>> --
> >>>>>> Martin Morgan
> >>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>>>>> 1100 Fairview Ave. N.
> >>>>>> PO Box 19024 Seattle, WA 98109
> >>>>>>
> >>>>>> Location: Arnold Building M1 B861
> >>>>>> Phone: (206) 667-2793
