On 04/02/2010 08:57 AM, Vincent Carey wrote:
> my unfiltered reaction is to keep it in chipseq -- it would be nice for
> GenomicRanges to become quite stable and highly generic. some subclassing
> of GRanges will doubtless go on, but when the target use case is ChIP-seq
> analysis, the fact that chipseq has some analysis tools should not prevent
> it from being the incubator for more general structure designs that do not
> address these specific analysis approaches.
>
> if we find that this inhibits reuse we can take some other approach. with
> relatively mature focused resource importation facilities now available
> there should be no inhibition.

Not sure where to insert my 2 cents into this thread, but wanted to note
that ExpressionSet doesn't really provide much guidance about what goes
in to phenoData or featureData -- these are tabula rasa for the user to
populate at will. This seems to have worked well enough; it is flexible
and there has not been a proliferation of classes for the annotation of
samples or features for the user or developer to master.

Martin

> On Fri, Apr 2, 2010 at 11:43 AM, Michael Lawrence <[email protected]> wrote:
>
>> I've recently taken over the maintenance/development of the chipseq package
>> and have plans for a lot of refactoring, including some new formal classes
>> for ChIP-seq data. I'm wondering though if 'chipseq' is the best place,
>> given that it also includes some specific analytical methods. That's not a
>> huge deal, but might GenomicRanges be the place for these high-level
>> structures?
>>
>> On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <[email protected]> wrote:
>>
>>> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <[email protected]> wrote:
>>>
>>>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:
>>>>
>>>>> To get a bit more concrete regarding these notions, the leeBamViews
>>>>> package is in the experimental data archive, a VERY rudimentary
>>>>> illustration of a workflow rooted in BAM archive files through region
>>>>> specification and read counting.
>>>>> For the very latest checkin, after running
>>>>>
>>>>> example(bs1)
>>>>>
>>>>> we have an ad hoc tabulation of read counts:
>>>>>
>>>>> bs1> tabulateReads(bs1, "+")
>>>>>          intv1  intv2
>>>>> start   861250 863000
>>>>> end     862750 864000
>>>>> isowt.5   3673   2692
>>>>> isowt.6   3770   2650
>>>>> rlp.5     1532   1045
>>>>> rlp.6     1567   1139
>>>>> ssr.1     4304   3052
>>>>> ssr.2     4627   3381
>>>>> xrn.1     2841   1693
>>>>> xrn.2     3477   2197
>>>>>
>>>>> or, by setting as.GRanges, a GRanges-based representation
>>>>>
>>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>>>>> GRanges with 2 ranges and 9 elementMetadata values
>>>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>>>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>>>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>>>>     <integer> <integer> <integer> <integer> <integer> <integer>
>>>>> [1]      1532      1567      4304      4627      2841      3477
>>>>> [2]      1045      1139      3052      3381      1693      2197
>>>>>
>>>>> seqlengths
>>>>>  Scchr13
>>>>>       NA
>>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>>>>> > metadata(OO)
>>>>> list()
>>>>>
>>>>> It seems that we would want more structure in a metadata component to
>>>>> get closer to the values of ExpressionSet discipline. We would also want
>>>>> some accommodation of this kind of representation in the downstream
>>>>> packages like edgeR, DEseq.
>>>>>
>>>> The actual 'metadata' slot was meant to be general, in order to
>>>> accommodate all needs. If a particular type of data requires a certain
>>>> structure, then additional formal classes may be necessary. For example,
>>>> gene expression RNA-seq may want a featureData equivalent annotating each
>>>> transcript, whereas with ChIP-seq data, that sort of structure would make
>>>> less sense, short of some additional assumptions.
>>>>
>>>
>>> I agree completely.
>>> Our task is to think/experiment about how to suitably
>>> specialize these structures for most effective downstream use. Reuse by
>>> multiple downstream toolchains would be great.
>>>
>>>> Michael
>>>>
>>>>> sessionInfo()
>>>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>>>>> x86_64-apple-darwin10.2.0
>>>>>
>>>>> locale:
>>>>> [1] C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>>>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>>>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
>>>>> [10] digest_0.4.1
>>>>>
>>>>>
>>>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>>>>>
>>>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>>>>>> On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <[email protected]> wrote:
>>>>>>>
>>>>>>>> Following a recent thread, I also have found convenient to store nextgen
>>>>>>>> data as RangedData instead of ShortRead objects. They require far less
>>>>>>>> memory and make feasible working with several samples at the same time (in
>>>>>>>> my 8Gb RAM desktop I can load 2 ShortRead objects at the most, with
>>>>>>>> RangedData I haven't struck the upper limit yet).
>>>>>>>>
>>>>>>>> I am thinking about taking this idea a step forward: RangedDataList allows
>>>>>>>> storing info from several samples (e.g. IP and control) in a single object.
>>>>>>>> The only problem is RangedDataList does not store information about the
>>>>>>>> samples, e.g. the phenoData we're used to in ExpressionSet objects. My idea
>>>>>>>> is to define something like a "SequenceSet" class, which would contain a
>>>>>>>> RangedDataList with the ranges, a phenoData with sample information, and
>>>>>>>> possibly also information about the experiment (e.g.
>>>>>>>> with the MIAME analog for sequencing, MIASEQE).
>>>>>>>>
>>>>>>>> The thing is I don't want to re-invent the wheel. I haven't seen that this
>>>>>>>> is implemented yet, but is someone working on it? Any criticism/ideas?
>>>>>>>>
>>>>>>> RangedDataList already supports this. See the 'elementMetadata' and
>>>>>>> 'metadata' slots in the Sequence class.
>>>>>>
>>>>>> Hi David et al.,
>>>>>>
>>>>>> I've also found the elementMetadata slot excellent for this purpose.
>>>>>> The ShortRead data objects retain sequence and quality information, this
>>>>>> information is often not needed after a certain point in the analysis.
>>>>>>
>>>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>>>>>> GRanges class that is more fastidious about strand information (maybe a
>>>>>> plus?) and conforms more to an 'I am a rectangular data structure' world
>>>>>> view. Also the GappedAlignments class for efficiently representing large
>>>>>> numbers of reads.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> --
>>>>>>>> David Rossell, PhD
>>>>>>>> Manager, Bioinformatics and Biostatistics unit
>>>>>>>> IRB Barcelona
>>>>>>>> Tel (+34) 93 402 0217
>>>>>>>> Fax (+34) 93 402 0257
>>>>>>>> http://www.irbbarcelona.org/bioinformatics
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>>> [email protected]
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>
>>>>>> --
>>>>>> Martin Morgan
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
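
The "SequenceSet" idea floated in the thread (per-sample ranges plus ExpressionSet-style sample annotation) can be sketched roughly as below. This is only an illustration in base R: the class definition and the slot names ranges, phenoData and experimentData are hypothetical, plain data.frames stand in for RangedData and for Biobase's annotated sample data, and none of it is an existing Bioconductor API. The one rule it enforces is the ExpressionSet-like one: every sample with ranges must have a matching row of sample annotation.

```r
# Hypothetical sketch of a "SequenceSet"-like container (base R only).
# Slot names are assumptions made for illustration, not a Bioconductor API.
library(methods)

setClass("SequenceSet",
  representation(
    ranges         = "list",        # one data.frame of ranges per sample
    phenoData      = "data.frame",  # one row of annotation per sample
    experimentData = "list"         # free-form experiment description
  ),
  validity = function(object) {
    msgs <- character()
    if (length(object@ranges) != nrow(object@phenoData))
      msgs <- c(msgs, "need exactly one phenoData row per sample")
    if (!identical(names(object@ranges), rownames(object@phenoData)))
      msgs <- c(msgs, "sample names must match phenoData row names")
    if (length(msgs)) msgs else TRUE
  })

# Toy data: an IP sample with two ranges and a control with one.
ip   <- data.frame(space = "chr1", start = c(100L, 500L), end = c(135L, 535L))
ctrl <- data.frame(space = "chr1", start = 220L, end = 255L)

# Sample annotation is a free-form table; only the row names are constrained.
pd <- data.frame(treatment = c("IP", "control"),
                 row.names = c("ip", "ctrl"))

ss <- new("SequenceSet",
          ranges         = list(ip = ip, ctrl = ctrl),
          phenoData      = pd,
          experimentData = list(title = "toy example"))

# Number of ranges per sample, in the same order as the phenoData rows.
rangesPerSample <- sapply(ss@ranges, nrow)
```

In line with the point about ExpressionSet above, the sketch says nothing about which columns phenoData should hold; it only ties samples to annotation rows, so the user remains free to populate the table at will.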
