my unfiltered reaction is to keep it in chipseq -- it would be nice for GenomicRanges to become quite stable and highly generic. some subclassing of GRanges will doubtless go on, but when the target use case is ChIP-seq analysis, the fact that chipseq has some analysis tools should not prevent it from being the incubator for more general structure designs that do not address these specific analysis approaches.
if we find that this inhibits reuse we can take some other approach. with relatively mature focused resource importation facilities now available there should be no inhibition.

On Fri, Apr 2, 2010 at 11:43 AM, Michael Lawrence <[email protected]> wrote:

> I've recently taken over the maintenance/development of the chipseq package
> and have plans for a lot of refactoring, including some new formal classes
> for ChIP-seq data. I'm wondering though if 'chipseq' is the best place,
> given that it also includes some specific analytical methods. That's not a
> huge deal, but might GenomicRanges be the place for these high-level
> structures?
>
> On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <[email protected]> wrote:
>
>> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <[email protected]> wrote:
>>
>>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:
>>>
>>>> To get a bit more concrete regarding these notions, the leeBamViews
>>>> package is in the experimental data archive, a VERY rudimentary
>>>> illustration of a workflow rooted in BAM archive files through region
>>>> specification and read counting.
>>>> For the very latest checkin, after running
>>>>
>>>>   example(bs1)
>>>>
>>>> we have an ad hoc tabulation of read counts:
>>>>
>>>> bs1> tabulateReads(bs1, "+")
>>>>          intv1  intv2
>>>> start   861250 863000
>>>> end     862750 864000
>>>> isowt.5   3673   2692
>>>> isowt.6   3770   2650
>>>> rlp.5     1532   1045
>>>> rlp.6     1567   1139
>>>> ssr.1     4304   3052
>>>> ssr.2     4627   3381
>>>> xrn.1     2841   1693
>>>> xrn.2     3477   2197
>>>>
>>>> or, by setting as.GRanges, a GRanges-based representation
>>>>
>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>>>> GRanges with 2 ranges and 9 elementMetadata values
>>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>>>     <integer> <integer> <integer> <integer> <integer> <integer>
>>>> [1]      1532      1567      4304      4627      2841      3477
>>>> [2]      1045      1139      3052      3381      1693      2197
>>>>
>>>> seqlengths
>>>>  Scchr13
>>>>       NA
>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>>>> > metadata(OO)
>>>> list()
>>>>
>>>> It seems that we would want more structure in a metadata component to
>>>> get closer to the values of ExpressionSet discipline. We would also want
>>>> some accommodation of this kind of representation in the downstream
>>>> packages like edgeR, DEseq.
>>>>
>>> The actual 'metadata' slot was meant to be general, in order to
>>> accommodate all needs. If a particular type of data requires a certain
>>> structure, then additional formal classes may be necessary. For example,
>>> gene expression RNA-seq may want a featureData equivalent annotating each
>>> transcript, whereas with ChIP-seq data, that sort of structure would make
>>> less sense, short of some additional assumptions.
>>>
>> I agree completely. Our task is to think/experiment about how to suitably
>> specialize these structures for most effective downstream use.
>> Reuse by multiple downstream toolchains would be great.
>>
>>> Michael
>>>
>>> > sessionInfo()
>>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>>>> x86_64-apple-darwin10.2.0
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
>>>> [10] digest_0.4.1
>>>>
>>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>>>>
>>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>>>> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <[email protected]> wrote:
>>>>> >
>>>>> >> Following a recent thread, I also have found convenient to store
>>>>> >> nextgen data as RangedData instead of ShortRead objects. They require
>>>>> >> far less memory and make feasible working with several samples at the
>>>>> >> same time (in my 8Gb RAM desktop I can load 2 ShortRead objects at the
>>>>> >> most, with RangedData I haven't struck the upper limit yet).
>>>>> >>
>>>>> >> I am thinking about taking this idea a step forward: RangedDataList
>>>>> >> allows storing info from several samples (e.g. IP and control) in a
>>>>> >> single object. The only problem is RangedDataList does not store
>>>>> >> information about the samples, e.g. the phenoData we're used to in
>>>>> >> ExpressionSet objects. My idea is to define something like a
>>>>> >> "SequenceSet" class, which would contain a RangedDataList with the
>>>>> >> ranges, a phenoData with sample information, and possibly also
>>>>> >> information about the experiment (e.g. with the MIAME analog for
>>>>> >> sequencing, MIASEQE).
>>>>> >>
>>>>> >> The thing is I don't want to re-invent the wheel.
>>>>> >> I haven't seen that this is implemented yet, but is someone working
>>>>> >> on it? Any criticism/ideas?
>>>>> >>
>>>>> > RangedDataList already supports this. See the 'elementMetadata' and
>>>>> > 'metadata' slots in the Sequence class.
>>>>>
>>>>> Hi David et al.,
>>>>>
>>>>> I've also found the elementMetadata slot excellent for this purpose.
>>>>> The ShortRead data objects retain sequence and quality information, this
>>>>> information is often not needed after a certain point in the analysis.
>>>>>
>>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>>>>> GRanges class that is more fastidious about strand information (maybe a
>>>>> plus?) and conforms more to an 'I am a rectangular data structure' world
>>>>> view. Also the GappedAlignments class for efficiently representing large
>>>>> numbers of reads.
>>>>>
>>>>> Martin
>>>>>
>>>>> > Michael
>>>>> >
>>>>> >> Best,
>>>>> >>
>>>>> >> David
>>>>> >>
>>>>> >> --
>>>>> >> David Rossell, PhD
>>>>> >> Manager, Bioinformatics and Biostatistics unit
>>>>> >> IRB Barcelona
>>>>> >> Tel (+34) 93 402 0217
>>>>> >> Fax (+34) 93 402 0257
>>>>> >> http://www.irbbarcelona.org/bioinformatics
>>>>> >>
>>>>> >> _______________________________________________
>>>>> >> Bioc-sig-sequencing mailing list
>>>>> >> [email protected]
>>>>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>
>>>>> --
>>>>> Martin Morgan
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
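
A minimal sketch of the idea discussed in this thread: keep per-sample read counts as metadata columns on a GRanges and put a small sample table in its 'metadata' list, so that something like phenoData travels with the ranges. The counts and intervals are copied from the tabulateReads() output quoted in the thread. Everything else is an assumption for illustration: the 'samples' element name, the 'group' column (derived here from the sample-name prefix), and the use of mcols() and metadata() from a current GenomicRanges release rather than the 0.1.3 version in the sessionInfo. This is not the leeBamViews implementation.

```r
library(GenomicRanges)

# Read counts copied from the tabulateReads(bs1, "+") output in the thread.
counts <- rbind(
  intv1 = c(isowt.5 = 3673L, isowt.6 = 3770L, rlp.5 = 1532L, rlp.6 = 1567L,
            ssr.1 = 4304L, ssr.2 = 4627L, xrn.1 = 2841L, xrn.2 = 3477L),
  intv2 = c(2692L, 2650L, 1045L, 1139L, 3052L, 3381L, 1693L, 2197L)
)

# Two intervals on Scchr13, "+" strand, with the interval name as a column.
gr <- GRanges(
  seqnames = "Scchr13",
  ranges = IRanges(start = c(861250L, 863000L), end = c(862750L, 864000L)),
  strand = "+",
  name = rownames(counts)
)

# One integer metadata column per sample.
for (s in colnames(counts)) {
  mcols(gr)[[s]] <- unname(counts[, s])
}

# Hypothetical sample table: 'group' is simply the sample-name prefix.
sample_info <- data.frame(
  sample = colnames(counts),
  group = sub("\\.[0-9]+$", "", colnames(counts)),
  stringsAsFactors = FALSE
)
metadata(gr) <- list(samples = sample_info)

# Consistency check: every count column has a row in the sample table.
count_cols <- setdiff(colnames(mcols(gr)), "name")
stopifnot(identical(count_cols, metadata(gr)$samples$sample))

# Recover a features-by-samples matrix for downstream count-based tools.
recovered <- as.matrix(as.data.frame(mcols(gr))[, count_cols, drop = FALSE])
rownames(recovered) <- mcols(gr)$name
```

Because 'metadata' is an unconstrained list, nothing enforces the agreement between count columns and sample rows beyond the explicit check above; a formal class with a validity method, as suggested in the thread, would be the place to enforce it.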
