On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:

> To get a bit more concrete regarding these notions, the leeBamViews package
> is in the experimental data archive, a VERY rudimentary illustration of a
> workflow rooted in BAM archive files through region specification and read
> counting. For the very latest checkin, after running
>
> example(bs1)
>
> we have an ad hoc tabulation of read counts:
>
> bs1> tabulateReads(bs1, "+")
>          intv1  intv2
> start   861250 863000
> end     862750 864000
> isowt.5   3673   2692
> isowt.6   3770   2650
> rlp.5     1532   1045
> rlp.6     1567   1139
> ssr.1     4304   3052
> ssr.2     4627   3381
> xrn.1     2841   1693
> xrn.2     3477   2197
>
> or, by setting as.GRanges, a GRanges-based representation
>
> > tabulateReads(bs1, "+", as.GRanges=TRUE)
> GRanges with 2 ranges and 9 elementMetadata values
>     seqnames           ranges strand |        name   isowt.5   isowt.6
>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>     <integer> <integer> <integer> <integer> <integer> <integer>
> [1]      1532      1567      4304      4627      2841      3477
> [2]      1045      1139      3052      3381      1693      2197
>
> seqlengths
>  Scchr13
>       NA
> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
> > metadata(OO)
> list()
>
> It seems that we would want more structure in a metadata component to get
> closer to the values of ExpressionSet discipline. We would also want some
> accommodation of this kind of representation in the downstream packages like
> edgeR, DESeq.
>
>

The actual 'metadata' slot was meant to be general, in order to accommodate all needs. If a particular type of data requires a certain structure, then additional formal classes may be necessary. For example, gene expression RNA-seq may want a featureData equivalent annotating each transcript, whereas with ChIP-seq data, that sort of structure would make less sense, short of some additional assumptions.
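
To make the point about additional formal classes a bit more concrete, here is a minimal sketch in base R (methods package only). The class name ReadCountSet and its slot names are hypothetical; nothing like this is claimed to exist in IRanges, GenomicRanges, or Biobase. The counts are the first four sample columns of the tabulateReads(bs1, "+") output quoted above, and the 'group' column is simply derived from the sample names for illustration.

```r
library(methods)

# Hypothetical class, for illustration only: a general 'metadata' list plus
# the extra structure (featureData, phenoData) an RNA-seq count table may want.
setClass("ReadCountSet",
         slots = c(counts      = "matrix",      # features x samples
                   featureData = "data.frame",  # one row per feature
                   phenoData   = "data.frame",  # one row per sample
                   metadata    = "list"),       # free-form, like the general slot
         validity = function(object) {
           msg <- character()
           if (nrow(object@featureData) != nrow(object@counts))
             msg <- c(msg, "featureData needs one row per row of counts")
           if (nrow(object@phenoData) != ncol(object@counts))
             msg <- c(msg, "phenoData needs one row per column of counts")
           if (length(msg)) msg else TRUE
         })

# Counts copied from the quoted tabulateReads() table (filled column by column).
counts <- matrix(c(3673L, 2692L,
                   3770L, 2650L,
                   1532L, 1045L,
                   1567L, 1139L),
                 nrow = 2,
                 dimnames = list(c("intv1", "intv2"),
                                 c("isowt.5", "isowt.6", "rlp.5", "rlp.6")))

# Interval annotation taken from the same quoted output.
featureData <- data.frame(seqname = "Scchr13",
                          start   = c(861250L, 863000L),
                          end     = c(862750L, 864000L),
                          strand  = "+",
                          row.names = rownames(counts))

# Assumption: the sample-name prefix is treated as a grouping label.
phenoData <- data.frame(group = sub("\\.[0-9]+$", "", colnames(counts)),
                        row.names = colnames(counts))

rcs <- new("ReadCountSet", counts = counts, featureData = featureData,
           phenoData = phenoData, metadata = list(strand = "+"))
```

Calling new() with a phenoData that has the wrong number of rows fails the validity check, which is roughly the kind of discipline a free-form metadata list cannot provide on its own.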

Michael

> sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
> x86_64-apple-darwin10.2.0
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  tools     utils     methods
> [8] base
>
> other attached packages:
>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
> [10] digest_0.4.1
>
>
> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>
>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <[email protected]> wrote:
>> >
>> >> Following a recent thread, I have also found it convenient to store nextgen
>> >> data as RangedData instead of ShortRead objects. They require far less
>> >> memory and make it feasible to work with several samples at the same time
>> >> (on my 8 GB RAM desktop I can load 2 ShortRead objects at the most; with
>> >> RangedData I haven't hit the upper limit yet).
>> >>
>> >> I am thinking about taking this idea a step further: RangedDataList allows
>> >> storing info from several samples (e.g. IP and control) in a single object.
>> >> The only problem is that RangedDataList does not store information about
>> >> the samples, e.g. the phenoData we're used to in ExpressionSet objects. My
>> >> idea is to define something like a "SequenceSet" class, which would contain
>> >> a RangedDataList with the ranges, a phenoData with sample information, and
>> >> possibly also information about the experiment (e.g. with the MIAME analog
>> >> for sequencing, MIASEQE).
>> >>
>> >> The thing is I don't want to re-invent the wheel. I haven't seen this
>> >> implemented yet, but is someone working on it? Any criticism/ideas?
>> >>
>> >>
>> > RangedDataList already supports this. See the 'elementMetadata' and
>> > 'metadata' slots in the Sequence class.
>>
>> Hi David et al.,
>>
>> I've also found the elementMetadata slot excellent for this purpose.
>> The ShortRead data objects retain sequence and quality information; this
>> information is often not needed after a certain point in the analysis.
>>
>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>> GRanges class that is more fastidious about strand information (maybe a
>> plus?) and conforms more to an 'I am a rectangular data structure' world
>> view. Also the GappedAlignments class for efficiently representing large
>> numbers of reads.
>>
>> Martin
>>
>> >
>> > Michael
>> >
>> >> Best,
>> >>
>> >> David
>> >>
>> >> --
>> >> David Rossell, PhD
>> >> Manager, Bioinformatics and Biostatistics unit
>> >> IRB Barcelona
>> >> Tel (+34) 93 402 0217
>> >> Fax (+34) 93 402 0257
>> >> http://www.irbbarcelona.org/bioinformatics
>> >>
>> >> [[alternative HTML version deleted]]
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> [email protected]
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
