Re: [Bioc-sig-seq] ExpressionSet alikes for next-gen data

Michael Lawrence Fri, 02 Apr 2010 08:24:16 -0700

On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:


> To get a bit more concrete regarding these notions, the leeBamViews package
> is in the experimental data archive, a VERY rudimentary illustration of a
> workflow rooted in BAM archive files through region specification and read
> counting.  For the very latest checkin, after running
>
> example(bs1)
>
> we have an ad hoc tabulation of read counts:
>
> bs1> tabulateReads(bs1, "+")
>          intv1  intv2
> start   861250 863000
> end     862750 864000
> isowt.5   3673   2692
> isowt.6   3770   2650
> rlp.5     1532   1045
> rlp.6     1567   1139
> ssr.1     4304   3052
> ssr.2     4627   3381
> xrn.1     2841   1693
> xrn.2     3477   2197
>
> or, by setting as.GRanges, a GRanges-based representation
>
> > tabulateReads(bs1, "+", as.GRanges=TRUE)
> GRanges with 2 ranges and 9 elementMetadata values
>     seqnames           ranges strand |        name   isowt.5   isowt.6
>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>     <integer> <integer> <integer> <integer> <integer> <integer>
> [1]      1532      1567      4304      4627      2841      3477
> [2]      1045      1139      3052      3381      1693      2197
>
> seqlengths
> Scchr13
>      NA
> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
> > metadata(OO)
> list()
>
> It seems that we would want more structure in a metadata component to get
> closer to the values of ExpressionSet discipline.  We would also want some
> accommodation of this kind of representation in the downstream packages like
> edgeR, DESeq.
>
>
The actual 'metadata' slot was meant to be general, in order to accommodate
all needs. If a particular type of data requires a certain structure, then
additional formal classes may be necessary.  For example, gene expression
RNA-seq may want a featureData equivalent annotating each transcript,
whereas with ChIP-seq data, that sort of structure would make less sense,
short of some additional assumptions.
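
As a rough illustration of what such an additional formal class could look
like, here is a minimal sketch using only the base 'methods' package. The
class name SeqCountSet and its slots are hypothetical (they are not part of
GenomicRanges or Biobase), and the 'strain' sample annotation is an assumed
placeholder. The counts are the first two samples from the tabulation quoted
above.

```r
library(methods)

# Hypothetical class: a count matrix plus per-feature and per-sample
# annotation, loosely mirroring featureData/phenoData in ExpressionSet.
setClass("SeqCountSet",
         representation(counts = "matrix",
                        featureData = "data.frame",
                        phenoData = "data.frame"),
         validity = function(object) {
           msgs <- character()
           if (nrow(object@featureData) != nrow(object@counts))
             msgs <- c(msgs, "featureData needs one row per row of counts")
           if (nrow(object@phenoData) != ncol(object@counts))
             msgs <- c(msgs, "phenoData needs one row per column of counts")
           if (length(msgs)) msgs else TRUE
         })

# Rows are intervals, columns are samples (values copied from the
# tabulateReads(bs1, "+") output quoted above).
counts <- matrix(c(3673L, 2692L, 3770L, 2650L), nrow = 2,
                 dimnames = list(c("intv1", "intv2"),
                                 c("isowt.5", "isowt.6")))

# One row of annotation per interval.
features <- data.frame(seqname = "Scchr13",
                       start = c(861250L, 863000L),
                       end = c(862750L, 864000L),
                       row.names = rownames(counts))

# One row of annotation per sample; 'strain' is an assumed column.
samples <- data.frame(strain = c("isowt", "isowt"),
                      row.names = colnames(counts))

# new() runs the validity function, so mismatched dimensions fail here.
cs <- new("SeqCountSet", counts = counts,
          featureData = features, phenoData = samples)
```

The validity function is the only place where structure is enforced. A
ChIP-seq oriented class could drop featureData and keep only the sample
annotation.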

Michael

> sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
> x86_64-apple-darwin10.2.0
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  tools     utils     methods
> [8] base
>
> other attached packages:
>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
> [10] digest_0.4.1
>
>
> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>
>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <
>> > [email protected]> wrote:
>> >
>> >> Following a recent thread, I have also found it convenient to store
>> >> nextgen data as RangedData instead of ShortRead objects. They require
>> >> far less memory and make it feasible to work with several samples at
>> >> the same time (on my 8 GB RAM desktop I can load 2 ShortRead objects at
>> >> most; with RangedData I haven't hit the upper limit yet).
>> >>
>> >> I am thinking about taking this idea a step further: RangedDataList
>> >> allows storing info from several samples (e.g. IP and control) in a
>> >> single object. The only problem is that RangedDataList does not store
>> >> information about the samples, e.g. the phenoData we're used to in
>> >> ExpressionSet objects. My idea is to define something like a
>> >> "SequenceSet" class, which would contain a RangedDataList with the
>> >> ranges, a phenoData with sample information, and possibly also
>> >> information about the experiment (e.g. with the MIAME analog for
>> >> sequencing, MIASEQE).
>> >>
>> >> The thing is I don't want to re-invent the wheel. I haven't seen this
>> >> implemented yet, but is someone working on it? Any criticism/ideas?
>> >>
>> >>
>> > RangedDataList already supports this. See the 'elementMetadata' and
>> > 'metadata' slots in the Sequence class.
>>
>> Hi David et al.,
>>
>> I've also found the elementMetadata slot excellent for this purpose.
>> The ShortRead data objects retain sequence and quality information; this
>> information is often not needed after a certain point in the analysis.
>>
>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>> GRanges class that is more fastidious about strand information (maybe a
>> plus?) and conforms more to an 'I am a rectangular data structure' world
>> view. Also the GappedAlignments class for efficiently representing large
>> numbers of reads.
>>
>> Martin
>>
>> >
>> > Michael
>> >
>> >
>> >
>> >> Best,
>> >>
>> >> David
>> >>
>> >> --
>> >> David Rossell, PhD
>> >> Manager, Bioinformatics and Biostatistics unit
>> >> IRB Barcelona
>> >> Tel (+34) 93 402 0217
>> >> Fax (+34) 93 402 0257
>> >> http://www.irbbarcelona.org/bioinformatics
>> >>
>> >>        [[alternative HTML version deleted]]
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> [email protected]
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >>
>> >
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>>
>
>
