Re: [Bioc-sig-seq] ExpressionSet alikes for next-gen data

Vincent Carey Fri, 02 Apr 2010 08:31:56 -0700

On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <[email protected]> wrote:


>
>
> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:
>
>> To get a bit more concrete regarding these notions, the leeBamViews
>> package in the experimental data archive gives a VERY rudimentary
>> illustration of a workflow that starts from BAM archive files and proceeds
>> through region specification and read counting.  With the very latest
>> checkin, after running
>>
>> example(bs1)
>>
>> we have an ad hoc tabulation of read counts:
>>
>> bs1> tabulateReads(bs1, "+")
>>          intv1  intv2
>> start   861250 863000
>> end     862750 864000
>> isowt.5   3673   2692
>> isowt.6   3770   2650
>> rlp.5     1532   1045
>> rlp.6     1567   1139
>> ssr.1     4304   3052
>> ssr.2     4627   3381
>> xrn.1     2841   1693
>> xrn.2     3477   2197
>>
>> or, by setting as.GRanges=TRUE, a GRanges-based representation:
>>
>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>> GRanges with 2 ranges and 9 elementMetadata values
>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>     <integer> <integer> <integer> <integer> <integer> <integer>
>> [1]      1532      1567      4304      4627      2841      3477
>> [2]      1045      1139      3052      3381      1693      2197
>>
>> seqlengths
>> Scchr13
>>      NA
>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>> > metadata(OO)
>> list()
>>
>> It seems that we would want more structure in a metadata component to get
>> closer to the values of ExpressionSet discipline.  We would also want some
>> accommodation of this kind of representation in downstream packages like
>> edgeR and DESeq.
>>
>>
> The actual 'metadata' slot was meant to be general, in order to accommodate
> all needs. If a particular type of data requires a certain structure, then
> additional formal classes may be necessary.  For example, gene expression
> RNA-seq may want a featureData equivalent annotating each transcript,
> whereas with ChIP-seq data, that sort of structure would make less sense,
> short of some additional assumptions.
>

I agree completely.  Our task is to think about, and experiment with, how
best to specialize these structures for effective downstream use.  Reuse by
multiple downstream toolchains would be great.
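
As a minimal sketch of what one such specialization might look like (an
illustration only, not code from leeBamViews or any Bioconductor package),
the base-R S4 class below keeps a region-by-sample count matrix in step with
per-region and per-sample annotation, in the spirit of ExpressionSet.  The
class name CountSet, its slot names, the helper subsetSamples, and the
`group` column (guessed from the sample-name prefixes) are all hypothetical.
The counts are the ones tabulated above.

```r
library(methods)

# Hypothetical container: counts (regions x samples) plus matching annotation.
setClass("CountSet",
  slots = c(counts = "matrix",          # one row per region, one column per sample
            regionData = "data.frame",  # one row per region
            sampleData = "data.frame"), # one row per sample
  validity = function(object) {
    msg <- character()
    if (!identical(rownames(object@counts), rownames(object@regionData)))
      msg <- c(msg, "row names of counts must match row names of regionData")
    if (!identical(colnames(object@counts), rownames(object@sampleData)))
      msg <- c(msg, "column names of counts must match row names of sampleData")
    if (length(msg)) msg else TRUE
  })

samples <- c("isowt.5", "isowt.6", "rlp.5", "rlp.6",
             "ssr.1", "ssr.2", "xrn.1", "xrn.2")
counts <- rbind(intv1 = c(3673, 3770, 1532, 1567, 4304, 4627, 2841, 3477),
                intv2 = c(2692, 2650, 1045, 1139, 3052, 3381, 1693, 2197))
colnames(counts) <- samples

regionData <- data.frame(seqname = "Scchr13",
                         start = c(861250, 863000),
                         end = c(862750, 864000),
                         strand = "+",
                         row.names = rownames(counts))

# Assumption: the part of each sample name before the dot labels its group.
sampleData <- data.frame(group = sub("\\..*$", "", samples),
                         row.names = samples)

cs <- new("CountSet", counts = counts,
          regionData = regionData, sampleData = sampleData)

# Subset samples while keeping counts and sampleData aligned.
subsetSamples <- function(x, j) {
  new("CountSet",
      counts = x@counts[, j, drop = FALSE],
      regionData = x@regionData,
      sampleData = x@sampleData[j, , drop = FALSE])
}
wt <- subsetSamples(cs, c("isowt.5", "isowt.6"))

# Annotation that does not match the counts is rejected at construction time.
bad <- tryCatch(
  new("CountSet", counts = counts, regionData = regionData,
      sampleData = sampleData[-1, , drop = FALSE]),
  error = function(e) e)
print(conditionMessage(bad))
```

The point of the validity method is the "discipline" mentioned above: the
annotation cannot silently drift out of sync with the counts, which a bare
list() in the metadata slot does not guarantee.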


> Michael
>
>> > sessionInfo()
>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>> x86_64-apple-darwin10.2.0
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>> [8] base
>>
>> other attached packages:
>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
>> [10] digest_0.4.1
>>
>>
>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>>
>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <[email protected]> wrote:
>>> >
>>> >> Following a recent thread, I have also found it convenient to store
>>> >> next-gen data as RangedData instead of ShortRead objects. They require
>>> >> far less memory and make it feasible to work with several samples at
>>> >> the same time (on my 8 GB RAM desktop I can load 2 ShortRead objects at
>>> >> most; with RangedData I haven't hit the upper limit yet).
>>> >>
>>> >> I am thinking about taking this idea a step further: RangedDataList
>>> >> allows storing info from several samples (e.g. IP and control) in a
>>> >> single object. The only problem is that RangedDataList does not store
>>> >> information about the samples, e.g. the phenoData we're used to in
>>> >> ExpressionSet objects. My idea is to define something like a
>>> >> "SequenceSet" class, which would contain a RangedDataList with the
>>> >> ranges, a phenoData with sample information, and possibly also
>>> >> information about the experiment (e.g. with the MIAME analog for
>>> >> sequencing, MIASEQE).
>>> >>
>>> >> The thing is I don't want to re-invent the wheel. I haven't seen this
>>> >> implemented yet, but is someone working on it? Any criticism or ideas?
>>> >>
>>> >>
>>> > RangedDataList already supports this. See the 'elementMetadata' and
>>> > 'metadata' slots in the Sequence class.
>>>
>>> Hi David et al.,
>>>
>>> I've also found the elementMetadata slot excellent for this purpose.
>>> The ShortRead data objects retain sequence and quality information, but
>>> this information is often not needed after a certain point in the analysis.
>>>
>>> I wanted to point to the GenomicRanges package in Bioc-devel, which has a
>>> GRanges class that is more fastidious about strand information (maybe a
>>> plus?) and conforms more to an 'I am a rectangular data structure' world
>>> view. It also has the GappedAlignments class for efficiently representing
>>> large numbers of reads.
>>>
>>> Martin
>>>
>>> >
>>> > Michael
>>> >
>>> >
>>> >
>>> >> Best,
>>> >>
>>> >> David
>>> >>
>>> >> --
>>> >> David Rossell, PhD
>>> >> Manager, Bioinformatics and Biostatistics unit
>>> >> IRB Barcelona
>>> >> Tel (+34) 93 402 0217
>>> >> Fax (+34) 93 402 0257
>>> >> http://www.irbbarcelona.org/bioinformatics
>>> >>
>>> >> _______________________________________________
>>> >> Bioc-sig-sequencing mailing list
>>> >> [email protected]
>>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> >>
>>> >
>>>
>>>
>>> --
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>>
>>
>>
>
