Re: [Bioc-sig-seq] ExpressionSet alikes for next-gen data

Vincent Carey Fri, 02 Apr 2010 07:56:04 -0700

To make these notions a bit more concrete: the leeBamViews package, now in
the experimental data archive, gives a VERY rudimentary illustration of a
workflow that starts from BAM archive files and proceeds through region
specification and read counting.  With the very latest checkin, after running


example(bs1)

we have an ad hoc tabulation of read counts:

bs1> tabulateReads(bs1, "+")
         intv1  intv2
start   861250 863000
end     862750 864000
isowt.5   3673   2692
isowt.6   3770   2650
rlp.5     1532   1045
rlp.6     1567   1139
ssr.1     4304   3052
ssr.2     4627   3381
xrn.1     2841   1693
xrn.2     3477   2197

or, by setting as.GRanges=TRUE, a GRanges-based representation:

> tabulateReads(bs1, "+", as.GRanges=TRUE)
GRanges with 2 ranges and 9 elementMetadata values
    seqnames           ranges strand |        name   isowt.5   isowt.6
       <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
[1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
[2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
        rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
    <integer> <integer> <integer> <integer> <integer> <integer>
[1]      1532      1567      4304      4627      2841      3477
[2]      1045      1139      3052      3381      1693      2197

seqlengths
Scchr13
     NA
> tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
> metadata(OO)
list()

It seems that we would want more structure in the metadata component to get
closer to the values of the ExpressionSet discipline.  We would also want
downstream packages like edgeR and DESeq to accommodate this kind of
representation.
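
As a rough illustration of what "ExpressionSet discipline" might mean here,
the following base-R sketch keeps the count table, the region annotation and a
sample-level table in one object, and refuses to build it unless their names
line up.  The counts are copied from the tabulateReads(bs1, "+") output above;
makeCountSet, subsetSamples and the "group" column (parsed from the sample
names) are hypothetical names invented for this sketch, not part of
leeBamViews or any Bioconductor package.

```r
# Counts copied from the tabulateReads(bs1, "+") output: regions x samples.
counts <- rbind(
  intv1 = c(isowt.5 = 3673, isowt.6 = 3770, rlp.5 = 1532, rlp.6 = 1567,
            ssr.1 = 4304, ssr.2 = 4627, xrn.1 = 2841, xrn.2 = 3477),
  intv2 = c(isowt.5 = 2692, isowt.6 = 2650, rlp.5 = 1045, rlp.6 = 1139,
            ssr.1 = 3052, ssr.2 = 3381, xrn.1 = 1693, xrn.2 = 2197)
)

# Region-level annotation, one row per row of 'counts'.
regions <- data.frame(
  seqname = "Scchr13",
  start = c(861250, 863000),
  end = c(862750, 864000),
  strand = "+",
  row.names = c("intv1", "intv2")
)

# Sample-level annotation, one row per column of 'counts'.
# ASSUMPTION: the prefix of each sample name is used as a group label.
samples <- data.frame(
  group = sub("\\..*$", "", colnames(counts)),
  row.names = colnames(counts)
)

# Hypothetical constructor: enforce that names agree across components.
makeCountSet <- function(counts, regions, samples) {
  stopifnot(
    identical(rownames(counts), rownames(regions)),
    identical(colnames(counts), rownames(samples))
  )
  list(counts = counts, regions = regions, samples = samples)
}

# Hypothetical helper: subset samples while keeping counts and
# sample annotation in step with each other.
subsetSamples <- function(cs, keep) {
  makeCountSet(cs$counts[, keep, drop = FALSE],
               cs$regions,
               cs$samples[keep, , drop = FALSE])
}

cs <- makeCountSet(counts, regions, samples)
wt <- subsetSamples(cs, cs$samples$group == "isowt")
```

The point of the sketch is only the coordination: subsetting goes through one
function, so the count columns and the sample table cannot drift apart, and
cs$counts remains a plain matrix of the sort count-based packages take as
input.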

> sessionInfo()
R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
x86_64-apple-darwin10.2.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices datasets  tools     utils     methods
[8] base

other attached packages:
 [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
 [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
 [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
[10] digest_0.4.1


On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:

> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <
> > [email protected]> wrote:
> >
> >> Following a recent thread, I have also found it convenient to store
> >> nextgen data as RangedData instead of ShortRead objects. They require
> >> far less memory and make it feasible to work with several samples at
> >> the same time (on my 8 GB RAM desktop I can load 2 ShortRead objects at
> >> most; with RangedData I haven't hit the upper limit yet).
> >>
> >> I am thinking about taking this idea a step further: RangedDataList
> >> allows storing info from several samples (e.g. IP and control) in a
> >> single object. The only problem is that RangedDataList does not store
> >> information about the samples, e.g. the phenoData we're used to in
> >> ExpressionSet objects. My idea is to define something like a
> >> "SequenceSet" class, which would contain a RangedDataList with the
> >> ranges, a phenoData with sample information, and possibly also
> >> information about the experiment (e.g. with the MIAME analog for
> >> sequencing, MIASEQE).
> >>
> >> The thing is I don't want to re-invent the wheel. I haven't seen this
> >> implemented yet, but is someone working on it? Any criticism/ideas?
> >>
> >>
> > RangedDataList already supports this. See the 'elementMetadata' and
> > 'metadata' slots in the Sequence class.
>
> Hi David et al.,
>
> I've also found the elementMetadata slot excellent for this purpose.
> The ShortRead data objects retain sequence and quality information, but
> this information is often not needed after a certain point in the analysis.
>
> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
> GRanges class that is more fastidious about strand information (maybe a
> plus?) and conforms more to an 'I am a rectangular data structure' world
> view. Also the GappedAlignments class for efficiently representing large
> numbers of reads.
>
> Martin
>
> >
> > Michael
> >
> >
> >
> >> Best,
> >>
> >> David
> >>
> >> --
> >> David Rossell, PhD
> >> Manager, Bioinformatics and Biostatistics unit
> >> IRB Barcelona
> >> Tel (+34) 93 402 0217
> >> Fax (+34) 93 402 0257
> >> http://www.irbbarcelona.org/bioinformatics
> >>
> >> _______________________________________________
> >> Bioc-sig-sequencing mailing list
> >> [email protected]
> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
