To make these notions a bit more concrete: the leeBamViews package, in the
experimental data archive, is a VERY rudimentary illustration of a workflow
that starts from BAM archive files and proceeds through region specification
to read counting. With the very latest check-in, after running
example(bs1)
we have an ad hoc tabulation of read counts:
bs1> tabulateReads(bs1, "+")
         intv1  intv2
start   861250 863000
end     862750 864000
isowt.5   3673   2692
isowt.6   3770   2650
rlp.5     1532   1045
rlp.6     1567   1139
ssr.1     4304   3052
ssr.2     4627   3381
xrn.1     2841   1693
xrn.2     3477   2197
or, by setting as.GRanges=TRUE, a GRanges-based representation:
> tabulateReads(bs1, "+", as.GRanges=TRUE)
GRanges with 2 ranges and 9 elementMetadata values
    seqnames           ranges strand |        name   isowt.5   isowt.6
       <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
[1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
[2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
        rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
    <integer> <integer> <integer> <integer> <integer> <integer>
[1]      1532      1567      4304      4627      2841      3477
[2]      1045      1139      3052      3381      1693      2197

seqlengths
 Scchr13
      NA
> tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
> metadata(OO)
list()
It seems that we would want more structure in the metadata component to get
closer to the values of the ExpressionSet discipline. We would also want some
accommodation of this kind of representation in downstream packages such as
edgeR and DESeq.
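
As a rough sketch of what "more structure" might look like (this is not
something leeBamViews currently provides), the counts shown above can be paired
with a sample-level table using base R only. The names countSet, samples and
regions are hypothetical, and the strain and replicate columns are simply
parsed from the sample labels, which is an assumption about what those labels
encode.

```r
# Read counts copied from the tabulateReads(bs1, "+") output above
counts <- rbind(
  intv1 = c(3673, 3770, 1532, 1567, 4304, 4627, 2841, 3477),
  intv2 = c(2692, 2650, 1045, 1139, 3052, 3381, 1693, 2197)
)
colnames(counts) <- c("isowt.5", "isowt.6", "rlp.5", "rlp.6",
                      "ssr.1", "ssr.2", "xrn.1", "xrn.2")

# Hypothetical sample table, playing the role of phenoData.
# ASSUMPTION: the label prefix is a strain and the suffix a replicate id.
samples <- data.frame(
  strain    = sub("\\.[0-9]+$", "", colnames(counts)),
  replicate = sub("^.*\\.", "", colnames(counts)),
  row.names = colnames(counts),
  stringsAsFactors = FALSE
)

# Region-level table, playing the role of featureData
# (coordinates copied from the same output)
regions <- data.frame(
  seqname = "Scchr13",
  start   = c(861250, 863000),
  end     = c(862750, 864000),
  strand  = "+",
  row.names = rownames(counts),
  stringsAsFactors = FALSE
)

# ExpressionSet-style discipline: sample rows must match count columns,
# and region rows must match count rows
stopifnot(identical(rownames(samples), colnames(counts)),
          identical(rownames(regions), rownames(counts)))

# Hypothetical container bundling the three coordinated pieces
countSet <- list(counts = counts, samples = samples, regions = regions)
```

With the pieces kept coordinated like this, subsetting by a sample attribute
is a one-liner, e.g. countSet$counts[, countSet$samples$strain == "xrn"].
Whether such information belongs in metadata(), in a new class, or elsewhere
is exactly the open question.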
> sessionInfo()
R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
x86_64-apple-darwin10.2.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices datasets  tools     utils     methods
[8] base

other attached packages:
 [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
 [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
 [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
[10] digest_0.4.1

On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <
> > [email protected]> wrote:
> >
> >> Following a recent thread, I have also found it convenient to store nextgen
> >> data as RangedData instead of ShortRead objects. They require far less
> >> memory and make it feasible to work with several samples at the same time
> >> (on my 8 GB RAM desktop I can load 2 ShortRead objects at most; with
> >> RangedData I haven't hit the upper limit yet).
> >>
> >> I am thinking about taking this idea a step further: RangedDataList allows
> >> storing info from several samples (e.g. IP and control) in a single object.
> >> The only problem is that RangedDataList does not store information about the
> >> samples, e.g. the phenoData we're used to in ExpressionSet objects. My idea
> >> is to define something like a "SequenceSet" class, which would contain a
> >> RangedDataList with the ranges, a phenoData with sample information, and
> >> possibly also information about the experiment (e.g. with the MIAME analog
> >> for sequencing, MIASEQE).
> >>
> >> The thing is I don't want to re-invent the wheel. I haven't seen this
> >> implemented yet, but is someone working on it? Any criticism or ideas?
> >>
> >>
> > RangedDataList already supports this. See the 'elementMetadata' and
> > 'metadata' slots in the Sequence class.
>
> Hi David et al.,
>
> I've also found the elementMetadata slot excellent for this purpose.
> The ShortRead data objects retain sequence and quality information, but this
> information is often not needed after a certain point in the analysis.
>
> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
> GRanges class that is more fastidious about strand information (maybe a
> plus?) and conforms more to an 'I am a rectangular data structure' world
> view. Also the GappedAlignments class for efficiently representing large
> numbers of reads.
>
> Martin
>
> >
> > Michael
> >
> >
> >
> >> Best,
> >>
> >> David
> >>
> >> --
> >> David Rossell, PhD
> >> Manager, Bioinformatics and Biostatistics unit
> >> IRB Barcelona
> >> Tel (+34) 93 402 0217
> >> Fax (+34) 93 402 0257
> >> http://www.irbbarcelona.org/bioinformatics
> >>
> >> _______________________________________________
> >> Bioc-sig-sequencing mailing list
> >> [email protected]
> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>