Re: [Bioc-sig-seq] ExpressionSet alikes for next-gen data

Vincent Carey Fri, 02 Apr 2010 10:28:14 -0700

On Fri, Apr 2, 2010 at 12:55 PM, Martin Morgan <[email protected]> wrote:


> On 04/02/2010 08:57 AM, Vincent Carey wrote:
> > my unfiltered reaction is to keep it in chipseq -- it would be nice for
> > GenomicRanges to become quite stable and highly generic.  some subclassing
> > of GRanges will doubtless go on, but when the target use case is ChIP-seq
> > analysis, the fact that chipseq has some analysis tools should not prevent
> > it from being the incubator for more general structure designs that do not
> > address these specific analysis approaches.
> >
> > if we find that this inhibits reuse we can take some other approach.  with
> > relatively mature focused resource importation facilities now available
> > there should be no inhibition.
>
> Not sure where to insert my 2 cents into this thread, but wanted to note
> that ExpressionSet doesn't really provide much guidance about what goes
> in to phenoData or featureData -- these are tabula rasa for the user to
> populate at will. This seems to have worked well enough; it is flexible
> and there has not been a proliferation of classes for the annotation of
> samples or features for the user or developer to master.
>

It is true that contents of these components are ad libitum, but their
structures are suggestive guides.  Getting pData(), varMetadata(),
experimentData() to work and to perform useful tasks for expression archives
has been accomplished, but the latter two are underutilized.  The lesson to
be drawn is perhaps clear enough.  I believe something richer than a list
will eventually be desired as a metadata container for the kinds of
information we are discussing.  Until we have a clear sense of what
enrichments will pay off, it can stay as a list.

Let's not forget that tight binding of sample-level data to assay data in
the expression array domain has paid off.  We may be taking too lax a view
of this in sequencing because there are so few samples on hand in many
applications.

Clearly the count data that I illustrated just a little while ago, in a
GRanges container, could be managed in an eSet extension -- in fact DESeq
has such an extension.  This gives us a lot of mature functionality
regarding metadata and parallel handling of metadata and assay data at the
sample level, for free.  We should consider how GRanges and eSet should
interact, at least conceptually, if we want to be sure not to sacrifice
investments in the eSet design.
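
As a rough sketch of what that interaction could look like, the counts
tabulated earlier in this thread can be placed in a plain ExpressionSet, with
the interval coordinates as featureData and sample annotation as phenoData.
The `strain` column and its labels are assumptions inferred from the sample
names, not something leeBamViews supplies, and this is only an illustration
of the binding, not a proposed design.

```r
library(Biobase)

# Read counts copied from the tabulateReads(bs1, "+") output in this thread:
# rows are intervals, columns are samples (filled column by column).
counts <- matrix(
  c(3673L, 2692L, 3770L, 2650L, 1532L, 1045L, 1567L, 1139L,
    4304L, 3052L, 4627L, 3381L, 2841L, 1693L, 3477L, 2197L),
  nrow = 2,
  dimnames = list(
    c("intv1", "intv2"),
    c("isowt.5", "isowt.6", "rlp.5", "rlp.6",
      "ssr.1", "ssr.2", "xrn.1", "xrn.2")))

# Sample-level data. ASSUMPTION: the strain labels are guessed from the
# sample names purely for illustration.
pd <- data.frame(
  strain = c("isowt", "isowt", "rlp", "rlp", "ssr", "ssr", "xrn", "xrn"),
  row.names = colnames(counts))

# Feature-level data: the genomic coordinates of each counted interval.
fd <- data.frame(
  seqname = "Scchr13",
  start = c(861250L, 863000L),
  end = c(862750L, 864000L),
  strand = "+",
  row.names = rownames(counts))

es <- ExpressionSet(
  assayData = counts,
  phenoData = AnnotatedDataFrame(pd),
  featureData = AnnotatedDataFrame(fd))

# Subsetting by a sample-level variable keeps counts and annotation in step.
ssr <- es[, es$strain == "ssr"]
exprs(ssr)
pData(ssr)
```

The range information here is reduced to ordinary featureData columns, which
is exactly the part a GRanges would represent more faithfully.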






> Martin
>
>
> >
> > On Fri, Apr 2, 2010 at 11:43 AM, Michael Lawrence <[email protected]> wrote:
> >
> >> I've recently taken over the maintenance/development of the chipseq package
> >> and have plans for a lot of refactoring, including some new formal classes
> >> for ChIP-seq data. I'm wondering though if 'chipseq' is the best place,
> >> given that it also includes some specific analytical methods. That's not a
> >> huge deal, but might GenomicRanges be the place for these high-level
> >> structures?
> >>
> >>
> >> On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <
> >>> [email protected]> wrote:
> >>>
> >>>>
> >>>>
> >>>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> To get a bit more concrete regarding these notions, the leeBamViews
> >>>>> package is in the experimental data archive, a VERY rudimentary illustration
> >>>>> of a workflow rooted in BAM archive files through region specification and
> >>>>> read counting.  For the very latest checkin, after running
> >>>>>
> >>>>> example(bs1)
> >>>>>
> >>>>> we have an ad hoc tabulation of read counts:
> >>>>>
> >>>>> bs1> tabulateReads(bs1, "+")
> >>>>>          intv1  intv2
> >>>>> start   861250 863000
> >>>>> end     862750 864000
> >>>>> isowt.5   3673   2692
> >>>>> isowt.6   3770   2650
> >>>>> rlp.5     1532   1045
> >>>>> rlp.6     1567   1139
> >>>>> ssr.1     4304   3052
> >>>>> ssr.2     4627   3381
> >>>>> xrn.1     2841   1693
> >>>>> xrn.2     3477   2197
> >>>>>
> >>>>> or, by setting as.GRanges, a GRanges-based representation
> >>>>>
> >>>>>> tabulateReads(bs1, "+", as.GRanges=TRUE)
> >>>>> GRanges with 2 ranges and 9 elementMetadata values
> >>>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
> >>>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
> >>>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
> >>>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
> >>>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
> >>>>>     <integer> <integer> <integer> <integer> <integer> <integer>
> >>>>> [1]      1532      1567      4304      4627      2841      3477
> >>>>> [2]      1045      1139      3052      3381      1693      2197
> >>>>>
> >>>>> seqlengths
> >>>>> Scchr13
> >>>>>      NA
> >>>>>> tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
> >>>>>> metadata(OO)
> >>>>> list()
> >>>>>
> >>>>> It seems that we would want more structure in a metadata component to
> >>>>> get closer to the values of ExpressionSet discipline.  We would also want
> >>>>> some accommodation of this kind of representation in the downstream packages
> >>>>> like edgeR, DESeq.
> >>>>>
> >>>>>
> >>>> The actual 'metadata' slot was meant to be general, in order to
> >>>> accommodate all needs. If a particular type of data requires a certain
> >>>> structure, then additional formal classes may be necessary.  For example,
> >>>> gene expression RNA-seq may want a featureData equivalent annotating each
> >>>> transcript, whereas with ChIP-seq data, that sort of structure would make
> >>>> less sense, short of some additional assumptions.
> >>>>
> >>>
> >>> I agree completely.  Our task is to think/experiment about how to suitably
> >>> specialize these structures for most effective downstream use.  Reuse by
> >>> multiple downstream toolchains would be great.
> >>>
> >>>
> >>
> >>>> Michael
> >>>>
> >>>>> sessionInfo()
> >>>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
> >>>>> x86_64-apple-darwin10.2.0
> >>>>>
> >>>>> locale:
> >>>>> [1] C
> >>>>>
> >>>>> attached base packages:
> >>>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
> >>>>> [8] base
> >>>>>
> >>>>> other attached packages:
> >>>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
> >>>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
> >>>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
> >>>>> [10] digest_0.4.1
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
> >>>>>
> >>>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
> >>>>>>> On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Following a recent thread, I have also found it convenient to store nextgen
> >>>>>>>> data as RangedData instead of ShortRead objects. They require far less
> >>>>>>>> memory and make it feasible to work with several samples at the same time (on
> >>>>>>>> my 8 GB RAM desktop I can load 2 ShortRead objects at the most; with
> >>>>>>>> RangedData I haven't struck the upper limit yet).
> >>>>>>>>
> >>>>>>>> I am thinking about taking this idea a step forward: RangedDataList allows
> >>>>>>>> storing info from several samples (e.g. IP and control) in a single object.
> >>>>>>>> The only problem is RangedDataList does not store information about the
> >>>>>>>> samples, e.g. the phenoData we're used to in ExpressionSet objects. My idea
> >>>>>>>> is to define something like a "SequenceSet" class, which would contain a
> >>>>>>>> RangedDataList with the ranges, a phenoData with sample information, and
> >>>>>>>> possibly also information about the experiment (e.g. with the MIAME analog
> >>>>>>>> for sequencing, MINSEQE).
> >>>>>>>>
> >>>>>>>> The thing is I don't want to re-invent the wheel. I haven't seen that this
> >>>>>>>> is implemented yet, but is someone working on it? Any criticism/ideas?
> >>>>>>>>
> >>>>>>>>
> >>>>>>> RangedDataList already supports this. See the 'elementMetadata' and
> >>>>>>> 'metadata' slots in the Sequence class.
> >>>>>>
> >>>>>> Hi David et al.,
> >>>>>>
> >>>>>> I've also found the elementMetadata slot excellent for this purpose.
> >>>>>> The ShortRead data objects retain sequence and quality information; this
> >>>>>> information is often not needed after a certain point in the analysis.
> >>>>>>
> >>>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
> >>>>>> GRanges class that is more fastidious about strand information (maybe a
> >>>>>> plus?) and conforms more to an 'I am a rectangular data structure' world
> >>>>>> view. Also the GappedAlignments class for efficiently representing large
> >>>>>> numbers of reads.
> >>>>>>
> >>>>>> Martin
> >>>>>>
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> David Rossell, PhD
> >>>>>>>> Manager, Bioinformatics and Biostatistics unit
> >>>>>>>> IRB Barcelona
> >>>>>>>> Tel (+34) 93 402 0217
> >>>>>>>> Fax (+34) 93 402 0257
> >>>>>>>> http://www.irbbarcelona.org/bioinformatics
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Bioc-sig-sequencing mailing list
> >>>>>>>> [email protected]
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Martin Morgan
> >>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>>>>> 1100 Fairview Ave. N.
> >>>>>> PO Box 19024 Seattle, WA 98109
> >>>>>>
> >>>>>> Location: Arnold Building M1 B861
> >>>>>> Phone: (206) 667-2793
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
>
