Re: [Bioc-sig-seq] ExpressionSet alikes for next-gen data

Vincent Carey Fri, 02 Apr 2010 08:58:21 -0700

my unfiltered reaction is to keep it in chipseq -- it would be nice for
GenomicRanges to become quite stable and highly generic.  some subclassing
of GRanges will doubtless go on, but when the target use case is ChIP-seq
analysis, the fact that chipseq has some analysis tools should not prevent
it from being the incubator for more general structure designs that do not
address these specific analysis approaches.


if we find that this inhibits reuse we can take some other approach.  with
relatively mature focused resource importation facilities now available
there should be no inhibition.
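
for concreteness, here is a minimal sketch in base R (methods package only) of
the kind of ExpressionSet-like discipline under discussion: a counts matrix
whose rows and columns are validated against region and sample annotation.
the class name RegionCountSet, its slots, the helper keepSamples, and the
'group' column (obtained just by stripping the numeric suffix from the sample
names) are hypothetical, for illustration only; none of this is the chipseq or
GenomicRanges API.  the counts are those from the tabulateReads(bs1, "+")
output quoted below.

```r
library(methods)

# Hypothetical container (not part of any Bioconductor package): read counts
# for regions x samples, plus region coordinates and sample annotation.
setClass("RegionCountSet",
         representation(counts  = "matrix",      # regions in rows, samples in columns
                        regions = "data.frame",  # one row per region
                        samples = "data.frame")) # one row per sample (phenoData analogue)

# Validity: annotation must line up with the dimnames of the counts matrix.
setValidity("RegionCountSet", function(object) {
    msg <- character()
    if (!identical(rownames(object@counts), rownames(object@regions)))
        msg <- c(msg, "rownames of 'counts' and 'regions' must match")
    if (!identical(colnames(object@counts), rownames(object@samples)))
        msg <- c(msg, "colnames of 'counts' must match rownames of 'samples'")
    if (length(msg)) msg else TRUE
})

# Counts copied from the tabulateReads(bs1, "+") output quoted in this thread.
sampleNames <- c("isowt.5", "isowt.6", "rlp.5", "rlp.6",
                 "ssr.1", "ssr.2", "xrn.1", "xrn.2")
counts <- matrix(c(3673L, 3770L, 1532L, 1567L, 4304L, 4627L, 2841L, 3477L,
                   2692L, 2650L, 1045L, 1139L, 3052L, 3381L, 1693L, 2197L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("intv1", "intv2"), sampleNames))
regions <- data.frame(seqname = "Scchr13",
                      start = c(861250L, 863000L),
                      end = c(862750L, 864000L),
                      strand = "+",
                      row.names = c("intv1", "intv2"),
                      stringsAsFactors = FALSE)
# Illustrative grouping, derived only by stripping the numeric suffix.
samples <- data.frame(group = sub("\\.[0-9]+$", "", sampleNames),
                      row.names = sampleNames,
                      stringsAsFactors = FALSE)

rcs <- new("RegionCountSet", counts = counts, regions = regions,
           samples = samples)

# Subsetting samples keeps counts and annotation in step.
keepSamples <- function(x, keep) {
    new("RegionCountSet",
        counts  = x@counts[, keep, drop = FALSE],
        regions = x@regions,
        samples = x@samples[keep, , drop = FALSE])
}
ssr <- keepSamples(rcs, rcs@samples$group == "ssr")
colnames(ssr@counts)
```

the point of the validity method is that counts and sample annotation cannot
drift apart silently: constructing an object whose sample table is missing a
row fails at new() rather than downstream.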

On Fri, Apr 2, 2010 at 11:43 AM, Michael Lawrence <[email protected]> wrote:

> I've recently taken over the maintenance/development of the chipseq package
> and have plans for a lot of refactoring, including some new formal classes
> for ChIP-seq data. I'm wondering though if 'chipseq' is the best place,
> given that it also includes some specific analytical methods. That's not a
> huge deal, but might GenomicRanges be the place for these high-level
> structures?
>
>
> On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <[email protected]> wrote:
>
>>
>>
>> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <[email protected]> wrote:
>>
>>>
>>>
>>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:
>>>
>>>> To get a bit more concrete regarding these notions, the leeBamViews
>>>> package is in the experimental data archive, a VERY rudimentary
>>>> illustration of a workflow rooted in BAM archive files through region
>>>> specification and read counting.  For the very latest checkin, after
>>>> running
>>>>
>>>> example(bs1)
>>>>
>>>> we have an ad hoc tabulation of read counts:
>>>>
>>>> bs1> tabulateReads(bs1, "+")
>>>>          intv1  intv2
>>>> start   861250 863000
>>>> end     862750 864000
>>>> isowt.5   3673   2692
>>>> isowt.6   3770   2650
>>>> rlp.5     1532   1045
>>>> rlp.6     1567   1139
>>>> ssr.1     4304   3052
>>>> ssr.2     4627   3381
>>>> xrn.1     2841   1693
>>>> xrn.2     3477   2197
>>>>
>>>> or, by setting as.GRanges, a GRanges-based representation
>>>>
>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>>>> GRanges with 2 ranges and 9 elementMetadata values
>>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>>>     <integer> <integer> <integer> <integer> <integer> <integer>
>>>> [1]      1532      1567      4304      4627      2841      3477
>>>> [2]      1045      1139      3052      3381      1693      2197
>>>>
>>>> seqlengths
>>>> Scchr13
>>>>      NA
>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>>>> > metadata(OO)
>>>> list()
>>>>
>>>> It seems that we would want more structure in a metadata component to
>>>> get closer to the values of ExpressionSet discipline.  We would also want
>>>> some accommodation of this kind of representation in downstream packages
>>>> like edgeR and DESeq.
>>>>
>>>>
>>> The actual 'metadata' slot was meant to be general, in order to
>>> accommodate all needs. If a particular type of data requires a certain
>>> structure, then additional formal classes may be necessary.  For example,
>>> gene expression RNA-seq may want a featureData equivalent annotating each
>>> transcript, whereas with ChIP-seq data, that sort of structure would make
>>> less sense, short of some additional assumptions.
>>>
>>
>> I agree completely.  Our task is to think/experiment about how to suitably
>> specialize these structures for most effective downstream use.  Reuse by
>> multiple downstream toolchains would be great.
>>
>>
>
>>> Michael
>>>
>>>> > sessionInfo()
>>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>>>> x86_64-apple-darwin10.2.0
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
>>>> [10] digest_0.4.1
>>>>
>>>>
>>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>>>>
>>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>>>> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <[email protected]> wrote:
>>>>> >
>>>>> >> Following a recent thread, I have also found it convenient to store
>>>>> >> nextgen data as RangedData instead of ShortRead objects. They require
>>>>> >> far less memory and make it feasible to work with several samples at
>>>>> >> the same time (on my 8 GB RAM desktop I can load 2 ShortRead objects at
>>>>> >> the most; with RangedData I haven't hit the upper limit yet).
>>>>> >>
>>>>> >> I am thinking about taking this idea a step further: RangedDataList
>>>>> >> allows storing info from several samples (e.g. IP and control) in a
>>>>> >> single object. The only problem is that RangedDataList does not store
>>>>> >> information about the samples, e.g. the phenoData we're used to in
>>>>> >> ExpressionSet objects. My idea is to define something like a
>>>>> >> "SequenceSet" class, which would contain a RangedDataList with the
>>>>> >> ranges, a phenoData with sample information, and possibly also
>>>>> >> information about the experiment (e.g. with the MIAME analog for
>>>>> >> sequencing, MIASEQE).
>>>>> >>
>>>>> >> The thing is I don't want to re-invent the wheel. I haven't seen that
>>>>> >> this is implemented yet, but is someone working on it? Any
>>>>> >> criticism/ideas?
>>>>> >>
>>>>> >>
>>>>> > RangedDataList already supports this. See the 'elementMetadata' and
>>>>> > 'metadata' slots in the Sequence class.
>>>>>
>>>>> Hi David et al.,
>>>>>
>>>>> I've also found the elementMetadata slot excellent for this purpose.
>>>>> The ShortRead data objects retain sequence and quality information;
>>>>> this information is often not needed after a certain point in the
>>>>> analysis.
>>>>>
>>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>>>>> GRanges class that is more fastidious about strand information (maybe a
>>>>> plus?) and conforms more to an 'I am a rectangular data structure' world
>>>>> view. Also the GappedAlignments class for efficiently representing large
>>>>> numbers of reads.
>>>>>
>>>>> Martin
>>>>>
>>>>> >
>>>>> > Michael
>>>>> >
>>>>> >
>>>>> >
>>>>> >> Best,
>>>>> >>
>>>>> >> David
>>>>> >>
>>>>> >> --
>>>>> >> David Rossell, PhD
>>>>> >> Manager, Bioinformatics and Biostatistics unit
>>>>> >> IRB Barcelona
>>>>> >> Tel (+34) 93 402 0217
>>>>> >> Fax (+34) 93 402 0257
>>>>> >> http://www.irbbarcelona.org/bioinformatics
>>>>> >>
>>>>> >> _______________________________________________
>>>>> >> Bioc-sig-sequencing mailing list
>>>>> >> [email protected]
>>>>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>> --
>>>>> Martin Morgan
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
