Re: [Bioc-sig-seq] RangedData versus GenomicRanges/GRanges

Michael Lawrence Mon, 01 Nov 2010 13:46:01 -0700

On Mon, Nov 1, 2010 at 12:20 PM, Martin Morgan <[email protected]> wrote:


> On 11/01/2010 10:40 AM, Janet Young wrote:
> > Thank you both - that helps.  I think for my current project I'll stick
> > with RangedData (everything is coded already) but this'll help me decide
> > which to use in future.
>
> Hi Janet et al.,
>
> Two one-cent pieces on this.
>
> Ivan mentions
>
> > limits of chromosomes. The most unforgiving example is mitochondrial
> > DNA. Because it is circular, it naturally gets sequencing reads with
> > "starts" that are numerically larger than its "ends".
>
> GRanges is becoming circularity aware, and there are hints of that in
> ?GRanges and elsewhere, but this will only be fully developed in the
> next release.
>
> Michael says
>
> >> Then there are
> >> metadata columns added by the user, but these are sort of "second
> >> class" compared to the RangedData design. The ranges are primary.
>
> The 'second class' label is a bit confusing to me. I view a GRanges instance
> gr as consisting of two parts, the ranges(gr) and the 'user data'
> values(gr). I view this as a separation of information, rather than
> subordinating one source of data to another.
>
>

Many of the methods on GRanges are Ranges methods, such as findOverlaps,
so one could think of GRanges as more of a Ranges object with a formal
notion of chromosome and strand. That is, one can call start(gr) as an
alternative to start(ranges(gr)). However, to treat the GRanges as a
dataset, you need to call values() first, i.e., values(gr)$foo instead of
gr$foo. Thus, in terms of API, GRanges is very much ranges first, metadata
columns second.

With RangedData, you can call start(rd), or rd$foo, or subset(rd, foo), or
colnames(rd) etc. It acts much more like a data frame.
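
As a rough sketch of the GRanges side of that comparison (the toy ranges and
the 'score' column are made up for illustration, and this assumes a recent
GenomicRanges release, where mcols() is the accessor for what is called
values() above):

```r
library(GenomicRanges)

# Hypothetical toy data: three ranges plus one user metadata column, 'score'.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chrM"),
  ranges   = IRanges(start = c(10, 50, 5), end = c(20, 80, 15)),
  strand   = c("+", "-", "*"),
  score    = c(1.5, 2.0, 0.3)
)

# Range accessors work directly on the GRanges object ...
starts <- start(gr)

# ... and give the same answer as going through ranges() first.
starts_via_ranges <- start(ranges(gr))

# User columns are reached through the metadata accessor first.
scores <- mcols(gr)$score
```

The point is only the shape of the calls: the range accessors apply to gr
itself, while the user column sits one accessor away.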


> Martin
>
> >
> > Janet
> >
> >
> >
> > On Oct 29, 2010, at 5:22 AM, Michael Lawrence wrote:
> >
> >> Ivan does a pretty good job. But just to summarize and fill in the gaps:
> >>
> >> GenomicRanges (an abstract class that is made concrete by GRanges) is
> >> essentially a table of ranges, chromosomes (actually generalized to
> >> sequence names), and strand. It also has a formally associated
> >> "SeqInfo" object which stores the sequence lengths. Then there are
> >> metadata columns added by the user, but these are sort of "second
> >> class" compared to the RangedData design. The ranges are primary. A
> >> GRanges can be placed into a GRangesList, which is the data structure
> >> of choice for holding compound ranges, like gene structures and read
> >> mappings.
> >>
> >> RangedData acts more like a data frame with a formal notion of ranges
> >> divided into spaces (chromosomes). Its API is much more like that of a
> >> data frame than the GenomicRanges API is: user columns are at the
> >> same "level" as the ranges. It can informally hold the same
> >> information as GenomicRanges, like strand and a SeqInfo in its
> >> metadata. In terms of implementation, RangedData consists of two
> >> parallel lists, a RangesList for the ranges and a SplitDataFrameList
> >> for the rest of the columns. This means that the data must, as Ivan
> >> mentioned, be sorted by chromosome. But there are advantages over
> >> GenomicRanges when a RangesList cannot be flattened to a Ranges. These
> >> include the ability to store an RleViewsList (preserving the coverage
> >> information) or a RangesList of IntervalTree objects (allowing fast
> >> interval queries) as the ranges.
> >>
> >> Choosing one or the other depends on the use case. For RNA-seq, where
> >> one has complex read mappings to complex gene structures,
> >> GRanges(Lists) are the best in my opinion. But then for ChIP-seq
> >> peaks, where the strand does not matter and the ranges are simple, one
> >> might prefer the data frame features of RangedData and its ability to
> >> keep the coverage around.
> >>
> >> Michael
> >>
> >> On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti <[email protected]>
> >> wrote:
> >> Hello Janet,
> >>
> >> It is a rare pleasure to have the opportunity to enlighten somebody
> >> from the Fred Hutchinson Cancer Research Center about R functionality.
> >>
> >> The bottom line is this: GenomicRanges is much more biology-aware
> >> than the generic RangedData class.
> >>
> >> GenomicRanges natively stores a strand value per feature. RangedData
> >> does not, unless you create it. GenomicRanges' strand values are very
> >> intuitive: +, -, and *.
> >>
> >> GenomicRanges "rows" can be ordered by any "column" even if it ends up
> >> dis-ordering the chromosomes. RangedData can only order features
> >> within each space.
> >>
> >> GenomicRanges can store the complete list of chromosomes and their
> >> corresponding sizes for your particular organism. You can create a
> >> GenomicRanges instance out of a RangedData without providing
> >> explicitly the list of chromosomes and their sizes. Just do
> >>
> >> library(GenomicRanges)
> >> my_gr <- as(my_rd,"GRanges")
> >>
> >> The list of chromosomes is gathered on the fly from the features. The
> >> list of chromosome lengths still has to be assigned manually, which is
> >> fine.
> >>
> >> Nowadays you can rtracklayer::import() BED directly as GenomicRanges.
> >>
> >> Importing large BED into either GenomicRanges or RangedData is, in my
> >> experience, equally slow. There is no difference there.
> >>
> >> Why not forget RangedData, then? The advantage over GenomicRanges
> >> is, also in my experience, that it accepts features mapped beyond the
> >> limits of chromosomes. The most unforgiving example is mitochondrial
> >> DNA. Because it is circular, it naturally gets sequencing reads with
> >> "starts" that are numerically larger than its "ends".
> >>
> >> In high throughput sequencing I still use RangedData when
> >> 1) I do not care about relatively few misbehaving reads
> >> 2) I need my script to run without errors from the GenomicRanges
> >> sanity checks.
> >>
> >> For everyday high throughput sequencing I use GenomicRanges keeping
> >> the chromosome lengths unassigned. It could be called a hybrid.
> >>
> >> I hope this helps.
> >>
> >> Ivan
> >>
> >> Ivan Gregoretti, PhD
> >> National Institute of Diabetes and Digestive and Kidney Diseases
> >> National Institutes of Health
> >> 5 Memorial Dr, Building 5, Room 205.
> >> Bethesda, MD 20892. USA.
> >> Phone: 1-301-496-1016 and 1-301-496-1592
> >> Fax: 1-301-496-9878
> >>
> >>
> >>
> >> On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > I've been on a long long vacation, so I'm a bit more out of the loop
> >> > than I usually am.
> >> >
> >> > I've been using RangedData a lot in my code until now to represent
> >> > sets of genomic regions spread over multiple chromosomes, and I've just
> >> > realized that GenomicRanges has a lot of the same characteristics.
> >> >
> >> > I wanted to ask you all
> >> > - whether RangedData and GenomicRanges are pretty much equivalent, or if
> >> >   there are functions that exist for one but not the other?
> >> > - whether I can use pretty much the same code and functions if I switch
> >> >   everything over to use GenomicRanges?
> >> > - are there subtle differences I should be careful of if I make the
> >> >   switch?
> >> >
> >> > thanks very much,
> >> >
> >> > Janet Young
> >> >
> >> >
> >> > -------------------------------------------------------------------
> >> >
> >> > Dr. Janet Young (Trask lab)
> >> >
> >> > Fred Hutchinson Cancer Research Center
> >> > 1100 Fairview Avenue N., C3-168,
> >> > P.O. Box 19024, Seattle, WA 98109-1024, USA.
> >> >
> >> > tel: (206) 667 1471 fax: (206) 667 6524
> >> > email: jayoung  ...at...  fhcrc.org
> >> >
> >> > http://www.fhcrc.org/labs/trask/
> >> >
> >> > _______________________________________________
> >> > Bioc-sig-sequencing mailing list
> >> > [email protected]
> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >> >
> >>
> >
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>
