Re: [Bioc-sig-seq] RangedData versus GenomicRanges/GRanges

Michael Lawrence Fri, 29 Oct 2010 05:23:03 -0700

Ivan does a pretty good job. But just to summarize and fill in the gaps:

GenomicRanges (an abstract class that is made concrete by GRanges) is
essentially a table of ranges, chromosomes (actually generalized to sequence
names), and strand. It also has a formally associated "SeqInfo" object which
stores the sequence lengths. Then there are metadata columns added by the
user, but these are sort of "second class" compared to the RangedData
design. The ranges are primary. A GRanges can be placed into a GRangesList,
which is the data structure of choice for holding compound ranges, like gene
structures and read mappings.
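
As a minimal sketch of that layout (the coordinates, the "score" and "gene"
columns, and the sequence lengths are made up for illustration, and the
mcols() accessor assumes a more recent GenomicRanges release than the one
current at the time of this thread):

```r
library(GenomicRanges)

# Ranges, sequence names and strand are the primary fields;
# score and gene are user-supplied metadata columns.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 200), end = c(150, 600, 260)),
  strand   = c("+", "-", "*"),
  score    = c(5.2, 3.1, 8.7),
  gene     = c("geneA", "geneA", "geneB")
)

# Sequence lengths live in the attached sequence information
# (hypothetical lengths, assigned by hand).
seqlengths(gr) <- c(chr1 = 1000L, chr2 = 800L)

# Group the ranges into a GRangesList, one element per gene,
# as one might for a compound structure such as a set of exons.
grl <- split(gr, mcols(gr)$gene)
```

Here grl has one GRanges per gene, which is roughly the shape one would use
for gene structures or gapped read alignments.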


RangedData acts more like a data frame with a formal notion of ranges
divided into spaces (chromosomes). Its API is much more like that of a data
frame than GenomicRanges' is: user columns sit at the same "level" as the
ranges. It can informally hold the same information as GenomicRanges,
like strand and a SeqInfo in its metadata. In terms of implementation,
RangedData consists of two parallel lists, a RangesList for the ranges and a
SplitDataFrameList for the rest of the columns. This means that the data
must, as Ivan mentioned, be sorted by chromosome. But there are advantages
over GenomicRanges when a RangesList cannot be flattened to a Ranges. These
include the ability to store an RleViewsList (preserving the coverage
information) or a RangesList of IntervalTree objects (allowing fast interval
queries) as the ranges.

Choosing one or the other depends on the use case. For RNA-seq, where one
has complex read mappings to complex gene structures, GRanges(Lists) are the
best in my opinion. But then for ChIP-seq peaks, where the strand does not
matter and the ranges are simple, one might prefer the data frame features
of RangedData and its ability to keep the coverage around.
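
To illustrate what "keeping the coverage around" can look like, here is a
rough sketch using coverage() and slice(). The reads, the threshold of 2 and
the chromosome length are invented, and the sketch stops at the RleViewsList
rather than wrapping it in a RangedData, since the RangedData constructor may
not be available in newer IRanges releases:

```r
library(GenomicRanges)

# Three hypothetical overlapping reads on one chromosome.
reads <- GRanges(
  seqnames = rep("chr1", 3),
  ranges   = IRanges(start = c(1, 5, 8), width = 10)
)
seqlengths(reads) <- c(chr1 = 30L)

# Per-base coverage: an RleList with one Rle per chromosome.
cov <- coverage(reads)

# Regions with coverage of at least 2, returned as an RleViewsList.
# Each view still points at the underlying coverage vector.
peaks <- slice(cov, lower = 2)

# Peak boundaries and the maximum coverage within each peak.
start(peaks[["chr1"]])
end(peaks[["chr1"]])
viewMaxs(peaks[["chr1"]])
```

Because the views retain the coverage they were cut from, summaries such as
viewMaxs() can be computed later without recomputing the coverage.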

Michael

On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti <[email protected]> wrote:

> Hello Janet,
>
> It is a rare pleasure to have the opportunity to enlighten somebody
> from the Fred Hutchinson Cancer Research Center about R functionality.
>
> The bottom line is this: GenomicRanges is much more biology-aware
> than the generic RangedData class.
>
> GenomicRanges natively stores a strand value per feature. RangedData
> does not, unless you create it. GenomicRanges' strand values are very
> intuitive: +, -, and *.
>
> GenomicRanges "rows" can be ordered by any "column" even if it ends up
> dis-ordering the chromosomes. RangedData can only order features
> within each space.
>
> GenomicRanges can store the complete list of chromosomes and their
> corresponding sizes for your particular organism. You can create a
> GenomicRanges instance out of a RangedData without providing
> explicitly the list of chromosomes and their sizes. Just do
>
> library(GenomicRanges)
> my_gr <- as(my_rd,"GRanges")
>
> The list of chromosomes is gathered on the fly from the features. The
> list of chromosome lengths still has to be assigned manually, which is
> fine.
>
> Nowadays you can rtracklayer::import() BED directly as GenomicRanges.
>
> Importing large BED into either GenomicRanges or RangedData is, in my
> experience, equally slow. There is no difference there.
>
> Why not forget RangedData, then? Its advantage over GenomicRanges
> is, also in my experience, that it accepts features mapped beyond the
> limits of chromosomes. The most unforgiving example is mitochondrial
> DNA. Because it is circular, it naturally gets sequencing reads with
> "starts" that are numerically larger than their "ends".
>
> In high throughput sequencing I still use RangedData when
> 1) I do not care about relatively few misbehaving reads
> 2) I need my script to run without errors from GenomicRanges sanity check.
>
> For everyday high throughput sequencing I use GenomicRanges keeping
> the chromosome lengths unassigned. It could be called a hybrid.
>
> I hope this helps.
>
> Ivan
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
>
>
> On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <[email protected]> wrote:
> > Hi,
> >
> > I've been on a long long vacation, so I'm a bit more out of the loop
> > than I usually am.
> >
> > I've been using RangedData a lot in my code until now to represent
> > sets of genomic regions spread over multiple chromosomes, and I've just
> > realized that GenomicRanges has a lot of the same characteristics.
> >
> > I wanted to ask you all
> > - whether RangedData and GenomicRanges are pretty much equivalent, or if
> > there are functions that exist for one but not the other?
> > - whether I can use pretty much the same code and functions if I switch
> > everything over to use GenomicRanges?
> > - are there subtle differences I should be careful of if I make the
> > switch?
> >
> > thanks very much,
> >
> > Janet Young
> >
> >
> > -------------------------------------------------------------------
> >
> > Dr. Janet Young (Trask lab)
> >
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Avenue N., C3-168,
> > P.O. Box 19024, Seattle, WA 98109-1024, USA.
> >
> > tel: (206) 667 1471 fax: (206) 667 6524
> > email: jayoung  ...at...  fhcrc.org
> >
> > http://www.fhcrc.org/labs/trask/
> >
> > _______________________________________________
> > Bioc-sig-sequencing mailing list
> > [email protected]
> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >
>
