Re: [Bioc-sig-seq] RangedData versus GenomicRanges/GRanges

Janet Young Mon, 01 Nov 2010 10:40:52 -0700

Thank you both - that helps. I think for my current project I'll stick with RangedData (everything is coded already) but this'll help me decide which to use in future.


Janet




On Oct 29, 2010, at 5:22 AM, Michael Lawrence wrote:

Ivan does a pretty good job. But just to summarize and fill in the gaps:

GenomicRanges (an abstract class that is made concrete by GRanges) is essentially a table of ranges, chromosomes (actually generalized to sequence names), and strand. It also has a formally associated "SeqInfo" object which stores the sequence lengths. Then there are metadata columns added by the user, but these are sort of "second class" compared to the RangedData design. The ranges are primary. A GRanges can be placed into a GRangesList, which is the data structure of choice for holding compound ranges, like gene structures and read mappings.

RangedData acts more like a data frame with a formal notion of ranges divided into spaces (chromosomes). Its API is much more like a data frame than that of GenomicRanges: user columns are at the same "level" as the ranges. It can informally hold the same information as GenomicRanges, like strand and a SeqInfo in its metadata. In terms of implementation, RangedData consists of two parallel lists, a RangesList for the ranges and a SplitDataFrameList for the rest of the columns. This means that the data must, as Ivan mentioned, be sorted by chromosome. But there are advantages over GenomicRanges when a RangesList cannot be flattened to a Ranges. These include the ability to store an RleViewsList (preserving the coverage information) or a RangesList of IntervalTree objects (allowing fast interval queries) as the ranges.

Choosing one or the other depends on the use case. For RNA-seq, where one has complex read mappings to complex gene structures, GRanges(Lists) are the best in my opinion. But then for ChIP-seq peaks, where the strand does not matter and the ranges are simple, one might prefer the data frame features of RangedData and its ability to keep the coverage around.

Michael

On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti <[email protected]> wrote:
Hello Janet,

It is a rare pleasure to have the opportunity to enlighten somebody
from the Fred Hutchinson Cancer Research Center about R functionality.

The bottom line is this: GenomicRanges is much more biology-aware
than the generic RangedData class.

GenomicRanges natively stores a strand value per feature. RangedData
does not, unless you create it. GenomicRanges' strand values are very
intuitive: +, -, and *.
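
As a rough illustration of the per-feature strand described above, here is a
minimal sketch. It assumes a current GenomicRanges installation, whose
accessors may differ from the 2010 release discussed in this thread. The
coordinates and the "score" column are made up for the example.

```r
library(GenomicRanges)

# Three toy features; every coordinate here is invented for illustration.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 200), width = 50),
  strand   = c("+", "-", "*"),      # "*" means strand unknown or irrelevant
  score    = c(5.2, 3.1, 8.0)       # hypothetical user metadata column
)

# The strand is a built-in component of the object, not a user column.
strand(gr)

# Keep only the features on the minus strand.
minus_only <- gr[strand(gr) == "-"]
```

The user column travels with the ranges when the object is subset, but it is
reached through a separate accessor (mcols() in current versions).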

GenomicRanges "rows" can be ordered by any "column" even if it ends up
dis-ordering the chromosomes. RangedData can only order features
within each space.
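
The ordering point can be sketched as follows, again assuming a current
GenomicRanges installation and invented data. Sorting on the hypothetical
"score" column interleaves the chromosomes, which a GRanges tolerates.

```r
library(GenomicRanges)

# Toy features grouped by chromosome; values are invented for illustration.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(10, 20, 30), width = 5),
  score    = c(2, 9, 5)             # hypothetical metadata column
)

# Reorder all rows by decreasing score, regardless of chromosome.
by_score <- gr[order(mcols(gr)$score, decreasing = TRUE)]

# The chromosomes are no longer grouped together.
as.character(seqnames(by_score))
```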

GenomicRanges can store the complete list of chromosomes and their
corresponding sizes for your particular organism. You can create a
GenomicRanges instance out of a RangedData without providing
explicitly the list of chromosomes and their sizes. Just do

library(GenomicRanges)
my_gr <- as(my_rd,"GRanges")

The list of chromosomes is gathered on the fly from the features. The
list of chromosome lengths still has to be assigned manually, which is
fine.
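
A minimal sketch of assigning the lengths by hand, assuming a current
GenomicRanges installation. The object is built directly rather than
converted from a RangedData, and the lengths are placeholders, not real
genome sizes.

```r
library(GenomicRanges)

# Toy object; the sequence names are gathered from the features themselves.
gr <- GRanges(
  seqnames = c("chr1", "chrM"),
  ranges   = IRanges(start = c(100, 16000), width = 100)
)

seqlevels(gr)     # names collected from the features
seqlengths(gr)    # all NA until lengths are assigned

# Assign lengths manually; these values are placeholders for illustration.
seqlengths(gr) <- c(chr1 = 1000000L, chrM = 16500L)
```

Leaving seqlengths unassigned (NA) corresponds to the "hybrid" usage
mentioned later in this message.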

Nowadays you can rtracklayer::import() BED directly as GenomicRanges.

Importing large BED into either GenomicRanges or RangedData is, in my
experience, equally slow. There is no difference there.

Why not forget RangedData, then? Its advantage over GenomicRanges
is, also in my experience, that it accepts features mapped beyond the
limits of chromosomes. The most unforgiving example is mitochondrial
DNA. Because it is circular, it naturally gets sequencing reads with
"starts" that are numerically larger than their "ends".

In high throughput sequencing I still use RangedData when
1) I do not care about relatively few misbehaving reads
2) I need my script to run without errors from the GenomicRanges sanity check.

For everyday high throughput sequencing I use GenomicRanges keeping
the chromosome lengths unassigned. It could be called a hybrid.

I hope this helps.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878

On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <[email protected]> wrote:
> Hi,
>
> I've been on a long long vacation, so I'm a bit more out of the loop than I
> usually am.
>
> I've been using RangedData a lot in my code until now to represent sets of
> genomic regions spread over multiple chromosomes, and I've just realized
> that GenomicRanges has a lot of the same characteristics.
>
> I wanted to ask you all
> - whether RangedData and GenomicRanges are pretty much equivalent, or if
> there are functions that exist for one but not the other?
> - whether I can use pretty much the same code and functions if I switch
> everything over to use GenomicRanges?
> - are there subtle differences I should be careful of if I make the switch?
>
> thanks very much,
>
> Janet Young
>
>
> -------------------------------------------------------------------
>
> Dr. Janet Young (Trask lab)
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung  ...at...  fhcrc.org
>
> http://www.fhcrc.org/labs/trask/
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
