Thank you both - that helps. I think for my current project I'll stick with RangedData (everything is coded already) but this'll help me decide which to use in future.

Janet



On Oct 29, 2010, at 5:22 AM, Michael Lawrence wrote:

Ivan does a pretty good job. But just to summarize and fill in the gaps:

GenomicRanges (an abstract class that is made concrete by GRanges) is essentially a table of ranges, chromosomes (actually generalized to sequence names), and strand. It also has a formally associated "SeqInfo" object which stores the sequence lengths. Then there are metadata columns added by the user, but these are sort of "second class" compared to the RangedData design. The ranges are primary. A GRanges can be placed into a GRangesList, which is the data structure of choice for holding compound ranges, like gene structures and read mappings.
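
To make the description above concrete, here is a minimal sketch of a GRanges and a GRangesList. The coordinates, the "score" column, and the names "geneA" and "geneB" are made up for illustration and are not from any real dataset.

```r
library(GenomicRanges)

# Three ranges with sequence names and strand.
# "score" is a hypothetical user metadata column.
gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
              ranges   = IRanges(start = c(101, 301, 501),
                                 end   = c(200, 400, 600)),
              strand   = c("+", "+", "-"),
              score    = c(5, 7, 2))

# Compound ranges (e.g. the exons of two genes) grouped into a GRangesList.
# "geneA" and "geneB" are placeholder names.
grl <- GRangesList(geneA = gr[1:2], geneB = gr[3])
```

Here the ranges, sequence names, and strand are the core of the object, while "score" rides along as a metadata column.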

RangedData acts more like a data frame with a formal notion of ranges divided into spaces (chromosomes). Its API is much closer to that of a data frame than the GenomicRanges API is: user columns are at the same "level" as the ranges. It can informally hold the same information as GenomicRanges, like strand and a SeqInfo, in its metadata. In terms of implementation, RangedData consists of two parallel lists, a RangesList for the ranges and a SplitDataFrameList for the rest of the columns. This means that the data must, as Ivan mentioned, be sorted by chromosome. But there are advantages over GenomicRanges when a RangesList cannot be flattened to a Ranges. These include the ability to store an RleViewsList (preserving the coverage information) or a RangesList of IntervalTree objects (allowing fast interval queries) as the ranges.

Choosing one or the other depends on the use case. For RNA-seq, where one has complex read mappings to complex gene structures, GRanges(Lists) are the best in my opinion. But for ChIP-seq peaks, where the strand does not matter and the ranges are simple, one might prefer the data frame features of RangedData and its ability to keep the coverage around.

Michael

On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti <[email protected]> wrote:
Hello Janet,

It is a rare pleasure to have the opportunity to enlighten somebody
from the Fred Hutchinson Cancer Research Center about R functionality.

The bottom line is this: GenomicRanges is much more biology-aware
than the generic RangedData class.

GenomicRanges natively stores a strand value per feature. RangedData
does not, unless you create it. GenomicRanges' strand values are very
intuitive: +, -, and *.

GenomicRanges "rows" can be ordered by any "column" even if it ends up
dis-ordering the chromosomes. RangedData can only order features
within each space.
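
As a small sketch of the ordering point, assuming a toy GRanges with made-up coordinates, sorting by start position is free to interleave the chromosomes:

```r
library(GenomicRanges)

# Toy features on two chromosomes (coordinates are invented).
gr <- GRanges(seqnames = c("chr1", "chr2", "chr1"),
              ranges   = IRanges(start = c(300, 50, 10), width = 20))

# Order all features by start, ignoring which chromosome they sit on.
by_start <- gr[order(start(gr))]

# Chromosome of each feature after sorting.
chrom_order <- as.character(seqnames(by_start))
```

After sorting, the chr2 feature sits between the two chr1 features.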

GenomicRanges can store the complete list of chromosomes and their
corresponding sizes for your particular organism. You can create a
GenomicRanges instance out of a RangedData without providing
explicitly the list of chromosomes and their sizes. Just do

library(GenomicRanges)
my_gr <- as(my_rd,"GRanges")

The list of chromosomes is gathered on the fly from the features. The
list of chromosome lengths still has to be assigned manually, which is
fine.
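
Assigning the lengths manually might look like the sketch below. The toy my_gr stands in for the result of the conversion above, and the lengths 1000 and 2000 are placeholders, not real chromosome sizes.

```r
library(GenomicRanges)

# Toy object standing in for as(my_rd, "GRanges"); coordinates are invented.
my_gr <- GRanges(seqnames = c("chr1", "chr2", "chr1"),
                 ranges   = IRanges(start = c(100, 200, 300), width = 50),
                 strand   = c("+", "-", "*"))

# Lengths are NA until they are assigned.
lengths_before <- seqlengths(my_gr)

# Assign lengths by sequence name (placeholder values, not real sizes).
seqlengths(my_gr) <- c(chr1 = 1000, chr2 = 2000)
```

The names of the vector must match the sequence names already present in the object.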

Nowadays you can rtracklayer::import() BED directly as GenomicRanges.

Importing large BED into either GenomicRanges or RangedData is, in my
experience, equally slow. There is no difference there.

Why not forget RangedData, then? Its advantage over GenomicRanges
is, also in my experience, that it accepts features mapped beyond the
limits of chromosomes. The most unforgiving example is mitochondrial
DNA. Because it is circular, it naturally gets sequencing reads with
"starts" that are numerically larger than their "ends".

In high throughput sequencing I still use RangedData when
1) I do not care about relatively few misbehaving reads
2) I need my script to run without errors from the GenomicRanges sanity checks.

For everyday high throughput sequencing I use GenomicRanges keeping
the chromosome lengths unassigned. It could be called a hybrid.

I hope this helps.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878



On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <[email protected]> wrote:
> Hi,
>
> I've been on a long long vacation, so I'm a bit more out of the loop than I
> usually am.
>
> I've been using RangedData a lot in my code until now to represent sets of
> genomic regions spread over multiple chromosomes, and I've just realized
> that GenomicRanges has a lot of the same characteristics.
>
> I wanted to ask you all
> - whether RangedData and GenomicRanges are pretty much equivalent, or if
> there are functions that exist for one but not the other?
> - whether I can use pretty much the same code and functions if I switch
> everything over to use GenomicRanges?
> - are there subtle differences I should be careful of if I make the switch?
>
> thanks very much,
>
> Janet Young
>
>
> -------------------------------------------------------------------
>
> Dr. Janet Young (Trask lab)
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung  ...at...  fhcrc.org
>
> http://www.fhcrc.org/labs/trask/
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
