On Mon, Nov 1, 2010 at 12:20 PM, Martin Morgan <[email protected]> wrote:
> On 11/01/2010 10:40 AM, Janet Young wrote:
> > Thank you both - that helps. I think for my current project I'll stick
> > with RangedData (everything is coded already) but this'll help me decide
> > which to use in future.
>
> Hi Janet et al.,
>
> Two one-cent pieces on this.
>
> Ivan mentions
>
> > limits of chromosomes. The most unforgiving example is mitochondrial
> > DNA. Because it is circular, it naturally gets sequencing reads with
> > "starts" that are numerically larger than its "ends".
>
> GRanges is becoming circularity aware, and there are hints of that in
> ?GRanges and elsewhere, but this will only be fully developed in the
> next release.
>
> Michael says
>
> >> Then there are
> >> metadata columns added by the user, but these are sort of "second
> >> class" compared to the RangedData design. The ranges are primary.
>
> The 'second class' is confusing to me a bit. I view a GRanges instance
> gr as consisting of two parts, the ranges(gr) and the 'user data'
> values(gr). I view this as a separation of information, rather than
> subordinating one source of data to another.
>

Many of the methods on GRanges are Ranges methods, such as findOverlaps,
so one could think of GRanges as more of a Ranges object with a formal
notion of chromosome and strand. That is, one can call start(gr) as an
alternative to start(ranges(gr)). However, to treat the GRanges as a
dataset, you need to call values() first, i.e., values(gr)$foo instead
of gr$foo. Thus, in terms of API, GRanges is very much ranges first,
metadata columns second.

With RangedData, you can call start(rd), or rd$foo, or subset(rd, foo),
or colnames(rd), etc. It acts much more like a data frame.

> Martin
>
> > Janet
> >
> > On Oct 29, 2010, at 5:22 AM, Michael Lawrence wrote:
> >
> >> Ivan does a pretty good job.
> >> But just to summarize and fill in the gaps:
> >>
> >> GenomicRanges (an abstract class that is made concrete by GRanges) is
> >> essentially a table of ranges, chromosomes (actually generalized to
> >> sequence names), and strand. It also has a formally associated
> >> "SeqInfo" object which stores the sequence lengths. Then there are
> >> metadata columns added by the user, but these are sort of "second
> >> class" compared to the RangedData design. The ranges are primary. A
> >> GRanges can be placed into a GRangesList, which is the data structure
> >> of choice for holding compound ranges, like gene structures and read
> >> mappings.
> >>
> >> RangedData acts more like a data frame with a formal notion of ranges
> >> divided into spaces (chromosomes). Its API is much more like a data
> >> frame than that of GenomicRanges: user columns are at the same
> >> "level" as the ranges. It can informally hold the same information
> >> as GenomicRanges, like strand and a SeqInfo in its metadata. In terms
> >> of implementation, RangedData consists of two parallel lists, a
> >> RangesList for the ranges and a SplitDataFrameList for the rest of
> >> the columns. This means that the data must, as Ivan mentioned, be
> >> sorted by chromosome. But there are advantages over GenomicRanges
> >> when a RangesList cannot be flattened to a Ranges. These include the
> >> ability to store an RleViewsList (preserving the coverage
> >> information) or a RangesList of IntervalTree objects (allowing fast
> >> interval queries) as the ranges.
> >>
> >> Choosing one or the other depends on the use case. For RNA-seq, where
> >> one has complex read mappings to complex gene structures,
> >> GRanges(Lists) are the best in my opinion. But then for ChIP-seq
> >> peaks, where the strand does not matter and the ranges are simple,
> >> one might prefer the data frame features of RangedData and its
> >> ability to keep the coverage around.
> >>
> >> Michael
> >>
> >> On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti <[email protected]>
> >> wrote:
> >> Hello Janet,
> >>
> >> It is a rare pleasure to have the opportunity to enlighten somebody
> >> from the Fred Hutchinson Cancer Research Center about R functionality.
> >>
> >> The bottom line is this: GenomicRanges is much more biology-aware
> >> than the generic RangedData class.
> >>
> >> GenomicRanges natively stores a strand value per feature. RangedData
> >> does not, unless you create it. GenomicRanges' strand values are very
> >> intuitive: +, -, and *.
> >>
> >> GenomicRanges "rows" can be ordered by any "column" even if it ends up
> >> dis-ordering the chromosomes. RangedData can only order features
> >> within each space.
> >>
> >> GenomicRanges can store the complete list of chromosomes and their
> >> corresponding sizes for your particular organism. You can create a
> >> GenomicRanges instance out of a RangedData without explicitly
> >> providing the list of chromosomes and their sizes. Just do
> >>
> >> library(GenomicRanges)
> >> my_gr <- as(my_rd, "GRanges")
> >>
> >> The list of chromosomes is gathered on the fly from the features. The
> >> list of chromosome lengths still has to be assigned manually, which is
> >> fine.
> >>
> >> Nowadays you can rtracklayer::import() BED directly as GenomicRanges.
> >>
> >> Importing large BED into either GenomicRanges or RangedData is, in my
> >> experience, equally slow. There is no difference there.
> >>
> >> Why not forget RangedData then? Its advantage over GenomicRanges
> >> is, also in my experience, that it accepts features mapped beyond the
> >> limits of chromosomes. The most unforgiving example is mitochondrial
> >> DNA. Because it is circular, it naturally gets sequencing reads with
> >> "starts" that are numerically larger than its "ends".
> >>
> >> In high throughput sequencing I still use RangedData when
> >> 1) I do not care about relatively few misbehaving reads
> >> 2) I need my script to run without errors from GenomicRanges sanity
> >> check.
> >>
> >> For everyday high throughput sequencing I use GenomicRanges keeping
> >> the chromosome lengths unassigned. It could be called a hybrid.
> >>
> >> I hope this helps.
> >>
> >> Ivan
> >>
> >> Ivan Gregoretti, PhD
> >> National Institute of Diabetes and Digestive and Kidney Diseases
> >> National Institutes of Health
> >> 5 Memorial Dr, Building 5, Room 205.
> >> Bethesda, MD 20892. USA.
> >> Phone: 1-301-496-1016 and 1-301-496-1592
> >> Fax: 1-301-496-9878
> >>
> >> On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > I've been on a long long vacation, so I'm a bit more out of the
> >> > loop than I usually am.
> >> >
> >> > I've been using RangedData a lot in my code until now to represent
> >> > sets of genomic regions spread over multiple chromosomes, and I've
> >> > just realized that GenomicRanges has a lot of the same
> >> > characteristics.
> >> >
> >> > I wanted to ask you all
> >> > - whether RangedData and GenomicRanges are pretty much equivalent,
> >> >   or if there are functions that exist for one but not the other?
> >> > - whether I can use pretty much the same code and functions if I
> >> >   switch everything over to use GenomicRanges?
> >> > - are there subtle differences I should be careful of if I make
> >> >   the switch?
> >> >
> >> > thanks very much,
> >> >
> >> > Janet Young
> >> >
> >> > -------------------------------------------------------------------
> >> >
> >> > Dr. Janet Young (Trask lab)
> >> >
> >> > Fred Hutchinson Cancer Research Center
> >> > 1100 Fairview Avenue N., C3-168,
> >> > P.O. Box 19024, Seattle, WA 98109-1024, USA.
> >> >
> >> > tel: (206) 667 1471 fax: (206) 667 6524
> >> > email: jayoung ...at... fhcrc.org
> >> >
> >> > http://www.fhcrc.org/labs/trask/
> >> >
> >> > _______________________________________________
> >> > Bioc-sig-sequencing mailing list
> >> > [email protected]
> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
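
A minimal sketch of the accessor difference described in the reply at the
top of this message ("ranges first, metadata columns second"), assuming a
working GenomicRanges installation. The object gr, its coordinates, and
the score column are invented for illustration; start(), ranges() and
values() are the accessors named in the thread. The last step illustrates
Ivan's point that rows can be ordered by any column.

```r
library(GenomicRanges)  # also attaches IRanges

# Toy GRanges: three ranges on two sequences, plus one user column.
# All names and numbers here are made up for illustration.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 200), end = c(150, 520, 260)),
  strand   = c("+", "-", "*"),
  score    = c(5.2, 8.7, 3.1)
)

# Ranges first: range accessors work directly on the GRanges,
# so these two calls should return the same integer vector.
identical(start(gr), start(ranges(gr)))

# Metadata columns second: user columns are reached through values().
values(gr)$score

# Rows can be ordered by a user column even if that interleaves
# the chromosomes.
by_score <- gr[order(values(gr)$score)]
as.character(seqnames(by_score))
```

RangedData is left out of the sketch on purpose, since the comparison in
the thread only needs the GRanges side to show where values() comes in.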
