Hi Michael, Thanks for your suggestions so far ... being able to slice a Range object w/ another Range object somehow fell off of my radar.
-steve On Thu, Sep 10, 2009 at 6:25 PM, Michael Lawrence<[email protected]> wrote: > > > On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou > <[email protected]> wrote: >> >> Hi all, >> >> I'm finding it hard to shed some of my 10 thumbs when using the >> IRanges/Rle classes, so can someone suggest the "correct" way to do what my >> code does below? >> >> My goal is to use GenomeGraphs to plot a picture of my gene model, along >> with the reads from 1 or more samples above it. I've actually got that >> working, but I feel there's a more idiomatic way to do this, so if you have >> any comments, I'd be interested in hearing them. >> >> I'll paste the function in its entirety at the end of the email. It works >> as advertised, but perhaps not done in the most efficient way. For brevity's >> sake, however, I've distilled the specific code for your feedback. >> >> Anyway, here's the gist: >> >> 1. sample.reads : An IRanges object of the reads aligned to a particular >> chromosome from a particular rna-seq sample: >> >> R> > head(sample.reads) >> IRanges instance: >> start end width >> [1] 11869 11900 32 >> [2] 11882 11913 32 >> [3] 12245 12276 32 >> [4] 12554 12585 32 >> [5] 12555 12586 32 >> [6] 12557 12588 32 >> >> 2. bounds: An IRange object that indicates the start and end position of >> my gene -- I want all the reads that land here: >> >> R> bounds >> IRanges instance: >> start end width >> [1] 12040238 12073571 33334 >> >> Goal: >> I wan to build two vectors: >> (i) A vector of positions on my chromosome (having length == to the length >> of my gene) >> (ii) A vector of counts over those positions >> >> Positions with 0 counts are removed from (i) and (ii) >> >> ==== Here's my code ==== >> >> # Get the index of the reads from my sample.reads >> # object that lie within my gene bounds >> which.reads <- subjectHits(overlap(sample.reads, bounds)) >> >> # Slice out these reads >> gene.reads <- reads[which.reads] >> > > The above two steps could be simply: >> gene.reads <- sample.reads[bounds] > >> >> # Build my coverage Rle vector >> cov <- coverage(gene.reads) >> >> # Only keep the part of the Rle that starts @ the start >> # of my gene to the end (I feel like doing this is weird (?)) >> cov <- cov[start(bounds):length(cov)] >> >> # Create the non-Rle coverage vector that I will use down stream >> # to make a GenomeGraph::BaseTrack object >> counts <- numeric(width(bounds)) >> >> # Populate this vector with the correct coverage counts >> counts[1:length(cov)] <- as.vector(cov) >> > > As you say, this looks like it could be done easier using arguments to > coverage(). Maybe someone else could explain this? > >> # Build the base-position vector and filter out positions >> # with 0 counts from both vectors >> keep <- counts != 0 >> >> # This is (i) from above >> bases <- (start(bounds):end(bounds))[keep] >> > > A bit shorter: > bases <- as.integer(bounds)[keep] > >> >> # This is (ii) from above >> counts <- counts[keep] >> >> ===================== >> >> I feel like the coverage interface, using the start/end args would help >> here, no? Wouldn't this have worked to make things a bit easier? >> >> cov <- coverage(gene.reads, start=start(bounds), end=end(bounds)) >> >> How would you do that by using the width/shift args? Or wouldn't you? >> Maybe I'm not clear on what the original start/end args in ``coverage`` are >> meant to do? >> >> Anyway, I was just curious if there was a better way to do what I need to >> do. >> >> Thanks, >> >> -steve >> >> >> ===== Here's the function in all of its glory, use it if you like ===== >> >> Here's a sample call: >> >> plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2), >> gene.id='ENSG00000116688', title="MFN2") >> >> Here's the function: >> >> plotGenomeGraphReads <- function(sample.reads, gene.model=NULL, >> gene.id=NULL, >> biomart=NULL, cols=NULL, title=NULL) { >> # Plots the read counts over a particular gene model. >> # >> # You can pass your own gene.model as an IRange object, or use biomaRt to >> # retrieve the model using the Ensembl Gene ID gene.id >> # >> # Parameters >> # ---------- >> # sample.reads : named list[IRanges] representing reads from separate >> samples >> # over a given chromosome (the one your gene is on!) >> # gene.model : IRanges object of start/end positions of the gene >> # mart : biomaRt, if you want to query ensembl for the gene model >> # cols : the colors you want to use for the reads in the >> respective >> # samples >> if (is.null(gene.model) && is.null(gene.id)) { >> stop("Need to get the gene.model from somewhere.") >> } >> >> # Figure out what type of gene.model we're using and get appropriate >> bounds >> if (!is.null(gene.model)) { >> gm <- makeGeneModel(start=start(gene.model), end=end(gene.model)) >> bounds <- range(gene.model) >> } else { >> if (is.null(biomart)) { >> biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl') >> } >> gm <- makeGene(id=gene.id, type='ensembl_gene_id', biomart=biomart) >> if (is.null(title)) { >> title <- gene.id >> } >> anno <- g...@ens >> bounds <- range(IRanges(start=anno$exon_chrom_start, >> end=anno$exon_chrom_end)) >> } >> >> sample.names <- names(sample.reads) >> >> if (is.null(cols)) { >> cols <- rainbow(length(sample.reads)) >> names(cols) <- sample.names >> } >> >> if (is.null(title)) { >> title <- 'Gene Reads' >> } >> >> stopifnot(length(cols) == length(sample.reads)) >> stopifnot(names(cols) == sample.names) >> >> # Calculate the coverage vectors over the gene model for each sample >> sample.coverage <- lapply(sample.names, function(name) { >> reads <- sample.reads[[name]] >> which.reads <- subjectHits(overlap(reads, bounds)) >> gene.reads <- reads[which.reads] >> cov <- coverage(gene.reads) >> cov <- cov[start(bounds):length(cov)] >> >> counts <- numeric(width(bounds)) >> counts[1:length(cov)] <- as.vector(cov) >> keep <- counts != 0 >> bases <- (start(bounds):end(bounds))[keep] >> counts <- counts[keep] >> list(pos=bases, coverage=counts) >> }) >> names(sample.coverage) <- sample.names >> >> # Make y-axis the same for each sample >> max.counts <- max(sapply(sample.coverage, function(cov) >> max(cov$coverage))) >> >> tracks <- lapply(sample.names, function(name) { >> dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0, max.counts)) >> makeBaseTrack(base=sample.coverage[[name]]$pos, >> value=sample.coverage[[name]]$coverage, >> dp=dp) >> }) >> names(tracks) <- names(sample.reads) >> >> title <- makeTitle(text=title) >> plot.me <- c(title, tracks, list(gm)) >> >> gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds)) >> } >> >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
