Re: [Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Michael Lawrence Thu, 10 Sep 2009 15:26:21 -0700

On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou <
[email protected]> wrote:


> Hi all,
>
> I'm finding it hard to shed some of my 10 thumbs when using the IRanges/Rle
> classes, so can someone suggest the "correct" way to do what my code does
> below?
>
> My goal is to use GenomeGraphs to plot a picture of my gene model, along
> with the reads from 1 or more samples above it. I've actually got that
> working, but I feel there's a more idiomatic way to do this, so if you have
> any comments, I'd be interested in hearing them.
>
> I'll paste the function in its entirety at the end of the email. It works
> as advertised, but perhaps not done in the most efficient way. For brevity's
> sake, however, I've distilled the specific code for your feedback.
>
> Anyway, here's the gist:
>
> 1. sample.reads : An IRanges object of the reads aligned to a particular
> chromosome from a particular rna-seq sample:
>
> R> > head(sample.reads)
> IRanges instance:
>    start   end width
> [1] 11869 11900    32
> [2] 11882 11913    32
> [3] 12245 12276    32
> [4] 12554 12585    32
> [5] 12555 12586    32
> [6] 12557 12588    32
>
> 2. bounds: An IRange object that indicates the start and end position of my
> gene -- I want all the reads that land here:
>
> R> bounds
> IRanges instance:
>       start      end width
> [1] 12040238 12073571 33334
>
> Goal:
> I wan to build two vectors:
> (i) A vector of positions on my chromosome (having length == to the length
> of my gene)
> (ii) A vector of counts over those positions
>
> Positions with 0 counts are removed from (i) and (ii)
>
> ==== Here's my code ====
>
> # Get the index of the reads from my sample.reads
> # object that lie within my gene bounds
> which.reads <- subjectHits(overlap(sample.reads, bounds))
>
> # Slice out these reads
> gene.reads <- reads[which.reads]
>
>
The above two steps could be simply:
> gene.reads <- sample.reads[bounds]


> # Build my coverage Rle vector
> cov <- coverage(gene.reads)
>
> # Only keep the part of the Rle that starts @ the start
> # of my gene to the end (I feel like doing this is weird (?))
> cov <- cov[start(bounds):length(cov)]
>
> # Create the non-Rle coverage vector that I will use down stream
> # to make a GenomeGraph::BaseTrack object
> counts <- numeric(width(bounds))
>
> # Populate this vector with the correct coverage counts
> counts[1:length(cov)] <- as.vector(cov)
>
>
As you say, this looks like it could be done easier using arguments to
coverage(). Maybe someone else could explain this?

# Build the base-position vector and filter out positions
> # with 0 counts from both vectors
> keep <- counts != 0
>
> # This is (i) from above
> bases <- (start(bounds):end(bounds))[keep]
>
>
A bit shorter:
bases <- as.integer(bounds)[keep]


> # This is (ii) from above
> counts <- counts[keep]
>
> =====================
>
> I feel like the coverage interface, using the start/end args would help
> here, no? Wouldn't this have worked to make things a bit easier?
>
> cov <- coverage(gene.reads, start=start(bounds), end=end(bounds))
>
> How would you do that by using the width/shift args? Or wouldn't you? Maybe
> I'm not clear on what the original start/end args in ``coverage`` are meant
> to do?
>
> Anyway, I was just curious if there was a better way to do what I need to
> do.
>
> Thanks,
>
> -steve
>
>
> ===== Here's the function in all of its glory, use it if you like =====
>
> Here's a sample call:
>
> plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2),
>    gene.id='ENSG00000116688', title="MFN2")
>
> Here's the function:
>
> plotGenomeGraphReads <- function(sample.reads, gene.model=NULL, gene.id
> =NULL,
>                                 biomart=NULL, cols=NULL, title=NULL) {
>  # Plots the read counts over a particular gene model.
>  #
>  # You can pass your own gene.model as an IRange object, or use biomaRt to
>  # retrieve the model using the Ensembl Gene ID gene.id
>  #
>  # Parameters
>  # ----------
>  # sample.reads : named list[IRanges] representing reads from separate
> samples
>  #                over a given chromosome (the one your gene is on!)
>  # gene.model   : IRanges object of start/end positions of the gene
>  # mart         : biomaRt, if you want to query ensembl for the gene model
>  # cols         : the colors you want to use for the reads in the
> respective
>  #                samples
>  if (is.null(gene.model) && is.null(gene.id)) {
>    stop("Need to get the gene.model from somewhere.")
>  }
>
>  # Figure out what type of gene.model we're using and get appropriate
> bounds
>  if (!is.null(gene.model)) {
>    gm <- makeGeneModel(start=start(gene.model), end=end(gene.model))
>    bounds <- range(gene.model)
>  } else {
>    if (is.null(biomart)) {
>      biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
>    }
>    gm <- makeGene(id=gene.id, type='ensembl_gene_id', biomart=biomart)
>    if (is.null(title)) {
>      title <- gene.id
>    }
>    anno <- g...@ens
>    bounds <- range(IRanges(start=anno$exon_chrom_start,
>                            end=anno$exon_chrom_end))
>  }
>
>  sample.names <- names(sample.reads)
>
>  if (is.null(cols)) {
>    cols <- rainbow(length(sample.reads))
>    names(cols) <- sample.names
>  }
>
>  if (is.null(title)) {
>    title <- 'Gene Reads'
>  }
>
>  stopifnot(length(cols) == length(sample.reads))
>  stopifnot(names(cols) == sample.names)
>
>  # Calculate the coverage vectors over the gene model for each sample
>  sample.coverage <- lapply(sample.names, function(name) {
>    reads <- sample.reads[[name]]
>    which.reads <- subjectHits(overlap(reads, bounds))
>    gene.reads <- reads[which.reads]
>    cov <- coverage(gene.reads)
>    cov <- cov[start(bounds):length(cov)]
>
>    counts <- numeric(width(bounds))
>    counts[1:length(cov)] <- as.vector(cov)
>    keep <- counts != 0
>    bases <- (start(bounds):end(bounds))[keep]
>    counts <- counts[keep]
>    list(pos=bases, coverage=counts)
>  })
>  names(sample.coverage) <- sample.names
>
>  # Make y-axis the same for each sample
>  max.counts <- max(sapply(sample.coverage, function(cov)
> max(cov$coverage)))
>
>  tracks <- lapply(sample.names, function(name) {
>    dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0, max.counts))
>    makeBaseTrack(base=sample.coverage[[name]]$pos,
>                  value=sample.coverage[[name]]$coverage,
>                  dp=dp)
>  })
>  names(tracks) <- names(sample.reads)
>
>  title <- makeTitle(text=title)
>  plot.me <- c(title, tracks, list(gm))
>
>  gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds))
> }
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  |  Memorial Sloan-Kettering Cancer Center
>  |  Weill Medical College of Cornell University
> Contact Info: 
> http://cbio.mskcc.org/~lianos/contact<http://cbio.mskcc.org/%7Elianos/contact>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Reply via email to