Martin, This %in% method looks great. Will this become part of ShortRead? Michael
On Wed, Sep 2, 2009 at 11:19 AM, Martin Morgan <[email protected]> wrote: > Ivan Gregoretti wrote: > > Thank you very much everybody. It works great. > > > > It may be trivial for most list subscribers but for the record I'd > > like to describe the whole mini-workflow. Notice that one command is > > optional. > > Thanks Ivan for posting this, it's really helpful to see these work > flows in action. One small change to my %in% method, for the record > > > > > Ivan > > > > suppressMessages(library(ShortRead)) > > suppressMessages(library(rtracklayer)) > > > > setMethod("%in%", c("AlignedRead", "RangesList"), > > function(x, table) > > { > > ## consider only sensible alignemnts > > chr <- chromosome(x) > > pos <- position(x) > > wd <- width(x) > > notNA <- !(is.na(chr) | is.na(pos) | is.na(wd)) > > ## find overlap > > chr <- chr[notNA] > > rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr), > > width=split(wd[notNA], chr))) > > Patrick suggested > > rl <- split(IRanges(start=pos[notNA], width=wd[notNA]), chr) > > as a cleaner implementation of this step. > > > olap <- rl %in% table > > ## map to original indicies > > len <- seq_len(length(x)) > > idx <- unlist(split(len[notNA], chr), use.names=FALSE) > > len %in% idx[unlist(olap)] > > }) > > > > # load tags > > dirPath <- '/my/dir/' > > pattern <- 's_1_export.txt' > > aln <- readAligned(dirPath, pattern, 'SolexaExport') > > > > # load list of loci > > myRegions <- import('myregions.bed') > > # OPTIONAL name correction, ie from 'chr1' to 'chr1.fa' > > names(myRegions) <- paste(names(myRegions), '.fa', sep='') > > myRegionsRL <- ranges(myRegions) > > > > # segregate by in/out membership > > alnIn <- aln[aln %in% myRegionsRL] > > alnOut <- aln[!(aln %in% myRegionsRL)] > > > > > >> sessionInfo() > > R version 2.10.0 Under development (unstable) (2009-08-12 r49169) > > x86_64-unknown-linux-gnu > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] rtracklayer_1.5.12 RCurl_1.1-0 bitops_1.0-4.1 > ShortRead_1.3.27 > > [5] lattice_0.17-25 BSgenome_1.13.10 Biostrings_2.13.34 > IRanges_1.3.60 > > > > loaded via a namespace (and not attached): > > [1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0 > > > > > > Ivan Gregoretti, PhD > > National Institute of Diabetes and Digestive and Kidney Diseases > > National Institutes of Health > > 5 Memorial Dr, Building 5, Room 205. > > Bethesda, MD 20892. USA. > > Phone: 1-301-496-1592 > > Fax: 1-301-496-9878 > > > > > > > > On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<[email protected]> wrote: > >> Ivan Gregoretti wrote: > >>> Hello Steve, Martin and Everybody, > >>> > >>> First, thank you both for your help. > >>> > >>> As suggested by Steve, I look at > >>> > >>> > https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html > >>> > >>> by it didn't quite applied to my case. I need to iterate over > >>> chromosomes and I need to keep strand information. > >>> > >>> I'll try this new algorithm Martin just wrote. By the way, Martin, > >>> what kind of object should rangesList be? > >> an instance of the RangesList class > >> > >>> Say that I load my regions like this > >>> > >>> myRegions <- import('myregions.bed') > >>> > >>> Can I coerce that IRanges instance myRegions to the rangestList class? > How? > >> I think ranges(myRegions). > >> > >> I've forgotten what your sessionInfo() is; mine is > >> > >>> sessionInfo() > >> R version 2.10.0 Under development (unstable) (2009-09-01 r49517) > >> x86_64-unknown-linux-gnu > >> > >> locale: > >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] stats graphics grDevices utils datasets methods base > >> > >> other attached packages: > >> [1] ShortRead_1.3.29 lattice_0.17-25 BSgenome_1.13.10 > >> Biostrings_2.13.34 > >> [5] IRanges_1.3.65 rtracklayer_1.5.12 RCurl_1.1-0 > bitops_1.0-4.1 > >> > >> loaded via a namespace (and not attached): > >> [1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0 > >> > >> > >> Martin > >> > >>> Thank you! > >>> > >>> Ivan > >>> > >>> > >>> Ivan Gregoretti, PhD > >>> National Institute of Diabetes and Digestive and Kidney Diseases > >>> National Institutes of Health > >>> 5 Memorial Dr, Building 5, Room 205. > >>> Bethesda, MD 20892. USA. > >>> Phone: 1-301-496-1592 > >>> Fax: 1-301-496-9878 > >>> > >>> > >>> > >>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<[email protected]> > wrote: > >>>> Hi Ivan -- > >>>> > >>>> Steve Lianoglou wrote: > >>>>> Hi Ivan, > >>>>> > >>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote: > >>>>> > >>>>>> Hello Everybody, > >>>>>> > >>>>>> How do you subset an AlignedRead instance to keep (or reject) tags > >>>>>> that lay within a set of genomic regions? > >>>>>> > >>>>>> > >>>>>> Example > >>>>>> > >>>>>> Lets say that I have an AlignedRead instance called aln. > >>>>>> > >>>>>> Now let's say that I have a set of positions in BED style: > >>>>>> > >>>>>> (chromosome, start end) > >>>>>> ch1 1000000 1000050 > >>>>>> chrX 20000000 20100000 > >>>>>> ...(many more)... > >>>>>> > >>>>>> We can imagine that I have the BED set loaded as a data frame. > >>>>>> > >>>>>> Is it possible to pick from aln only the tags within (or outside) > the > >>>>>> features defined in the table described above? > >>>>> I think that you should convert your BED file to an IRanges object, > and > >>>>> use overlap with your ranges + your readAligned object to get what > your > >>>> I wrote this, as a trial implementation of %in%. Is that useful? (also > >>>> !(... %in% ...) though that would, e.g., include NAs. > >>>> > >>>> > >>>> setMethod("%in%", c("AlignedRead", "RangesList"), > >>>> function(x, table) > >>>> { > >>>> ## consider only sensible alignemnts > >>>> chr <- chromosome(x) > >>>> pos <- position(x) > >>>> wd <- width(x) > >>>> notNA <- !(is.na(chr) | is.na(pos) | is.na(wd)) > >>>> ## find overlap > >>>> chr <- chr[notNA] > >>>> rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr), > >>>> width=split(wd[notNA], chr))) > >>>> olap <- rl %in% table > >>>> ## map to original indicies > >>>> len <- seq_len(length(x)) > >>>> idx <- unlist(split(len[notNA], chr), use.names=FALSE) > >>>> len %in% idx[unlist(olap)] > >>>> }) > >>>> > >>>> One would then > >>>> > >>>> aln[aln %in% rangesList] > >>>> > >>>> Martin > >>>> > >>>>> after. See Martin's post about something like this in this thread: > >>>>> > >>>>> > https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html > >>>>> > >>>>> To get the reads *outside* of your ranges, maybe you can call the > >>>>> ``gaps`` on your bed/ranges and then do the same thing ... or > perhaps > >>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your IRanges > >>>>> converted alignedRead object (if necessary)). > >>>>> > >>>>> -steve > >>>>> > >>>>> -- > >>>>> Steve Lianoglou > >>>>> Graduate Student: Computational Systems Biology > >>>>> | Memorial Sloan-Kettering Cancer Center > >>>>> | Weill Medical College of Cornell University > >>>>> Contact Info: > >>>>> http://cbio.mskcc.org/~lianos/contact<http://cbio.mskcc.org/%7Elianos/contact> > >>>>> > >>>>> _______________________________________________ > >>>>> Bioc-sig-sequencing mailing list > >>>>> [email protected] > >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > >>> _______________________________________________ > >>> Bioc-sig-sequencing mailing list > >>> [email protected] > >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > >> > > > > _______________________________________________ > > Bioc-sig-sequencing mailing list > > [email protected] > > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > _______________________________________________ > Bioc-sig-sequencing mailing list > [email protected] > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
