Re: [Bioc-sig-seq] AlignedRead and complex subsetting

Michael Lawrence Wed, 02 Sep 2009 12:05:18 -0700

Martin,
This %in% method looks great. Will this become part of ShortRead?

Michael


On Wed, Sep 2, 2009 at 11:19 AM, Martin Morgan <[email protected]> wrote:

> Ivan Gregoretti wrote:
> > Thank you very much everybody. It works great.
> >
> > It may be trivial for most list subscribers but for the record I'd
> > like to describe the whole mini-workflow. Notice that one command is
> > optional.
>
> Thanks Ivan for posting this, it's really helpful to see these work
> flows in action. One small change to my %in% method, for the record
>
> >
> > Ivan
> >
> > suppressMessages(library(ShortRead))
> > suppressMessages(library(rtracklayer))
> >
> > setMethod("%in%", c("AlignedRead", "RangesList"),
> >          function(x, table)
> > {
> >    ## consider only sensible alignemnts
> >    chr <- chromosome(x)
> >    pos <- position(x)
> >    wd <- width(x)
> >    notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
> >    ## find overlap
> >    chr <- chr[notNA]
> >    rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
> >                            width=split(wd[notNA], chr)))
>
> Patrick suggested
>
>     rl <- split(IRanges(start=pos[notNA], width=wd[notNA]), chr)
>
> as a cleaner implementation of this step.
>
> >    olap <- rl %in% table
> >    ## map to original indicies
> >    len <- seq_len(length(x))
> >    idx <- unlist(split(len[notNA], chr), use.names=FALSE)
> >    len %in% idx[unlist(olap)]
> > })
> >
> > # load tags
> > dirPath <- '/my/dir/'
> > pattern <- 's_1_export.txt'
> > aln <- readAligned(dirPath, pattern, 'SolexaExport')
> >
> > # load list of loci
> > myRegions <- import('myregions.bed')
> > # OPTIONAL name correction, ie from 'chr1' to 'chr1.fa'
> > names(myRegions) <- paste(names(myRegions), '.fa', sep='')
> > myRegionsRL <- ranges(myRegions)
> >
> > # segregate by in/out membership
> > alnIn <- aln[aln %in% myRegionsRL]
> > alnOut <- aln[!(aln %in% myRegionsRL)]
> >
> >
> >> sessionInfo()
> > R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
> > x86_64-unknown-linux-gnu
> >
> > locale:
> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > other attached packages:
> > [1] rtracklayer_1.5.12 RCurl_1.1-0        bitops_1.0-4.1
> ShortRead_1.3.27
> > [5] lattice_0.17-25    BSgenome_1.13.10   Biostrings_2.13.34
> IRanges_1.3.60
> >
> > loaded via a namespace (and not attached):
> > [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0
> >
> >
> > Ivan Gregoretti, PhD
> > National Institute of Diabetes and Digestive and Kidney Diseases
> > National Institutes of Health
> > 5 Memorial Dr, Building 5, Room 205.
> > Bethesda, MD 20892. USA.
> > Phone: 1-301-496-1592
> > Fax: 1-301-496-9878
> >
> >
> >
> > On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<[email protected]> wrote:
> >> Ivan Gregoretti wrote:
> >>> Hello Steve, Martin and Everybody,
> >>>
> >>> First, thank you both for your help.
> >>>
> >>> As suggested by Steve, I look at
> >>>
> >>>
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
> >>>
> >>> by it didn't quite applied to my case. I need to iterate over
> >>> chromosomes and I need to keep strand information.
> >>>
> >>> I'll try this new algorithm Martin just wrote. By the way, Martin,
> >>> what kind of object should rangesList be?
> >> an instance of the RangesList class
> >>
> >>> Say that I load my regions like this
> >>>
> >>> myRegions <- import('myregions.bed')
> >>>
> >>> Can I coerce that IRanges instance myRegions to the rangestList class?
> How?
> >> I think ranges(myRegions).
> >>
> >> I've forgotten what your sessionInfo() is; mine is
> >>
> >>> sessionInfo()
> >> R version 2.10.0 Under development (unstable) (2009-09-01 r49517)
> >> x86_64-unknown-linux-gnu
> >>
> >> locale:
> >>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> >>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>
> >> other attached packages:
> >> [1] ShortRead_1.3.29   lattice_0.17-25    BSgenome_1.13.10
> >> Biostrings_2.13.34
> >> [5] IRanges_1.3.65     rtracklayer_1.5.12 RCurl_1.1-0
>  bitops_1.0-4.1
> >>
> >> loaded via a namespace (and not attached):
> >> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0
> >>
> >>
> >> Martin
> >>
> >>> Thank you!
> >>>
> >>> Ivan
> >>>
> >>>
> >>> Ivan Gregoretti, PhD
> >>> National Institute of Diabetes and Digestive and Kidney Diseases
> >>> National Institutes of Health
> >>> 5 Memorial Dr, Building 5, Room 205.
> >>> Bethesda, MD 20892. USA.
> >>> Phone: 1-301-496-1592
> >>> Fax: 1-301-496-9878
> >>>
> >>>
> >>>
> >>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<[email protected]>
> wrote:
> >>>> Hi Ivan --
> >>>>
> >>>> Steve Lianoglou wrote:
> >>>>> Hi Ivan,
> >>>>>
> >>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote:
> >>>>>
> >>>>>> Hello Everybody,
> >>>>>>
> >>>>>> How do you subset an AlignedRead instance to keep (or reject) tags
> >>>>>> that lay within a set of genomic regions?
> >>>>>>
> >>>>>>
> >>>>>> Example
> >>>>>>
> >>>>>> Lets say that I have an AlignedRead instance called aln.
> >>>>>>
> >>>>>> Now let's say that I have a set of positions in BED style:
> >>>>>>
> >>>>>> (chromosome, start end)
> >>>>>> ch1 1000000 1000050
> >>>>>> chrX 20000000 20100000
> >>>>>> ...(many more)...
> >>>>>>
> >>>>>> We can imagine that I have the BED set loaded as a data frame.
> >>>>>>
> >>>>>> Is it possible to pick from aln only the tags within (or outside)
> the
> >>>>>> features defined in the table described above?
> >>>>> I think that you should convert your BED file to an IRanges object,
> and
> >>>>> use overlap with your ranges + your readAligned object to get what
> your
> >>>> I wrote this, as a trial implementation of %in%. Is that useful? (also
> >>>> !(... %in% ...) though that would, e.g., include NAs.
> >>>>
> >>>>
> >>>> setMethod("%in%", c("AlignedRead", "RangesList"),
> >>>>          function(x, table)
> >>>> {
> >>>>    ## consider only sensible alignemnts
> >>>>    chr <- chromosome(x)
> >>>>    pos <- position(x)
> >>>>    wd <- width(x)
> >>>>    notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
> >>>>    ## find overlap
> >>>>    chr <- chr[notNA]
> >>>>    rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
> >>>>                            width=split(wd[notNA], chr)))
> >>>>    olap <- rl %in% table
> >>>>    ## map to original indicies
> >>>>    len <- seq_len(length(x))
> >>>>    idx <- unlist(split(len[notNA], chr), use.names=FALSE)
> >>>>    len %in% idx[unlist(olap)]
> >>>> })
> >>>>
> >>>> One would then
> >>>>
> >>>>  aln[aln %in% rangesList]
> >>>>
> >>>> Martin
> >>>>
> >>>>> after. See Martin's post about something like this in this thread:
> >>>>>
> >>>>>
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
> >>>>>
> >>>>> To get the reads *outside* of your ranges, maybe you can call the
> >>>>> ``gaps`` on your bed/ranges and then do the same thing ...  or
> perhaps
> >>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your IRanges
> >>>>> converted alignedRead object (if necessary)).
> >>>>>
> >>>>> -steve
> >>>>>
> >>>>> --
> >>>>> Steve Lianoglou
> >>>>> Graduate Student: Computational Systems Biology
> >>>>>   |  Memorial Sloan-Kettering Cancer Center
> >>>>>   |  Weill Medical College of Cornell University
> >>>>> Contact Info: 
> >>>>> http://cbio.mskcc.org/~lianos/contact<http://cbio.mskcc.org/%7Elianos/contact>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioc-sig-sequencing mailing list
> >>>>> [email protected]
> >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>> _______________________________________________
> >>> Bioc-sig-sequencing mailing list
> >>> [email protected]
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >
> > _______________________________________________
> > Bioc-sig-sequencing mailing list
> > [email protected]
> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] AlignedRead and complex subsetting

Reply via email to