Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews
You could eventually point your student to MaskedXString and oligonucleotideFrequency(). You can mask the repeats and then just run the latter to count the N-mers. Comparing their original code to the code based on existing high-level utilities might be a useful exercise. Michael On Wed, Apr 12, 2017 at 8:24 PM, Pariksheet Nandawrote: > On Fri, Apr 7, 2017 at 1:13 AM, Hervé Pagès wrote: >> >> This is the expected behavior. >> >> Some background: BSgenomeViews are list-like objects where the *list >> elements* (i.e. the elements one extracts with [[) are the DNA >> sequences from the views > --snip-- >> The important difference is that with [[ I get a DNAString object >> (the content of the view) and with [ I get a BSgenomeViews object >> of length 1. > > Thank you, Hervé! > > I was failing to make the connection with the `[[` accessor. > > > On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence > wrote: >> >> I'm curious as to why you are looping over the views in the first >> place. Maybe we could arrive at a vectorized solution, which is often >> but not always simpler and faster. > > Hi Michael! > > Broad background is I'm acculturating an undergraduate student to writing a > bioconductor package and applying software engineering practices of version > control, unit testing, documenting, dependency setup and validation in a > different environment on our university HPC cluster, etc. The student also > came along to LibrePlanet to better understand the culture of software > freedom :o) The package goal is to use Biostrings to look for repeating > DNA sequences of a fixed kmer size and subset to portions of the genome > without repeats (an aligner can do this ofc, but the goal is to teach R and > engineering practices). > > I appreciate your thoughtfulness for vectorizing the code to best use > BSgenomeViews, but please don't spend more than 10 minutes as I have to > balance changes to the code with the student's learning and coding "voice" > and may not do proper justice for more of your effort. My slowness to > reply was getting the project further along to be more understandable. > Here was the line which I've updating as Hervé suggested to use seq_along(): > https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185 > (I'm having a hard time thinking of how to summarizing a small example out > of context). > Although in that line ranges_hits() is only operating on single indices, > ranges_hits() was written to process groups of indices to reduce > multi-processor communication. Generating such sets of indices would > involve applying width() to the views inside mappable() to break in into > chunks of, say, a million bases for matchPDict(). Again, I'm linking to > the code for anything that stands out at you, but I will feel bad if you > spend a lot of time on it. > > >> H. > >> Michael > > Pariksheet > > [[alternative HTML version deleted]] > > ___ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews
On Fri, Apr 7, 2017 at 1:13 AM, Hervé Pagèswrote: > > This is the expected behavior. > > Some background: BSgenomeViews are list-like objects where the *list > elements* (i.e. the elements one extracts with [[) are the DNA > sequences from the views --snip-- > The important difference is that with [[ I get a DNAString object > (the content of the view) and with [ I get a BSgenomeViews object > of length 1. Thank you, Hervé! I was failing to make the connection with the `[[` accessor. On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence wrote: > > I'm curious as to why you are looping over the views in the first > place. Maybe we could arrive at a vectorized solution, which is often > but not always simpler and faster. Hi Michael! Broad background is I'm acculturating an undergraduate student to writing a bioconductor package and applying software engineering practices of version control, unit testing, documenting, dependency setup and validation in a different environment on our university HPC cluster, etc. The student also came along to LibrePlanet to better understand the culture of software freedom :o) The package goal is to use Biostrings to look for repeating DNA sequences of a fixed kmer size and subset to portions of the genome without repeats (an aligner can do this ofc, but the goal is to teach R and engineering practices). I appreciate your thoughtfulness for vectorizing the code to best use BSgenomeViews, but please don't spend more than 10 minutes as I have to balance changes to the code with the student's learning and coding "voice" and may not do proper justice for more of your effort. My slowness to reply was getting the project further along to be more understandable. Here was the line which I've updating as Hervé suggested to use seq_along(): https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185 (I'm having a hard time thinking of how to summarizing a small example out of context). Although in that line ranges_hits() is only operating on single indices, ranges_hits() was written to process groups of indices to reduce multi-processor communication. Generating such sets of indices would involve applying width() to the views inside mappable() to break in into chunks of, say, a million bases for matchPDict(). Again, I'm linking to the code for anything that stands out at you, but I will feel bad if you spend a lot of time on it. > H. > Michael Pariksheet [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews
I'm curious as to why you are looping over the views in the first place. Maybe we could arrive at a vectorized solution, which is often but not always simpler and faster. Michael On Wed, Apr 5, 2017 at 8:13 PM, Pariksheet Nandawrote: > Hi bioconductor devs, > > The BSgenomeViews class has been very useful in efficiently propagating > metadata for running Biostring operations. I noticed something unexpected > when iterating over views - it seems to return the Biostrings object > instead of a single length Views object, and thus loses the associated view > metadata. Is this intentional? Below is some example code, the output and > sessionInfo(). Yes, I also confirmed this happens in the development > version of R / bioconductor 3.5. > > On a side note, for unit testing it's been difficult to mock a BSgenome > object due to the link to physical files, and as a workaround I use a > small, arbitrary BSgenome package. Can one construct a BSgenome from its > package bundled extdata? The man page examples use packaged genomes. > > Code to reproduce the issue: > > -- > library(BSgenome) > genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19") > gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000")) > views <- Views(genome, gr) > views > lapply(views, class) > -- > > Result: > > -- >> views > BSgenomeViews object with 2 views and 0 metadata columns: > seqnames ranges strand dna > > [1] chr1 [25001, 28000] * [GCTTCAGCCT...TTATTTATTG] > [2] chr2 [30001, 31000] * [GACCCTCCTG...AGCAGGTGGT] > --- > seqinfo: 93 sequences (1 circular) from hg19 genome >> lapply(views, class) > [[1]] > [1] "DNAString" > attr(,"package") > [1] "Biostrings" > > [[2]] > [1] "DNAString" > attr(,"package") > [1] "Biostrings" > >> > -- > > Tested against these configurations: > 1) R 3.3.2 + BSgenome 1.42.0 (stable bioconductor 3.4) > 2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel > bioconductor 3.5) > > sessionInfo for configuration #2 above: > -- >> sessionInfo() > R Under development (unstable) (2017-04-05 r72488) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 16.04.2 LTS > > Matrix products: default > BLAS: > /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so > LAPACK: > /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats4parallel stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7 > [3] rtracklayer_1.35.10 Biostrings_2.43.7 > [5] XVector_0.15.2GenomicRanges_1.27.23 > [7] GenomeInfoDb_1.11.10 IRanges_2.9.19 > [9] S4Vectors_0.13.15 BiocGenerics_0.21.3 > > loaded via a namespace (and not attached): > [1] zlibbioc_1.21.0GenomicAlignments_1.11.12 > [3] BiocParallel_1.9.5 lattice_0.20-35 > [5] tools_3.5.0SummarizedExperiment_1.5.7 > [7] grid_3.5.0 Biobase_2.35.1 > [9] matrixStats_0.52.1 Matrix_1.2-9 > [11] GenomeInfoDbData_0.99.0bitops_1.0-6 > [13] RCurl_1.95-4.8 DelayedArray_0.1.7 > [15] compiler_3.5.0 Rsamtools_1.27.15 > [17] XML_3.98-1.6 >> BiocInstaller::biocValid() > [1] TRUE >> > > --- > Pariksheet Nanda > PhD Candidate in Genetics and Genomics > System Administrator, Storrs HPC Cluster > University of Connecticut > > [[alternative HTML version deleted]] > > ___ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews
Hi Pariksheet, This is the expected behavior. Some background: BSgenomeViews are list-like objects where the *list elements* (i.e. the elements one extracts with [[) are the DNA sequences from the views: > views BSgenomeViews object with 2 views and 0 metadata columns: seqnames ranges strand dna [1] chr1 [25001, 28000] * [GCTTCAGCCT...TTATTTATTG] [2] chr2 [30001, 31000] * [GACCCTCCTG...AGCAGGTGGT] --- seqinfo: 93 sequences (1 circular) from hg19 genome > views[[1]] 3000-letter "DNAString" instance seq: GCTTCAGCCTGCACAGATAAGTAACAGA...CTGTCGTCTGAATTCCAAGCTGTTATTTATTG > views[[2]] 1000-letter "DNAString" instance seq: GACCCTCCTGTTGGCTTACAAGACTGCTATGT...TGCTGGGACAGTCATGGGCAAACCAGAGCAGGTGGT When subsetting with [ I get: > views[1] BSgenomeViews object with 1 view and 0 metadata columns: seqnames ranges strand dna [1] chr1 [25001, 28000] * [GCTTCAGCCT...TTATTTATTG] --- seqinfo: 93 sequences (1 circular) from hg19 genome > views[2] BSgenomeViews object with 1 view and 0 metadata columns: seqnames ranges strand dna [1] chr2 [30001, 31000] * [GACCCTCCTG...AGCAGGTGGT] --- seqinfo: 93 sequences (1 circular) from hg19 genome The important difference is that with [[ I get a DNAString object (the content of the view) and with [ I get a BSgenomeViews object of length 1. The semantic of lapply() is to always use [[ internally to extract elements. This is why 'lapply(views, class)' returns "DNAString". If you want to operate on the length-one Views objects, you need to do: lapply(seq_along(views), function(i) { do something on views[i] }) Hope this helps, H. On 04/05/2017 08:13 PM, Pariksheet Nanda wrote: Hi bioconductor devs, The BSgenomeViews class has been very useful in efficiently propagating metadata for running Biostring operations. I noticed something unexpected when iterating over views - it seems to return the Biostrings object instead of a single length Views object, and thus loses the associated view metadata. Is this intentional? Below is some example code, the output and sessionInfo(). Yes, I also confirmed this happens in the development version of R / bioconductor 3.5. On a side note, for unit testing it's been difficult to mock a BSgenome object due to the link to physical files, and as a workaround I use a small, arbitrary BSgenome package. Can one construct a BSgenome from its package bundled extdata? The man page examples use packaged genomes. Code to reproduce the issue: -- library(BSgenome) genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19") gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000")) views <- Views(genome, gr) views lapply(views, class) -- Result: -- views BSgenomeViews object with 2 views and 0 metadata columns: seqnames ranges strand dna [1] chr1 [25001, 28000] * [GCTTCAGCCT...TTATTTATTG] [2] chr2 [30001, 31000] * [GACCCTCCTG...AGCAGGTGGT] --- seqinfo: 93 sequences (1 circular) from hg19 genome lapply(views, class) [[1]] [1] "DNAString" attr(,"package") [1] "Biostrings" [[2]] [1] "DNAString" attr(,"package") [1] "Biostrings" -- Tested against these configurations: 1) R 3.3.2 + BSgenome 1.42.0 (stable bioconductor 3.4) 2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel bioconductor 3.5) sessionInfo for configuration #2 above: -- sessionInfo() R Under development (unstable) (2017-04-05 r72488) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS Matrix products: default BLAS: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so LAPACK: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7 [3]
[Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews
Hi bioconductor devs, The BSgenomeViews class has been very useful in efficiently propagating metadata for running Biostring operations. I noticed something unexpected when iterating over views - it seems to return the Biostrings object instead of a single length Views object, and thus loses the associated view metadata. Is this intentional? Below is some example code, the output and sessionInfo(). Yes, I also confirmed this happens in the development version of R / bioconductor 3.5. On a side note, for unit testing it's been difficult to mock a BSgenome object due to the link to physical files, and as a workaround I use a small, arbitrary BSgenome package. Can one construct a BSgenome from its package bundled extdata? The man page examples use packaged genomes. Code to reproduce the issue: -- library(BSgenome) genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19") gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000")) views <- Views(genome, gr) views lapply(views, class) -- Result: -- > views BSgenomeViews object with 2 views and 0 metadata columns: seqnames ranges strand dna [1] chr1 [25001, 28000] * [GCTTCAGCCT...TTATTTATTG] [2] chr2 [30001, 31000] * [GACCCTCCTG...AGCAGGTGGT] --- seqinfo: 93 sequences (1 circular) from hg19 genome > lapply(views, class) [[1]] [1] "DNAString" attr(,"package") [1] "Biostrings" [[2]] [1] "DNAString" attr(,"package") [1] "Biostrings" > -- Tested against these configurations: 1) R 3.3.2 + BSgenome 1.42.0 (stable bioconductor 3.4) 2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel bioconductor 3.5) sessionInfo for configuration #2 above: -- > sessionInfo() R Under development (unstable) (2017-04-05 r72488) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS Matrix products: default BLAS: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so LAPACK: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7 [3] rtracklayer_1.35.10 Biostrings_2.43.7 [5] XVector_0.15.2GenomicRanges_1.27.23 [7] GenomeInfoDb_1.11.10 IRanges_2.9.19 [9] S4Vectors_0.13.15 BiocGenerics_0.21.3 loaded via a namespace (and not attached): [1] zlibbioc_1.21.0GenomicAlignments_1.11.12 [3] BiocParallel_1.9.5 lattice_0.20-35 [5] tools_3.5.0SummarizedExperiment_1.5.7 [7] grid_3.5.0 Biobase_2.35.1 [9] matrixStats_0.52.1 Matrix_1.2-9 [11] GenomeInfoDbData_0.99.0bitops_1.0-6 [13] RCurl_1.95-4.8 DelayedArray_0.1.7 [15] compiler_3.5.0 Rsamtools_1.27.15 [17] XML_3.98-1.6 > BiocInstaller::biocValid() [1] TRUE > --- Pariksheet Nanda PhD Candidate in Genetics and Genomics System Administrator, Storrs HPC Cluster University of Connecticut [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel