Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews

2017-04-13 Thread Michael Lawrence
You could eventually point your student to MaskedXString and
oligonucleotideFrequency(). You can mask the repeats and then just run
the latter to count the N-mers. Comparing their original code to the
code based on existing high-level utilities might be a useful
exercise.

Michael


On Wed, Apr 12, 2017 at 8:24 PM, Pariksheet Nanda
 wrote:
> On Fri, Apr 7, 2017 at 1:13 AM, Hervé Pagès  wrote:
>>
>> This is the expected behavior.
>>
>> Some background: BSgenomeViews are list-like objects where the *list
>> elements* (i.e. the elements one extracts with [[) are the DNA
>> sequences from the views
> --snip--
>> The important difference is that with [[ I get a DNAString object
>> (the content of the view) and with [ I get a BSgenomeViews object
>> of length 1.
>
> Thank you, Hervé!
>
> I was failing to make the connection with the `[[` accessor.
>
>
> On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence 
> wrote:
>>
>> I'm curious as to why you are looping over the views in the first
>> place. Maybe we could arrive at a vectorized solution, which is often
>> but not always simpler and faster.
>
> Hi Michael!
>
> Broad background is I'm acculturating an undergraduate student to writing a
> bioconductor package and applying software engineering practices of version
> control, unit testing, documenting, dependency setup and validation in a
> different environment on our university HPC cluster, etc.  The student also
> came along to LibrePlanet to better understand the culture of software
> freedom :o)  The package goal is to use Biostrings to look for repeating
> DNA sequences of a fixed kmer size and subset to portions of the genome
> without repeats (an aligner can do this ofc, but the goal is to teach R and
> engineering practices).
>
> I appreciate your thoughtfulness for vectorizing the code to best use
> BSgenomeViews, but please don't spend more than 10 minutes as I have to
> balance changes to the code with the student's learning and coding "voice"
> and may not do proper justice for more of your effort.  My slowness to
> reply was getting the project further along to be more understandable.
> Here was the line which I've updating as Hervé suggested to use seq_along():
> https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185
> (I'm having a hard time thinking of how to summarizing a small example out
> of context).
> Although in that line ranges_hits() is only operating on single indices,
> ranges_hits() was written to process groups of indices to reduce
> multi-processor communication.  Generating such sets of indices would
> involve applying width() to the views inside mappable() to break in into
> chunks of, say, a million bases for matchPDict().  Again, I'm linking to
> the code for anything that stands out at you, but I will feel bad if you
> spend a lot of time on it.
>
>
>> H.
>
>> Michael
>
> Pariksheet
>
> [[alternative HTML version deleted]]
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews

2017-04-12 Thread Pariksheet Nanda
On Fri, Apr 7, 2017 at 1:13 AM, Hervé Pagès  wrote:
>
> This is the expected behavior.
>
> Some background: BSgenomeViews are list-like objects where the *list
> elements* (i.e. the elements one extracts with [[) are the DNA
> sequences from the views
--snip--
> The important difference is that with [[ I get a DNAString object
> (the content of the view) and with [ I get a BSgenomeViews object
> of length 1.

Thank you, Hervé!

I was failing to make the connection with the `[[` accessor.


On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence 
wrote:
>
> I'm curious as to why you are looping over the views in the first
> place. Maybe we could arrive at a vectorized solution, which is often
> but not always simpler and faster.

Hi Michael!

Broad background is I'm acculturating an undergraduate student to writing a
bioconductor package and applying software engineering practices of version
control, unit testing, documenting, dependency setup and validation in a
different environment on our university HPC cluster, etc.  The student also
came along to LibrePlanet to better understand the culture of software
freedom :o)  The package goal is to use Biostrings to look for repeating
DNA sequences of a fixed kmer size and subset to portions of the genome
without repeats (an aligner can do this ofc, but the goal is to teach R and
engineering practices).

I appreciate your thoughtfulness for vectorizing the code to best use
BSgenomeViews, but please don't spend more than 10 minutes as I have to
balance changes to the code with the student's learning and coding "voice"
and may not do proper justice for more of your effort.  My slowness to
reply was getting the project further along to be more understandable.
Here was the line which I've updating as Hervé suggested to use seq_along():
https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185
(I'm having a hard time thinking of how to summarizing a small example out
of context).
Although in that line ranges_hits() is only operating on single indices,
ranges_hits() was written to process groups of indices to reduce
multi-processor communication.  Generating such sets of indices would
involve applying width() to the views inside mappable() to break in into
chunks of, say, a million bases for matchPDict().  Again, I'm linking to
the code for anything that stands out at you, but I will feel bad if you
spend a lot of time on it.


> H.

> Michael

Pariksheet

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews

2017-04-06 Thread Michael Lawrence
I'm curious as to why you are looping over the views in the first
place. Maybe we could arrive at a vectorized solution, which is often
but not always simpler and faster.

Michael

On Wed, Apr 5, 2017 at 8:13 PM, Pariksheet Nanda
 wrote:
> Hi bioconductor devs,
>
> The BSgenomeViews class has been very useful in efficiently propagating
> metadata for running Biostring operations.  I noticed something unexpected
> when iterating over views - it seems to return the Biostrings object
> instead of a single length Views object, and thus loses the associated view
> metadata.  Is this intentional?  Below is some example code, the output and
> sessionInfo().  Yes, I also confirmed this happens in the development
> version of R / bioconductor 3.5.
>
> On a side note, for unit testing it's been difficult to mock a BSgenome
> object due to the link to physical files, and as a workaround I use a
> small, arbitrary BSgenome package.  Can one construct a BSgenome from its
> package bundled extdata?  The man page examples use packaged genomes.
>
> Code to reproduce the issue:
>
> --
> library(BSgenome)
> genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19")
> gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000"))
> views <- Views(genome, gr)
> views
> lapply(views, class)
> --
>
> Result:
>
> --
>> views
> BSgenomeViews object with 2 views and 0 metadata columns:
>   seqnames ranges strand   dna
>  
>   [1] chr1 [25001, 28000]  * [GCTTCAGCCT...TTATTTATTG]
>   [2] chr2 [30001, 31000]  * [GACCCTCCTG...AGCAGGTGGT]
>   ---
>   seqinfo: 93 sequences (1 circular) from hg19 genome
>> lapply(views, class)
> [[1]]
> [1] "DNAString"
> attr(,"package")
> [1] "Biostrings"
>
> [[2]]
> [1] "DNAString"
> attr(,"package")
> [1] "Biostrings"
>
>>
> --
>
> Tested against these configurations:
> 1) R 3.3.2 + BSgenome 1.42.0 (stable bioconductor 3.4)
> 2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel
> bioconductor 3.5)
>
> sessionInfo for configuration #2 above:
> --
>> sessionInfo()
> R Under development (unstable) (2017-04-05 r72488)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.2 LTS
>
> Matrix products: default
> BLAS:
> /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so
> LAPACK:
> /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
>  [9] LC_ADDRESS=C   LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4parallel  stats graphics  grDevices utils datasets
> [8] methods   base
>
> other attached packages:
>  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7
>  [3] rtracklayer_1.35.10   Biostrings_2.43.7
>  [5] XVector_0.15.2GenomicRanges_1.27.23
>  [7] GenomeInfoDb_1.11.10  IRanges_2.9.19
>  [9] S4Vectors_0.13.15 BiocGenerics_0.21.3
>
> loaded via a namespace (and not attached):
>  [1] zlibbioc_1.21.0GenomicAlignments_1.11.12
>  [3] BiocParallel_1.9.5 lattice_0.20-35
>  [5] tools_3.5.0SummarizedExperiment_1.5.7
>  [7] grid_3.5.0 Biobase_2.35.1
>  [9] matrixStats_0.52.1 Matrix_1.2-9
> [11] GenomeInfoDbData_0.99.0bitops_1.0-6
> [13] RCurl_1.95-4.8 DelayedArray_0.1.7
> [15] compiler_3.5.0 Rsamtools_1.27.15
> [17] XML_3.98-1.6
>> BiocInstaller::biocValid()
> [1] TRUE
>>
>
> ---
> Pariksheet Nanda
> PhD Candidate in Genetics and Genomics
> System Administrator, Storrs HPC Cluster
> University of Connecticut
>
> [[alternative HTML version deleted]]
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews

2017-04-06 Thread Hervé Pagès

Hi Pariksheet,

This is the expected behavior.

Some background: BSgenomeViews are list-like objects where the *list
elements* (i.e. the elements one extracts with [[) are the DNA
sequences from the views:

  > views
  BSgenomeViews object with 2 views and 0 metadata columns:
seqnames ranges strand   dna
   
[1] chr1 [25001, 28000]  * [GCTTCAGCCT...TTATTTATTG]
[2] chr2 [30001, 31000]  * [GACCCTCCTG...AGCAGGTGGT]
---
seqinfo: 93 sequences (1 circular) from hg19 genome

  > views[[1]]
3000-letter "DNAString" instance
  seq: 
GCTTCAGCCTGCACAGATAAGTAACAGA...CTGTCGTCTGAATTCCAAGCTGTTATTTATTG

  > views[[2]]
1000-letter "DNAString" instance
  seq: 
GACCCTCCTGTTGGCTTACAAGACTGCTATGT...TGCTGGGACAGTCATGGGCAAACCAGAGCAGGTGGT


When subsetting with [ I get:

  > views[1]
  BSgenomeViews object with 1 view and 0 metadata columns:
seqnames ranges strand   dna
   
[1] chr1 [25001, 28000]  * [GCTTCAGCCT...TTATTTATTG]
---
seqinfo: 93 sequences (1 circular) from hg19 genome

  > views[2]
  BSgenomeViews object with 1 view and 0 metadata columns:
seqnames ranges strand   dna
   
[1] chr2 [30001, 31000]  * [GACCCTCCTG...AGCAGGTGGT]
---
seqinfo: 93 sequences (1 circular) from hg19 genome

The important difference is that with [[ I get a DNAString object
(the content of the view) and with [ I get a BSgenomeViews object
of length 1.

The semantic of lapply() is to always use [[ internally to extract
elements. This is why 'lapply(views, class)' returns "DNAString".
If you want to operate on the length-one Views objects, you need to
do:

  lapply(seq_along(views), function(i) { do something on views[i] })

Hope this helps,
H.


On 04/05/2017 08:13 PM, Pariksheet Nanda wrote:

Hi bioconductor devs,

The BSgenomeViews class has been very useful in efficiently propagating
metadata for running Biostring operations.  I noticed something unexpected
when iterating over views - it seems to return the Biostrings object
instead of a single length Views object, and thus loses the associated view
metadata.  Is this intentional?  Below is some example code, the output and
sessionInfo().  Yes, I also confirmed this happens in the development
version of R / bioconductor 3.5.

On a side note, for unit testing it's been difficult to mock a BSgenome
object due to the link to physical files, and as a workaround I use a
small, arbitrary BSgenome package.  Can one construct a BSgenome from its
package bundled extdata?  The man page examples use packaged genomes.

Code to reproduce the issue:

--
library(BSgenome)
genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19")
gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000"))
views <- Views(genome, gr)
views
lapply(views, class)
--

Result:

--

views

BSgenomeViews object with 2 views and 0 metadata columns:
  seqnames ranges strand   dna
 
  [1] chr1 [25001, 28000]  * [GCTTCAGCCT...TTATTTATTG]
  [2] chr2 [30001, 31000]  * [GACCCTCCTG...AGCAGGTGGT]
  ---
  seqinfo: 93 sequences (1 circular) from hg19 genome

lapply(views, class)

[[1]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"

[[2]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"




--

Tested against these configurations:
1) R 3.3.2 + BSgenome 1.42.0 (stable bioconductor 3.4)
2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel
bioconductor 3.5)

sessionInfo for configuration #2 above:
--

sessionInfo()

R Under development (unstable) (2017-04-05 r72488)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS:
/share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so
LAPACK:
/share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets
[8] methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7
 [3] 

[Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews

2017-04-05 Thread Pariksheet Nanda
Hi bioconductor devs,

The BSgenomeViews class has been very useful in efficiently propagating
metadata for running Biostring operations.  I noticed something unexpected
when iterating over views - it seems to return the Biostrings object
instead of a single length Views object, and thus loses the associated view
metadata.  Is this intentional?  Below is some example code, the output and
sessionInfo().  Yes, I also confirmed this happens in the development
version of R / bioconductor 3.5.

On a side note, for unit testing it's been difficult to mock a BSgenome
object due to the link to physical files, and as a workaround I use a
small, arbitrary BSgenome package.  Can one construct a BSgenome from its
package bundled extdata?  The man page examples use packaged genomes.

Code to reproduce the issue:

--
library(BSgenome)
genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19")
gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000"))
views <- Views(genome, gr)
views
lapply(views, class)
--

Result:

--
> views
BSgenomeViews object with 2 views and 0 metadata columns:
  seqnames ranges strand   dna
 
  [1] chr1 [25001, 28000]  * [GCTTCAGCCT...TTATTTATTG]
  [2] chr2 [30001, 31000]  * [GACCCTCCTG...AGCAGGTGGT]
  ---
  seqinfo: 93 sequences (1 circular) from hg19 genome
> lapply(views, class)
[[1]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"

[[2]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"

>
--

Tested against these configurations:
1) R 3.3.2 + BSgenome 1.42.0 (stable bioconductor 3.4)
2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel
bioconductor 3.5)

sessionInfo for configuration #2 above:
--
> sessionInfo()
R Under development (unstable) (2017-04-05 r72488)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS:
/share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so
LAPACK:
/share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets
[8] methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7
 [3] rtracklayer_1.35.10   Biostrings_2.43.7
 [5] XVector_0.15.2GenomicRanges_1.27.23
 [7] GenomeInfoDb_1.11.10  IRanges_2.9.19
 [9] S4Vectors_0.13.15 BiocGenerics_0.21.3

loaded via a namespace (and not attached):
 [1] zlibbioc_1.21.0GenomicAlignments_1.11.12
 [3] BiocParallel_1.9.5 lattice_0.20-35
 [5] tools_3.5.0SummarizedExperiment_1.5.7
 [7] grid_3.5.0 Biobase_2.35.1
 [9] matrixStats_0.52.1 Matrix_1.2-9
[11] GenomeInfoDbData_0.99.0bitops_1.0-6
[13] RCurl_1.95-4.8 DelayedArray_0.1.7
[15] compiler_3.5.0 Rsamtools_1.27.15
[17] XML_3.98-1.6
> BiocInstaller::biocValid()
[1] TRUE
>

---
Pariksheet Nanda
PhD Candidate in Genetics and Genomics
System Administrator, Storrs HPC Cluster
University of Connecticut

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel