Re: [Bioc-devel] C library or C package API for regular expressions

2016-01-26 Thread Jiří Hon

Hi Dan,

nice to hear, I didn't notice. The only problem could be missing header
files, but bundling them would solve it, I hope.


Jirka

On 25.1.2016 at 23:38, Dan Tenenbaum wrote:

R requires PCRE to build, therefore perhaps it is available for use within 
packages?
Dan
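
A quick way to see which PCRE build a given R installation is linked against is sketched below. It only inspects the running R session; it says nothing about whether the PCRE headers are visible to package code, which is a separate question. The sketch assumes a reasonably recent R in which extSoftVersion() is available in base.

```r
# Report the PCRE version that this R session was built against.
ext <- extSoftVersion()
pcre_version <- ext[["PCRE"]]
cat("PCRE:", pcre_version, "\n")

# perl = TRUE routes the match through PCRE instead of the default engine.
uses_pcre <- grepl("A(?=T)", "CCAT", perl = TRUE)
```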


- Original Message -

From: "Jiří Hon" 
To: "bioc-devel" 
Sent: Saturday, January 23, 2016 1:56:52 AM
Subject: [Bioc-devel] C library or C package API for regular expressions



Dear package developers,

I would like to ask you for advice. Please, what is the most seamless
way to use regular expressions in C/C++ code of R/Bioconductor package?
Is it allowed to bundle some C/C++ library for that (like PCRE or
Boost.Regex)? Or is there existing C API of some package I can depend on
and import?

Thank you a lot for your attention and please have a nice day :)

Jiri Hon

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



Re: [Bioc-devel] C library or C package API for regular expressions

2016-01-26 Thread Jiří Hon


On 25.1.2016 at 23:34, Hervé Pagès wrote:

Hi Jiri,

On 01/25/2016 09:40 AM, Jiří Hon wrote:

Hi Martin

On 25.1.2016 at 13:08, Morgan, Martin wrote:

There is discussion at

http://stackoverflow.com/questions/23556205/using-boost-regex-with-rcpp

pointing to

http://gallery.rcpp.org/articles/boost-regular-expressions/

There is a Bioconductor example that bundles the regex library in
flowCore/src/

https://github.com/Bioconductor-mirror/flowCore

A second example is in the mzR package.


Thank you for pointing me to the flowCore and mzR packages, these
examples are really helpful.


A real question is, do you really need this functionality at the C
level?


I think it's unavoidable in my case for performance reasons. I'm trying
to detect all possible overlapping motifs in DNA composed of
elements matching some regular expression.


I think Martin's question is: are you sure you need this at the C
level? What makes you think that calling a regex engine from C will
perform better than calling it from R?

Note that using a regex for finding motifs in a DNA sequence has 2
fundamental problems:

(1) It doesn't always find all the matches. For example if 2 matches
 are overlapping, it only returns the 1st of the 2 matches:

   > library(Biostrings)

   > matchPattern("ATAAT", "CCATAATAATGATAAT")
     Views on a 16-letter BString subject
   subject: CCATAATAATGATAAT
   views:
       start end width
   [1]     3   7     5 [ATAAT]
   [2]     6  10     5 [ATAAT]
   [3]    12  16     5 [ATAAT]

   > gregexpr("ATAAT", "CCATAATAATGATAAT")[[1]]
   [1]  3 12
   attr(,"match.length")
   [1] 5 5
   attr(,"useBytes")
   [1] TRUE

(2) It's inefficient on a long DNA sequence:

   > library(BSgenome.Hsapiens.UCSC.hg19)
   > chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
   > system.time(m1 <- matchPattern("ATAAT", chr1))
      user  system elapsed
     0.946   0.000   0.940
   > chr1c <- as.character(chr1)
   > system.time(m2 <- gregexpr("ATAAT", chr1c)[[1]])
      user  system elapsed
     4.109   0.000   4.109

This was actually the very first motivating use case for developing
the Biostrings package. It's important to realize that using the regex
engine at the C level wouldn't make much difference.
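
For point (1), one possible workaround with a plain regex engine is a zero-width lookahead, which reports every start position even when matches overlap. The sketch below uses only base R with perl = TRUE on the toy subject from the example and cross-checks the result against a naive sliding-window comparison. It is an illustration, not a replacement for matchPattern(), and it does nothing about the efficiency issue in point (2).

```r
subject <- "CCATAATAATGATAAT"
motif <- "ATAAT"

# Default behaviour: each match is consumed, so overlapping hits are skipped.
plain_starts <- as.integer(gregexpr(motif, subject)[[1]])

# Zero-width lookahead: nothing is consumed, so every start is reported.
lookahead <- paste0("(?=", motif, ")")
overlap_starts <- as.integer(gregexpr(lookahead, subject, perl = TRUE)[[1]])

# Cross-check with a sliding window over all substrings of the motif width.
k <- nchar(motif)
window_starts <- seq_len(nchar(subject) - k + 1L)
windows <- substring(subject, window_starts, window_starts + k - 1L)
naive_starts <- window_starts[windows == motif]

print(plain_starts)
print(overlap_starts)
```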

matchPattern() and family don't support regex though. However when
working with DNA motifs, the motifs can often be described with IUPAC
ambiguity letters. For example, instead of describing the motif
with the regular expression AT(A|G|T)T(A|C)GG.G, you can describe it with
ATDTMGGNG. Then you can use matchPattern() on this pattern and with
fixed=FALSE to find all the matches. Additionally you can use the
'max.mismatch' and/or 'with.indels' arguments to allow a small number
of mismatches and/or indels. See ?matchPattern for more information
and examples.
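
The correspondence between IUPAC ambiguity letters and regular expressions can be sketched in base R. The helper iupac_to_regex() below is hypothetical (it is not part of Biostrings); it maps each standard IUPAC nucleotide code to a character class, so that ATDTMGGNG becomes a regex that accepts the same sequences.

```r
# Standard IUPAC nucleotide codes mapped to regex character classes.
iupac_classes <- c(
  A = "A", C = "C", G = "G", T = "T",
  R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]", K = "[GT]", M = "[AC]",
  B = "[CGT]", D = "[AGT]", H = "[ACT]", V = "[ACG]", N = "[ACGT]"
)

# Hypothetical helper: translate an IUPAC pattern into an equivalent regex.
iupac_to_regex <- function(pattern) {
  codes <- strsplit(toupper(pattern), "", fixed = TRUE)[[1]]
  unknown <- setdiff(codes, names(iupac_classes))
  if (length(unknown) > 0L) {
    stop("not an IUPAC nucleotide code: ", paste(unknown, collapse = ", "))
  }
  paste(iupac_classes[codes], collapse = "")
}

motif_regex <- iupac_to_regex("ATDTMGGNG")
hit <- grepl(motif_regex, "CCATGTAGGCGTT")
```

Going the other way (regex to IUPAC) is only possible for fixed-width patterns built from single-letter alternatives, which is the limitation described in the next paragraph.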

Of course this has its own limitations: you can only do this for a
subclass of regular expressions. For example regular expressions that
use * or + to allow for repetitions cannot be replaced by a sequence
with just IUPAC codes, so the string matching tools in Biostrings
cannot be used in that case.

Cheers,
H.


Thank you Hervé for your tips. I'm aware of the limited power of regular
expressions, but using matchPattern doesn't solve my problem. The
reason for using a regexp library at the C level is that I plan to call it
millions of times (on short DNA parts), and I suppose it would be better to
avoid the calling and for-loop overhead. Therefore I wanted to get an
idea of which regex C APIs I can use, or whether a library is usually bundled.


Jirka






A secondary question is that if several packages are using this
functionality, then perhaps the library could be bundled separately
and made available just once; zlibbioc does something like this (sort
of; zlib is only needed on Windows). The flowCore and mzR maintainers
(cc'd) might be a valuable resource in this regard.


Efficient regexp algorithms seem useful to me for solving many
bioinformatic problems. So it would be natural to have a package with a C
API to the most efficient regexp libraries.


Martin

From: Bioc-devel on behalf of Jiří Hon
Sent: Monday, January 25, 2016 4:33 AM
To: Charles Determan
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] C library or C package API for regular expressions

Hi Charles,

thank you a lot for your helpful hint. There is still a thing that
I'm not sure about: the Boost manual says that Boost.Regex is not
header-only [1]. Since the BH package contains only headers, I will have to
bundle the Boost.Regex library into the package code anyway. Am I
right?

Jiri

[1]
http://www.boost.org/doc/libs/1_60_0/more/getting_started/unix-variants.html#header-only-libraries







On 23.1.2016 at 13:35, Charles Determan wrote:

Hi Jiri,

I believe you can use the BH package. It contains most of the Boost
headers.


Regards, Charles

On Saturday, January 23, 2016, Jiří Hon

[Bioc-devel] Announcing the EnrichmentBrowser 2.0

2016-01-26 Thread Ludwig Geistlinger
Dear Bioconductors,

I am delighted to announce a major re-release of the EnrichmentBrowser
package in line with its recent publication:

Geistlinger L, Csaba G, Zimmer R.
Bioconductor's EnrichmentBrowser: seamless navigation through combined
results of set- & network-based enrichment analysis.
BMC Bioinformatics, 17:45, Jan 2016.
http://doi.org/10.1186/s12859-016-0884-1


The EnrichmentBrowser is a meta-package implementing an analysis pipeline
for high-throughput gene expression data as measured with microarrays and
RNA-seq.
Functionality includes data preparation, preprocessing, differential
expression analysis, set- and network-based enrichment analysis,
combination as well as visualization and exploration of results.
In a workflow-like manner, the package brings together a selection of
the finest Bioc packages, which have been shown to work particularly well in
practice for the respective purposes.
Additional features of the package are the adaptation of enrichment methods
using sample permutation for RNA-seq read count data, an improved
implementation of the network-based enrichment method GGEA (Geistlinger et
al., Bioinformatics, ISMB/ECCB 2011), and novel ways of combining and
exploring results across methods.

Comments and suggestions to further improve the EnrichmentBrowser are
highly appreciated.

In addition, I would be glad if a short announcement of the paper could
also be posted on the Bioc Twitter channel in order to make users aware.

Thx & Best,
Ludwig



-- 
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik
Institut für Informatik
Ludwig-Maximilians-Universität München
Amalienstrasse 17, 2. Stock, Büro A201
80333 München

Tel.: 089-2180-4067
eMail: ludwig.geistlin...@bio.ifi.lmu.de



Re: [Bioc-devel] seqlevelsStyle() warning-message by message-warning, and NCBI/Ensembl seq styles

2016-01-26 Thread Robert Castelo

hi Hervé,

that's great, thanks for implementing these bits so quickly.

robert.

On 01/26/2016 01:19 AM, Hervé Pagès wrote:

Hi Robert,

I've made the following changes to the seqlevelsStyle() getter and
setter:
1) No more warning when the getter returns more than 1 compatible
style.
2) When more than 1 style is supplied, the setter uses the 1st one
only with a warning.

That should address the issues you reported below. I'm sure the behavior
of the seqlevelsStyle() setter could be refined when more than 1 style
is supplied but the new behavior should get us going for now.
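
The setter behaviour described in 2) can be sketched in a few lines of base R. The helper name pick_first_style() is hypothetical and this is not the GenomeInfoDb implementation; it only illustrates "use the 1st style, with a warning".

```r
# Hypothetical helper: keep only the first supplied style, warning if
# more than one was given.
pick_first_style <- function(value) {
  if (length(value) > 1L) {
    warning("more than one seqlevels style supplied, using the 1st one only")
    value <- value[1L]
  }
  value
}

styles <- c("NCBI", "Ensembl")

# Callers that expect the ambiguity can silence it with suppressWarnings().
chosen <- suppressWarnings(pick_first_style(styles))
```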

These changes are in GenomeInfoDb 1.6.3 (release) and 1.7.5 (devel).

Note that the Ensembl mappings were contributed by Jo last week (thanks
Jo) and they're indeed the same as the NCBI mappings for Human but they
differ for all the other organisms supported by GenomeInfoDb at the
moment. Anyway, generally speaking, it sounds like the user should be
able to do seqlevelsStyle(x) <- "Ensembl" independently of whether this
will result in seqlevels that are the same as if s/he had done
seqlevelsStyle(x) <- "NCBI".

Cheers,
H.

On 01/25/2016 02:39 AM, Robert Castelo wrote:

hi,

i would like to ask if current line #142 of
GenomeInfoDb/R/seqlevelsStyle.R:

message("warning! Multiple seqlevels styles found.")

could be replaced by

warning("Multiple seqlevels styles found.")

since, after all, the message is a warning.

the reason for my request is that a recent update on GenomeInfoDb added
the 'Ensembl' sequence style and this executes the previous line when i
try to figure out the sequence style of a BAM file produced from human
sequence data and GATK. for instance:

path2bam <- file.path(system.file("extdata",
package="VariantFiltering"), "NA12878.subset.bam")

hdr <- scanBamHeader(path2bam)
names(hdr[[1]]$targets)
[1] "1" "2" "3" "4" "5"
[6] "6" "7" "8" "9" "10"
[11] "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20"
[21] "21" "22" "X" "Y" "MT"
[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"
[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"
[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"
[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"
[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"
[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"
[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"
[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"
[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"
[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"
[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"
[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1" "NC_007605"
[86] "hs37d5"

seqlevelsStyle(names(hdr[[1]]$targets))
warning! Multiple seqlevels styles found.
[1] "NCBI" "Ensembl"

this is all fine in interactive mode so that the user is aware that
he/she may have to take action on this fact. however, at a specific
place of my package VariantFiltering i'd like to take action without
telling anything to the user. for that purpose i would like to use
suppressWarnings() but it does not currently work with the message() call.
i could use suppressMessages() but would prefer not to.
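
The difference between the two condition types can be shown with base R alone. The two getter functions below are hypothetical stand-ins for the line in seqlevelsStyle.R, and escaped_conditions() is an illustrative helper that records which conditions still reach the caller.

```r
# Stand-in that signals a message, as the current code does.
getter_with_message <- function() {
  message("warning! Multiple seqlevels styles found.")
  c("NCBI", "Ensembl")
}

# Stand-in that signals a proper warning, as proposed.
getter_with_warning <- function() {
  warning("Multiple seqlevels styles found.")
  c("NCBI", "Ensembl")
}

# Record which condition classes escape from `expr`, then muffle them.
escaped_conditions <- function(expr) {
  seen <- character(0)
  withCallingHandlers(
    expr,
    message = function(m) {
      seen <<- c(seen, "message")
      invokeRestart("muffleMessage")
    },
    warning = function(w) {
      seen <<- c(seen, "warning")
      invokeRestart("muffleWarning")
    }
  )
  seen
}

# suppressWarnings() does not touch a message(), but silences a warning().
leak_from_message <- escaped_conditions(suppressWarnings(getter_with_message()))
leak_from_warning <- escaped_conditions(suppressWarnings(getter_with_warning()))
```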

on a side note, both NCBI and Ensembl sequence style seem to be
identical:

genomeStyles()$Homo_sapiens
   circular  auto   sex NCBI  UCSC dbSNP Ensembl
1     FALSE  TRUE FALSE    1  chr1   ch1       1
2     FALSE  TRUE FALSE    2  chr2   ch2       2
3     FALSE  TRUE FALSE    3  chr3   ch3       3
4     FALSE  TRUE FALSE    4  chr4   ch4       4
5     FALSE  TRUE FALSE    5  chr5   ch5       5
6     FALSE  TRUE FALSE    6  chr6   ch6       6
7     FALSE  TRUE FALSE    7  chr7   ch7       7
8     FALSE  TRUE FALSE    8  chr8   ch8       8
9     FALSE  TRUE FALSE    9  chr9   ch9       9
10    FALSE  TRUE FALSE   10 chr10  ch10      10
11    FALSE  TRUE FALSE   11 chr11  ch11      11
12    FALSE  TRUE FALSE   12 chr12  ch12      12
13    FALSE  TRUE FALSE   13 chr13  ch13      13
14    FALSE  TRUE FALSE   14 chr14  ch14      14
15    FALSE  TRUE FALSE   15 chr15  ch15      15
16    FALSE  TRUE FALSE   16 chr16  ch16      16
17    FALSE  TRUE FALSE   17 chr17  ch17      17
18    FALSE  TRUE FALSE   18 chr18  ch18      18
19    FALSE  TRUE FALSE   19 chr19  ch19      19
20    FALSE  TRUE FALSE   20 chr20  ch20      20
21    FALSE  TRUE FALSE   21 chr21  ch21      21
22    FALSE  TRUE FALSE   22 chr22  ch22      22
23    FALSE FALSE  TRUE    X  chrX   chX       X
24    FALSE FALSE  TRUE    Y  chrY   chY       Y
25     TRUE FALSE FALSE   MT  chrM  chMT      MT

this leads to the previous use case with a BAM file, and also to this
other one:

library(BSgenome.Hsapiens.NCBI.GRCh38)

seqlevelsStyle(BSgenome.Hsapiens.NCBI.GRCh38)
warning! Multiple seqlevels styles found.
[1] "NCBI" "Ensembl"

note that now if you have a UCSC-style GRanges object like this one:

gr <- GRanges(c("chr1", "chr2"), IRanges(c(10, 20), c(30, 40)))
seqlevelsStyle(gr)
[1] "UCSC"

that you want to use with the BSgenome object, the following simple
operation will not work anymore:

seqlevelsStyle(gr) <- seqlevelsStyle(BSgenome.Hsapiens.NCBI.GRCh38)
warning! Multiple seqlevels styles found.
Error in mapSeqlevels(x_seqlevels, value, drop = FALSE) :
the 

Re: [Bioc-devel] Use of Matrix inside SummarizedExperiment

2016-01-26 Thread Peter Hickey
Thanks, Hervé!

On 26/01/2016, Hervé Pagès  wrote:
> Hi Pete,
>
> On 01/25/2016 12:32 PM, Peter Hickey wrote:
>> The Matrix virtual class in the Matrix package seems to mostly work as
>> an assays element in a SummarizedExperiment. This is especially useful
>> for data that can be efficiently represented as a sparse matrix, e.g.,
>> using the dgCMatrix class.
>>
>> My understanding is that this works because the (concrete subclasses
>> of) Matrix implement the necessary basic S4 methods to form a basic,
>> matrix-like API. However, there are a couple of edge cases that I'm
>> hoping it might be possible to smoothen out. Ideally, I'd love if this
>> could work for any class that implements a minimal matrix-like API
>> (I'm currently experimenting with such a class) and not just for the
>> Matrix virtual class and its concrete subclasses. From reading the
>> SummarizedExperiment code, it looks like the minimal methods required
>> for an element of a (concrete subclass of) Assays object would be dim,
>> dimnames, [, [<-, rbind, cbind, length. I suppose that if any
>> additional methods are added for the Assays virtual class (e.g., I
>> have an almost-complete combine,SummarizedExperiment-method that calls
>> a combine,Assays-method) then these matrix-like objects must also have
>> such methods defined to ensure relatively straightforward inheritance.
>>
>> Here are a couple of instances where a matrix and a Matrix behave
>> (understandably) differently but where it would be nice if it "just
>> worked". There may well be others, but I'd be interested to know
>> whether this is worth further pursuing.
>>
>> library(SummarizedExperiment)
>> library(Matrix)
>> m <- matrix(1:10, ncol = 2)
>> m2 <- Matrix(m)
>>
>> # SummarizedExperiment constructor has specialised matrix method.
>> se <- SummarizedExperiment(m)
>> # This won't work because there is no Matrix specialisation
>> se2 <- SummarizedExperiment(m2)
>> # But can get around this by wrapping the Matrix in a SimpleList to defer to
>> # the SummarizedExperiment,SimpleList-method
>> se2 <- SummarizedExperiment(SimpleList(m2))
>
> Note that wrapping the Matrix in an ordinary list also works.
>
>> # I guess the only way around this is to write a SummarizedExperiment method
>> # for every matrix-like class, which might be too much overhead for the
>> # SummarizedExperiment package to maintain. Perhaps there is another solution,
>> # e.g., try wrapping the input in a call to SimpleList if no method found and
>> # then deferring to the SimpleList method? Could be too messy to be worth it ...
>
> The method for matrix already does this wrapping into a SimpleList
> object and then defers to the method for SimpleList. I just
> replaced the current method for matrix by a method for ANY that does
> exactly the same thing. With this change, SummarizedExperiment() takes
> any matrix-like object.
>
>>
>> # assay<- dispatches on value (which must be a matrix)
>> assay(se) <- assay(se)
>> # Won't work because there is no Matrix specialisation
>> assay(se2) <- assay(se2)
>> # But using assays() does work
>> assays(se2)[[1]] <- assays(se2)[[1]]
>> # Could value be dropped from the assay<- signature and the object validated
>> # during/following the consequent call to assays<-?
>
> That makes a lot of sense. Having the assay() setter dispatch on 'x',
> 'i', and 'value' has no real benefit. Dispatching on 'x' and 'i' is
> enough and allows the assay() setter to take any matrix-like object as
> long as the resulting SummarizedExperiment object is valid.
>
> These 2 changes are in SummarizedExperiment 1.1.17.
>
> Thanks for the suggestions,
> H.
>
>>
>> Cheers,
>> Pete
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone:  (206) 667-5791
> Fax:(206) 667-1319
>
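
The "method for ANY that wraps and defers" idea in Hervé's reply can be sketched with the methods package alone. The generic makeAssays() and the ToyMatrix class below are hypothetical stand-ins, not SummarizedExperiment code; the point is only that a fallback method on ANY lets a constructor accept any object that answers dim().

```r
library(methods)

# Hypothetical generic standing in for a constructor.
setGeneric("makeAssays", function(x, ...) standardGeneric("makeAssays"))

# List method: does the real work and checks each element is two-dimensional.
setMethod("makeAssays", "list", function(x, ...) {
  two_d <- vapply(x, function(a) length(dim(a)) == 2L, logical(1))
  if (!all(two_d)) stop("every assay must be matrix-like (two dimensions)")
  x
})

# Fallback for ANY: wrap the object in a list and defer to the list method.
setMethod("makeAssays", "ANY", function(x, ...) makeAssays(list(x), ...))

# Toy matrix-like class that only implements dim().
setClass("ToyMatrix", representation(values = "numeric", dims = "integer"))
setMethod("dim", "ToyMatrix", function(x) x@dims)

from_matrix <- makeAssays(matrix(1:10, ncol = 2))
toy <- new("ToyMatrix", values = as.numeric(1:6), dims = c(2L, 3L))
from_toy <- makeAssays(toy)
```

With this layout no per-class constructor method is needed; validity is decided once, in the list method.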


Re: [Bioc-devel] C library or C package API for regular expressions

2016-01-26 Thread Gabe Becker
Jirka,

Do you mean with millions of different patterns (motifs)? If not, the
R-level regular expression functions are vectorized, and so the looping
will already happen for you in C.

Also, have you confirmed that the R evaluation overhead will actually
dominate the pattern matching here if you just do it in R? That very well
may be, but it's not obvious to me that it would depending on details about
what you're doing that I'm not privy to.
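
Both points can be checked on synthetic data before committing to C. The sketch below uses made-up random fragments and an arbitrary pattern (neither comes from this thread); it runs one vectorized regexpr() call and an explicit R-level loop, confirms that they give the same answer, and times each with system.time(). Actual timings depend on the machine and on the real patterns, so no numbers are claimed here.

```r
set.seed(42)
bases <- c("A", "C", "G", "T")

# Synthetic short DNA fragments (illustrative data only).
fragments <- replicate(
  2000,
  paste(sample(bases, 30, replace = TRUE), collapse = "")
)
pattern <- "AT[AGT]T[AC]"

# One vectorized call: the loop over fragments runs inside R's C code.
t_vec <- system.time(vec_hits <- regexpr(pattern, fragments))

# Explicit R-level loop: one regexpr() call per fragment.
t_loop <- system.time({
  loop_hits <- integer(length(fragments))
  for (i in seq_along(fragments)) {
    loop_hits[i] <- regexpr(pattern, fragments[i])
  }
})

same <- identical(as.integer(vec_hits), loop_hits)
print(c(vectorized = t_vec[["elapsed"]], loop = t_loop[["elapsed"]]))
```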

Best,
~G




-- 
Gabriel Becker, Ph.D
Associate Scientist
Bioinformatics and Computational Biology
Genentech Research


Re: [Bioc-devel] Warnings on moscato2 SinglePackage Builder

2016-01-26 Thread Karen Oróstica
Hi

I have the same problem with moscato2, but I don't know the reason. Does
anyone know how to fix it?

Best,

Karen Oróstica

2016-01-20 9:00 GMT-03:00 Thomas Lin Pedersen :

> Hi
>
> My submitted package, PanVizGenerator, is building fine on all three
> systems but gets some warnings during CHECK, unique to moscato2 (Windows),
> that I’m not able to reproduce. The warnings are:
>
> Warning: multiple methods tables found for 'unlist'
> Warning: multiple methods tables found for 'as.vector'
> Warning: multiple methods tables found for 'unlist'
> Warning: multiple methods tables found for 'as.vector'
> Warning: multiple methods tables found for 'unlist'
> Is this a problem with moscato2? I don’t get this if I build locally…
>
> best
>
> Thomas




-- 
Karen Oróstica,
bioinformatics engineer
Technical assistant
Genomed Lab
University of Chile
