Re: [Bioc-devel] C library or C package API for regular expressions

Hervé Pagès Mon, 25 Jan 2016 14:43:16 -0800

Hi Jiri,

On 01/25/2016 09:40 AM, Jiří Hon wrote:

Hi Martin


Dne 25.1.2016 v 13:08 Morgan, Martin napsal(a):

There is discussion at

http://stackoverflow.com/questions/23556205/using-boost-regex-with-rcpp

pointing to

http://gallery.rcpp.org/articles/boost-regular-expressions/

There is a Bioconductor example that bundles the regex library in
flowCore/src/

https://github.com/Bioconductor-mirror/flowCore

A second example is in the mzR package.


Thank you for pointing me to the flowCore and mzR packages, these
examples are really helpful.

A real question is, do you really need this functionality at the C
level?


I think it's unavoidable in my case for performance reasons. I'm trying
to detect all possible overlapping motifs in DNA composed of
elements matching some regular expression.


I think Martin's question is: are you sure you need this at the C
level? What makes you think that calling a regex engine from C will
perform better than calling it from R?

Note that using a regex for finding motifs in a DNA sequence has 2
fundamental problems:

(1) It doesn't always find all the matches. For example if 2 matches
    are overlapping, it only returns the 1st of the 2 matches:

  > library(Biostrings)

  > matchPattern("ATAAT", "CCATAATAATGATAAT")
    Views on a 16-letter BString subject
  subject: CCATAATAATGATAAT
  views:
      start end width
  [1]     3   7     5 [ATAAT]
  [2]     6  10     5 [ATAAT]
  [3]    12  16     5 [ATAAT]

  > gregexpr("ATAAT", "CCATAATAATGATAAT")[[1]]
  [1]  3 12
  attr(,"match.length")
  [1] 5 5
  attr(,"useBytes")
  [1] TRUE

(2) It's inefficient on a long DNA sequence:

  > library(BSgenome.Hsapiens.UCSC.hg19)
  > chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
  > system.time(m1 <- matchPattern("ATAAT", chr1))
     user  system elapsed
    0.946   0.000   0.940
  > chr1c <- as.character(chr1)
  > system.time(m2 <- gregexpr("ATAAT", chr1c)[[1]])
     user  system elapsed
    4.109   0.000   4.109
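
As a side note on problem (1), a common base R workaround is a zero-width
lookahead with perl=TRUE, which reports every start position, including
overlapping ones. The sketch below is only an illustration (the helper name
overlapping_starts is made up here), and it does nothing about problem (2):

```r
# Hypothetical helper (not part of any package): start positions of all
# occurrences of a regex motif, including overlapping ones.
# A lookahead "(?=...)" matches without consuming characters, so the
# regex engine can report a new match at every position.
overlapping_starts <- function(motif, subject) {
  m <- gregexpr(paste0("(?=", motif, ")"), subject, perl = TRUE)[[1]]
  starts <- as.integer(m)   # drops the match.length attributes
  starts[starts > 0]        # gregexpr() returns -1 when there is no match
}

starts <- overlapping_starts("ATAAT", "CCATAATAATGATAAT")
print(starts)
```

On the 16-letter example above this should report the same three start
positions as matchPattern().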

This was actually the very first motivating use case for developing
the Biostrings package. It's important to realize that calling the regex
engine directly from C wouldn't make much difference, since gregexpr()
already does its matching in compiled code.

matchPattern() and family don't support regex though. However when
working with DNA motifs, the motifs can often be described with IUPAC
ambiguity letters. For example, instead of describing a motif with
the regular expression AT(A|G|T)T(A|C)GG.G, you can describe it with
ATDTMGGNG. Then you can use matchPattern() on this pattern with
fixed=FALSE to find all the matches. Additionally you can use the
'max.mismatch' and/or 'with.indels' arguments to allow a small number
of mismatches and/or indels. See ?matchPattern for more information
and examples.
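
A minimal sketch of the IUPAC approach, assuming the Biostrings package is
installed. The subject sequence is made up for illustration, and it is built
with DNAString() so that its letters are treated as nucleotides:

```r
library(Biostrings)

# Made-up subject sequence, for illustration only.
subject <- DNAString("CCATGTAGGCGTTATATCGGAGCC")

# D = A, G or T; M = A or C; N = any base.
# fixed = FALSE makes matchPattern() interpret these IUPAC codes as
# ambiguities instead of literal letters.
m <- matchPattern("ATDTMGGNG", subject, fixed = FALSE)
print(start(m))
```

On this subject the pattern should hit twice, at positions 3 and 14.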

Of course this has its own limitations: you can only do this for a
subclass of regular expressions. For example regular expressions that
use * or + to allow for repetitions cannot be replaced by a sequence
with just IUPAC codes, so the string matching tools in Biostrings
cannot be used in that case.

Cheers,
H.

A secondary question is that if several packages are using this
functionality, then perhaps the library could be bundled separately
and made available just once; zlibbioc does something like this (sort
of; zlib is only needed on Windows). The flowCore and mzR maintainers
(cc'd) might be a valuable resource in this regard.


Efficient regexp algorithms seem useful to me for solving many
bioinformatics problems, so it would be natural to have a package with
a C API to the most efficient regexp libraries.

Martin

________________________________________
From: Bioc-devel <bioc-devel-boun...@r-project.org> on behalf of Jiří Hon <xhonj...@stud.fit.vutbr.cz>
Sent: Monday, January 25, 2016 4:33 AM
To: Charles Determan
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] C library or C package API for regular expressions

Hi Charles,

thank you very much for your helpful hint. There is still one thing
I'm not sure about: the Boost manual says that Boost.Regex is not
header-only [1]. Since the BH package contains only headers, I would
have to bundle the Boost.Regex library into the package code anyway.
Am I right?

Jiri

[1]
http://www.boost.org/doc/libs/1_60_0/more/getting_started/unix-variants.html#header-only-libraries

Dne 23.1.2016 v 13:35 Charles Determan napsal(a):

Hi Jiri,

I believe you can use the BH package. It contains most of the
Boost headers.


Regards, Charles

On Saturday, January 23, 2016, Jiří Hon <xhonj...@stud.fit.vutbr.cz> wrote:

Dear package developers,

I would like to ask you for advice. What is the most seamless way
to use regular expressions in the C/C++ code of an R/Bioconductor
package? Is it allowed to bundle a C/C++ library for that (like
PCRE or Boost.Regex)? Or is there an existing C API in some package
that I can depend on and import?

Thank you a lot for your attention and please have a nice day :)

Jiri Hon

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
