[issue26436] Add the regex-dna benchmark

Terry J. Reedy Sat, 27 Feb 2016 15:10:27 -0800

Terry J. Reedy added the comment:

DNA matching can be done with difflib.  Serious high-volume work should use 
compiled specialized matchers and aligners.
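
As a rough illustration of the difflib route, the sketch below finds the
longest contiguous run shared by two short sequences.  The sequences and the
helper name longest_shared_run are made up for this example; for large inputs
the caveat above about specialized matchers still applies.

```python
from difflib import SequenceMatcher

def longest_shared_run(a, b):
    """Return the longest contiguous block common to both sequences."""
    # autojunk=False turns off difflib's "popular element" heuristic, which
    # would otherwise discard frequent bases in sequences of 200+ items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

# Hypothetical toy sequences, not taken from the benchmark input.
seq1 = "ggacgtacgtcc"
seq2 = "ttacgtacgtaa"
shared = longest_shared_run(seq1, seq2)
print(shared)
```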


This particular benchmark, explained a bit at 
https://benchmarksgame.alioth.debian.org/u64q/regexdna-description.html#regexdna,
 manipulates and searches standard FASTA format representations of sequences 
with the regex available in each language.  (The site has another Python 
implementation at 
https://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=python3&id=1.
 It uses unicode strings rather than bytes, and multiprocessing.Pool to run 
re.findall in parallel.)

FASTA uses lowercase a,c,g,t for known bases and at least 11 uppercase letters 
for subsets of bases representing partially known bases.  The third task is to 
expand uppercase letters to subsets of lowercase letters.  Since the rules 
require use of re and one substitution at a time, the 2 Python programs run 
re.sub over the current sequence 11 times.  More idiomatic for Python, and 
probably faster, would be to use seq.replace(old, new) instead.  Perhaps even 
more idiomatic, and probably faster still, would be to use str.translate, as in 
this reduced example.

>>> table = {ord('B') : '(c|g|t)', ord('D') : '(a|g|t)'}
>>> 'aBcDg'.translate(table)
'a(c|g|t)c(a|g|t)g'
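
For comparison, here is a minimal sketch of the three approaches side by side.
The B and D entries come from the example above; the other nine entries are
filled in from the standard IUPAC ambiguity codes, and the function names are
invented for this sketch.  No timings are claimed here; the timeit module
could be used to check the "probably faster" guesses on real input.

```python
import re

# Ambiguity code -> regex alternation over the bases it stands for.
# B and D match the reduced example; the rest follow the IUPAC codes.
IUB = {
    'B': '(c|g|t)', 'D': '(a|g|t)', 'H': '(a|c|t)', 'K': '(g|t)',
    'M': '(a|c)', 'N': '(a|c|g|t)', 'R': '(a|g)', 'S': '(c|g)',
    'V': '(a|c|g)', 'W': '(a|t)', 'Y': '(c|t)',
}

def expand_with_re(seq):
    # What the benchmark rules require: one re.sub pass per code.
    for code, alternation in IUB.items():
        seq = re.sub(code, alternation, seq)
    return seq

def expand_with_replace(seq):
    # Same loop using str.replace; still one pass per code.
    for code, alternation in IUB.items():
        seq = seq.replace(code, alternation)
    return seq

# str.maketrans accepts a dict whose keys are single characters.
TABLE = str.maketrans(IUB)

def expand_with_translate(seq):
    # A single pass over the sequence handles all codes at once.
    return seq.translate(TABLE)

sample = 'aBcDgNt'
results = [f(sample) for f in
           (expand_with_re, expand_with_replace, expand_with_translate)]
print(results[0])
```

The sequential loops are safe here only because no replacement text contains
an uppercase letter, so an earlier substitution can never be rewritten by a
later one.  str.translate avoids that concern entirely.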

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26436>
_______________________________________