In perl.git, the branch davem/regex8 has been created
<http://perl5.git.perl.org/perl.git/commitdiff/db5c6a3fbac35d95d3fc1f9735d63a5e23b29be1?hp=0000000000000000000000000000000000000000>
at db5c6a3fbac35d95d3fc1f9735d63a5e23b29be1 (commit)
- Log -----------------------------------------------------------------
commit db5c6a3fbac35d95d3fc1f9735d63a5e23b29be1
Author: David Mitchell <[email protected]>
Date: Fri Dec 16 16:02:59 2016 +0000
POC fast find start class and fingerprint
S_find_byclass() is used to find a suitable starting point to run a regex
match. For example in /(abc|def|ghi)..../, it might search for one of the
three characters [adg].
This commit (which is purely proof of concept and not intended for
production) expands that idea in two ways, taking advantage of the ability
to process 8 bytes at once on 64-bit platforms.
First, rather than checking each byte one by one and indexing into a
bitmap of valid characters in the class, it (for when there are 4 or less
valid characters) reads 8 bytes at a time into a UV then compares 8 bytes
in parallel for each of the valid chars by using a bunch of bit ops. It
should be possible too (although this commit doesn't try) to match a
one or two byte ranges in parallel instead.
Second, once having found a possible starting position based on that byte
satisfying the character class, it uses a fingerprint to quickly check
where the next up-to-8 bytes of the string are likely to match. The idea
here is to work out for each of the 8 starting bytes of the pattern, what
are the valid bit patterns that can occur in each of then. When the
pattern is compiled, two UVs are generated which specify for each of the
64 bit positions, which bits must always be 0, or must always be 1.
Depending on the nature of the pattern, this may allow many start
positions to be quickly rejected.
This commit only implements these two ideas in the trie part of
S_find_byclass(); so in something like
/(abc[defh]hij|vwxyz)..../
it searches 8 bytes at a time for a or v, then uses a fingerprint to see
if the next three bytes are likely to be "abc" or "vwx".
If fully implemented (so examing all possible sequences of start ops, not
just the trie), the fingerprint would be able to check the 5 five
bytes in a single op rather than just the 3 of this commit.
Whether the fingerprint can quickly reject a match position depends on the
nature of the alternatives - an alternation with lots of alternatives will
tend to fill out all bit values as being legal. Even in this case though
it would be able to distinguish between different types of character; for
example /lots|of|words|..../ might end up matching most a-z chars at most
positions, but would still quickly reject of there was an A-Z or 0-9 in
any of the first few bytes.
Using this on the 'regexdna' shootout benchmark which looks for pairs
of DNS sequences in a long ( 50Mbyte) string of nucleotides, I see the
following speedups for particular patterns in a
$cnt++ while $content =~ /$pat/
loop (in millions of CPU cycles, so lower is better):
pre post
12,783 13,609 /a[act]ggtaaa|tttacc[agt]t/
6,663 6,778 /ag[act]gtaaa|tttac[agt]ct/
5,773 5,100 /agg[act]taaa|ttta[agt]cct/
5,288 2,409 /agggtaaa|tttaccct/
This benchmark suffers from the twin problems that
1) the character class search doesn't shine because the source string
mostly just consists of a,c,g,t, so skipping to the next a or t will
typically only skip a couple of characters;
2) because the commit only generates the fingerprint from the trie data
rather then the whole regex, the fingerprint is only 2 or 3 bytes long
in the middle two patterns above. Note that in the final patternm
which has a full 8 bytes fingerprint with few alternates, the match is
more than twice as fast.
Another example is $cnt++ while /peter|jerusalem/ against a string
containing a lc() version of the bible: pre=2.7, post=1.6 Mcycles.
The current implementation doesn't work with tries containing more than
two alternatives, as the trie data (IIUC) is stored in an op rather than
in a separate trie structure, and my hacky code only works with the
latter.
I think the idea has scope for potential.
-----------------------------------------------------------------------
--
Perl5 Master Repository