In perl.git, the branch davem/regex8 has been created

<http://perl5.git.perl.org/perl.git/commitdiff/db5c6a3fbac35d95d3fc1f9735d63a5e23b29be1?hp=0000000000000000000000000000000000000000>

        at  db5c6a3fbac35d95d3fc1f9735d63a5e23b29be1 (commit)

- Log -----------------------------------------------------------------
commit db5c6a3fbac35d95d3fc1f9735d63a5e23b29be1
Author: David Mitchell <[email protected]>
Date:   Fri Dec 16 16:02:59 2016 +0000

    POC fast find start class and fingerprint
    
    S_find_byclass()  is used to find a suitable starting point to run a regex
    match. For example in /(abc|def|ghi)..../, it might search for one of the
    three characters [adg].
    
    This commit (which is purely proof of concept and not intended for
    production) expands that idea in two ways, taking advantage of the ability
    to process 8 bytes at once on 64-bit platforms.
    
    First, rather than checking each byte one by one and indexing into a
    bitmap of valid characters in the class, it (for when there are 4 or less
    valid characters) reads 8 bytes at a time into a UV then compares 8 bytes
    in parallel for each of the valid chars by using a bunch of bit ops. It
    should be possible too (although this commit doesn't try) to match a
    one or two byte ranges in parallel instead.
    
    Second, once having found a possible starting position based on that byte
    satisfying the character class, it uses a fingerprint to quickly check
    where the next up-to-8 bytes of the string are likely to match. The idea
    here is to work out for each of the 8 starting bytes of the pattern, what
    are the valid bit patterns that can occur in each of then. When the
    pattern is compiled, two UVs are generated which specify for each of the
    64 bit positions, which bits must always be 0, or must always be 1.
    
    Depending on the nature of the pattern, this may allow many start
    positions to be quickly rejected.
    
    This commit only implements these two ideas in the trie part of
    S_find_byclass(); so in something like
    
        /(abc[defh]hij|vwxyz)..../
    
    it searches 8 bytes at a time for a or v, then uses a fingerprint to see
    if the next three bytes are likely to be "abc" or "vwx".
    If fully implemented (so examing all possible sequences of start ops, not
    just the trie), the fingerprint would be able to check the 5 five
    bytes in a single op rather than just the 3 of this commit.
    
    Whether the fingerprint can quickly reject a match position depends on the
    nature of the alternatives - an alternation with lots of alternatives will
    tend to fill out all bit values as being legal. Even in this case though
    it would be able to distinguish between different types of character; for
    example /lots|of|words|..../ might end up matching most a-z chars at most
    positions, but would still quickly reject of there was an A-Z or 0-9 in
    any of the first few bytes.
    
    Using this on the 'regexdna' shootout benchmark which looks for pairs
    of DNS sequences in a long ( 50Mbyte) string of nucleotides, I see the
    following speedups for particular patterns in a
        $cnt++ while $content =~ /$pat/
    loop (in millions of CPU cycles, so lower is better):
    
         pre    post
      12,783  13,609 /a[act]ggtaaa|tttacc[agt]t/
       6,663   6,778 /ag[act]gtaaa|tttac[agt]ct/
       5,773   5,100 /agg[act]taaa|ttta[agt]cct/
       5,288   2,409 /agggtaaa|tttaccct/
    
    This benchmark suffers from the twin problems that
    1) the character class search doesn't shine because the source string
       mostly just consists of a,c,g,t, so skipping to the next a or t will
       typically only skip a couple of characters;
    2) because the commit only generates the fingerprint from the trie data
       rather then the whole regex, the fingerprint is only 2 or 3 bytes long
       in the middle two patterns above. Note that in the final patternm
       which has a full 8 bytes fingerprint with few alternates, the match is
       more than twice as fast.
    
    Another example is $cnt++ while /peter|jerusalem/ against a string
    containing a lc() version of the bible: pre=2.7, post=1.6 Mcycles.
    
    The current implementation doesn't work with tries containing more than
    two alternatives, as the trie data (IIUC) is stored in an op rather than
    in a separate trie structure, and my hacky code only works with the
    latter.
    
    I think the idea has scope for potential.
-----------------------------------------------------------------------

--
Perl5 Master Repository

Reply via email to