RE: Regex bytecode bakeoff

Brent Dax Mon, 06 May 2002 21:19:31 -0700

(FYI, I'm the main author of Parrot's rx package.)

Mark Kvale:
# Warning: this is a long message, a small paper really. The 
# synopsis is that I have created a simple regex compiler with 
# multiple backends to test proposed regex coding schemes. 
# Details and test results are below.
# 
# Hi all,
# 
# One of the upcoming decisions that needs to be made is on the 
# design of the Parrot regex engine. More specifically, (1) how 
# will compiled regexes be represented and (2) how will strings 
# be matched against compiled regexes?
# 
# I am relatively new to p6i, but gather there are at least three
# approaches: 
# 
#    A. Compile regexes down to Parrot bytecode that forms an NFA
#       representation of the regex. Calling the code with a string
#       parameter matches the regex against that string. Compile using
#       only core ops.
# 
#    B. Same scheme as (A), but use specialized rx ops as well as core
#       ops.  
# 
#    C. Create a specialized compiler for regexes that produces a
#       specialized abstract bytecode representing the regex. To match,
#       feed the bytecode and the string to a specialized engine. Thus
#       regexes would be an independent subsystem within parrot. This is
#       the approach perl5 takes.


That pretty much covers it.

# Three metrics one might use to evaluate the possibilities are 
# size of the bytecode representing the regex, execution speed, 
# and ease of design/maintenance of the regex compilation code.
# 
# Regarding ease of design/maintenance, schemes (A) and (B) 
# have the advantage of making regex/perl code integration easy 
# compared to (C). Some people have claimed that perl5 regex 
# code is complex, a mess, and quite difficult to understand, 
# so we should not make the same mistake of implementing a 
# scheme like (C) again. No doubt after 10 years of hacking, 
# the perl5 regex code could use a major house cleaning. But it
# is a mistake to think that a rewrite in parrot bytecode will 
# make the complexity magically disappear. Coding even the 
# simple implementation below is tricky and debugging the flow 
# of a recursive or stack-based iterative finite-state 
# automaton will make your head hurt. Adding code optimizations 
# and metadata to match faster just redoubles that complexity. 
# So creating (A) or (B) will be a lot of hard work and without 
# good design and documentation, the resulting code will be 
# just as difficult to understand as the perl5 code is.

They've tried to rewrite the code, and they always lose tons of speed in
the process.  Perl's regex engine is coded to be fast, which makes it
ugly.  The hope is that a completely different approach will keep us
pretty close to Perl speed without all the cruft of Perl's engine.

# But the results below are about bytecode size and execution speed.
# 
# To compare schemes (A), (B), and (C), I implemented each of 
# them. I created a common parser for the three schemes that 
# recognizes the
# basics: ^, $, ., a, \a, [a], [^a], ab, a|b, a*, a+, a? and 
# (a).  The parsed regex can be converted to one of three types 
# of bytecode: 
# 
#    A. 'reg'  - Parrot bytecode, just core ops
#    B. 'rx'   - Parrot bytecode, core and rx ops
#    C. 'spen' - a specialized bytecode used by the execution engine in
#                Henry Spencer's original regex package. His package
#                formed the basis for perl5 regex engine.
# 
# For the reg backend, I made use of assembly idioms used in 
# Steve Fink's Regex package. For the rx backend I used idioms 
# suggested by the rx pod.  For the spen backend, I generated 
# bytecode that exactly matched Spencer's bytecode, so that I 
# could use his regexec routines directly for execution.  
# Although not entirely bug-free, all three backends pass the 
# match/no-match portion of the Spencer test suite, so they 
# have a basic level of functionality. The code is available at
# 
#    http://keck.ucsf.edu/~kvale/simple_rx.tar.gz
# 
# and is meant to be unpacked in the languages/ directory.
# 
# To level the playing field for the three backends, I opted to 
# include no optimizations whatsoever: no regex rewriting, no 
# 'must match' strings, etc. Each backend is a straight 
# translation of the regex.  I did try to optimize emitted reg 
# and rx bytecode for execution speed and concision. Both 
# parrot and the Spencer code were compiled with -O. 
# We will compare with perl5 as well, which is most definitely 
# not fair, but shows the goals we need to shoot for.
# 
# Here are some relative bytecode sizes (in bytes) for various regexes:
# 
# regex                              reg      rx    spen
# 
# ''                                 236     128      10
# 'a'                                300     172      12
# '(bc+d$|ef*g.|h?i(j|k[^lmnop]))'  1860    1168     114
# 
# Overall, rx code is about 10-15 times as large as spen code 
# and reg code is 1.5-2 times as large as rx code.  The reason 
# that reg and rx codes are so much larger than spen code is 
# that (1) they are complete parrot subroutines, whereas spen 
# code is just a set of instructions to a separate engine, and 
# (2) parrot bytecode is general purpose and thus more verbose 
# than a specialized coding scheme like that for spen.  rx cuts 
# down on code size relative to reg, thanks to specialized 
# instructions combining several tasks into one.
# 
# Execution time is perhaps the most crucial metric.  perl5 
# excels at searching large amounts of text quickly; most users 
# would not want to give that up.
# 
# We'll model the execution time of a regex test as follows:
# 
#    t[i] = co + n * (lo + r[i])
# 
# with
# 
#    t[i] = execution time for the whole test of regex i
#      co = a constant overhead, approximately independent of regex
#       n = number of repetitions of the match
#      lo = an overhead per match, approximately independent of regex
#    r[i] = time taken by just the regex code for a single match
# 
# co is dependent on the test harness used and is 
# uninteresting.  lo takes into account the allocation, 
# initialization and cleanup time needed for any single regex 
# match and is relevant for us; each regex takes at the very 
# least lo time. r[i] is the time taken to do the actual match 
# against a string and is dependent on both the string and the regex.
# 
# We can solve for co by testing against a fixed regex and 
# string, and varying the number fo repetitions. Given co, we 
# can solve for lo by keeping repetitions fixed and varying the 
# size of a repetitive regex and string. For instance, in each 
# of the backends, matching regex 'aaa' against string 'aaa' 
# will take about three times as long as matching regex 'a' 
# against string 'a'.  For a regex-string combination that can 
# be made repetitive, the model becomes
# 
#    t[i] = co + n * (lo + m*ro[i])
# 
# with m the number repetitive elements in the regex and ro[i] 
# the per unit matching time.
# 
# Matching regex 'a' against string 'a' we get the following 
# times on my Lintel notebook: (All times are in seconds)
# 
# "a" =~ /a/:
# 
# backend   lo            ro             lo/ro     ro/spen ro
# 
# reg:      4.518e-06     1.935e-07      23.3      12.8
# rx:       1.621e-06     5.387e-07       3.0      35.6
# spen:     5.477e-07     1.509e-08      36.2       1.0
# perl:     7.897e-07     1.267e-08      62.3       0.84
# 
# The fixed, per match overheard lo is 3 times larger for rx 
# than spen and 8 times larger for reg than spen. The per 
# character match time ro is 36 times larger for rx than for 
# spen and 13 times larger for reg than for spen. That the per 
# character time for rx is so large seems counterintuitive to 
# me. On the other hand, the rx_literal code
# 
# op rx_literal(in pmc, in str, inconst int) {
#         UINTVAL i;
#         RX_dUNPACK($1);
#         if(string_length(rx->string) < rx->index+string_length($2)) {
#                 goto OFFSET($3);
#         }
#         /* This is faster than using substr--it avoids memory 
# allocations. */
#         for(i=0; i < string_length($2); i++) {
#                 if(   string_ord(rx->string, rx->index+(INTVAL)i) 
#                    != string_ord($2, (INTVAL)i)) {
#                         goto OFFSET($3);
#                 }
#         }
#         RxAdvanceX(rx, string_length($2));
#         goto NEXT();
# } 

However, my current design plans for rx would make this faster.
Basically, it would automagically use pointer arithmetic on fixed-width
encodings and do various other dirty tricks to save time.  (To avoid
tons of conditionals, it would use a vtable.)

# computes dynamically several functions, such as length of the 
# regex match string, that are precomputed in the reg code...
# 
# Note that the overhead/match ratio lo/ro is large in the reg, 
# spen and perl routines. In the reg code, I think this is due 
# to the allocation of two PerlArrays for subexpression 
# indices. In the spen and perl code, it is due to data 
# structure initialization and optimization checks.

It may be disabled in the current code, but rx allocates a pair of
PerlArrays too.

# Here are the times for two other atoms:
# 
# "a" =~ /./
#              lo              ro           ro/spen 
# reg:         4.423e-06       2.18e-07     5.4
# rx:          1.522e-06       1.907e-07    4.7
# spen:        4.839e-07       4.063e-08    1.0
# perl:        8.945e-07       2.81e-08     0.7
# 
# The disparities between spen and reg and rx are more modest here.
# 
# "a" =~ /[A-Za-z0-9_]/
#              lo              ro           ro/spen
# reg:         4.389e-06       9.487e-07     3.1
# rx:          1.599e-06       1.184e-05    39.2
# spen:        3.758e-07       3.014e-07     1.0
# perl:        9.922e-07       3.75e-08      0.12
# 
# spen slows down by calling strchr, so reg is only three times 
# slower. Clearly, rx_oneof could be optimized for 
# multi-character classes.

rx_oneof is very, very slow because it compiles a bitmap of the
character class each time, then uses the bitmap for a fast lookup.
There are two opcodes, rx_makebitmap (?) and rx_oneof_bmp, which can be
used to precompile the bitmap.

If you're testing these in a loop (which you almost certainly are),
there's another trick you can use with rx.  rx_clearinfo can be used to
reset all the fields in the regex's info structure without allocating
anything; it might theoretically be possible to code up an entire (real
world) program that only allocated one info structure.  From my
experience coding and optimizing rx, memory allocations are killer.

# For the '+' quantifier, we get
# 
# "a"x1000 =~ /a+/
# 
#        lo + r
# reg:   8.07e-04
# rx:    8.46e-04
# spen:  3.92e-06
# perl:  7.56e-06
# 
# Above we found lo/ro ratios of 3-60, so for a string of 
# length 1000 most of the time is spent in the match, not the 
# overhead.  Both perl and spen do a greedy match using pointer 
# arithmetic in a tight C loop. rx and reg both do a branch and 

Something that wouldn't be possible in a well-behaved Parrot regex
engine--a string's internals are supposed to be its own business.  (Of
course, nobody ever said that the engine had to be well-behaved... :^) )

# a save to stack for each iteration, which slows them down by 
# an amazing factor of 200 relative to spen. This extreme 
# factor is partly due to a deficit in my code; spen 
# differentiates between simple and complex atoms in the 
# presence of * or + and optimizes the execution accordingly. 
# If did the same, I could eliminate a stack save per iteration.
# 
# The following match induces a modest amount of backtracking:
# 
# ("ab"x10000)."c" =~ /abc/
# 
#       r          r/spen
# reg:  1.62e-02   2.2
# rx:   2.52e-02   3.4
# spen: 7.35e-03   1.0
# perl: 2.47e-04   0.034
# 
# Recursion is typically slower than iteration with a stack, so 
# that reg and rx pull within a factor of 2-3 of spen. perl5 
# achieves a factor of 30 speedup by noting that abc is a 
# required substring and searching in an efficient fashion for 
# that string.

I always hated that optimization--it made it really hard to benchmark
against something realistic.

# Let's now look at performance under heavy backtracking. We'll 
# use a 'naughty' regex with nested indeterminate quantifiers, 
# which generate an exponential number of partitions of the 
# string in their search for a match.
# 
# "bbbbXcXaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/
# 
#             r
# reg         6.64
# rx          4.99
# spen        3.89
# perl        1.13e-04
# 
# In this test, rx is only 28% slower than spen. reg and rx 
# have the same flow control, so the advantage in rx must be in 
# its private integer stack, vs the general user stack that reg 
# uses. perl5 uses optimization to bypass exponential blowup 
# and trounce the competition.

The special int stack will become public as soon as I have time to code
it up.

# From the data, we can make several conclusions:
# 
#  1. parrot bytecode is 10-25 times larger than the more specialized
#     Spencer bytecode.

However, it's interesting to note that the Parrot bytecode would be a
quarter of its current size if we were using 8-bit opcodes instead of
32-bit ones.

#  2. Both the reg and rx backends produce code that is generally 3-35
#     times slower that spen code fed into a specialized engine.
# 
#  3. rx ops can help reduce bytecode size, but often run 
# slower than the
#     equivalent core code.

At least partly because they 

#  4. One area in which reg and rx are relatively competitive is
#     backtracking. Iteration, coupled with a stack, provides efficient
#     backtracking relative to a recursive approach.
# 
#  5. We already knew this, but all those optimizations in perl5 are a
#     big win performance-wise over the Spencer engine.
# 
# Caveats:
# 
#  1. I've tried to write short, efficient code for the reg and rx
#     backends. But I am new to Parrot assembly and could have done
#     something seriously suboptimal. I welcome folks looking at and
#     playing with the reg and rx code to see if there are faster idioms
#     that can be machine generated.

More likely, I'm new to regex engines and have never studied finite
automata, so I screwed up something in the design of rx.

#  2. These are early days for parrot optimization, and it may be parrot
#     bytecode based regex engines can narrow the gap against
#     specialized regex engines.  
# 
# In my opinion, however, factors of 3-10 or more in execution 
# time will be hard to overcome. Both the code size and 
# execution time data strongly favor scheme (C), building an 
# independent specialized bytecode engine into parrot.

Have you tried the rx and reg versions with exotic settings for Parrot,
like JIT or prederef?  It would also be interesting (although quite
difficult I'm sure) to modify Spencer's package to use Parrot strings.

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

blink:  Text blinks (alternates between visible and invisible).
Conforming user agents are not required to support this value.
    --The W3C CSS-2 Specification

RE: Regex bytecode bakeoff

Reply via email to