<tolerance-plea>
Upon completing this message, it occurs to me that it is REALLY long and
covers a lot of ground. If its size intimidates or annoys you (as I
imagine it would me), consider that this is a good three or four weeks of
material that I could have been sending one or two paragraphs at a time
to the list. So in a way, this is a "digest" of all those messages I
didn't send. </tolerance-plea>
All,
We are a pretty hardcore user of ORO and have been very happy with it.
As I have said in previous messages to the list, we have done significant
performance testing (mostly of ORO versus the Sun JDK 1.4 implementation,
but with others as well). Our conclusion from this testing is that ORO is
approximately twice as fast as Sun (for our use), and nothing else is even
in the ballpark. Our tests run the 580+ Perl regular expressions from
SpamAssassin against a 3k test email message; while this is certainly a
limited benchmark, we feel that it models our application's use reasonably
well.
The Sun implementation had some additional features that we needed
(like group-local modifiers), but it also had some Perl compatibility
problems (including its handling of unnecessarily backslash-escaped
characters inside character classes). Since the workaround effort was
comparable either way, and the performance of ORO was so much better, we
decided to stay with ORO.
Even though we are convinced that ORO is the fastest regex solution
available to us, we are still not sure that it is fast enough (we have
specific performance goals, and in our current tests our product is doing a
LOT of stuff in addition to text searching, but still spends around 85% of
its time in ORO). In a previous, somewhat similar, project done in C++ a
few years ago our solution was to build our own high performance DFA-based
regex implementation. Now I'm here at a new company looking at a very
similar problem, and without the DFA regex code, but with a vague memory of
how much work it was to create (a lot) and test (a lot more). So I'd like
to see if there is maybe some way to work to help make ORO faster before we
launch into building our own NFA/DFA hybrid from scratch.
To explain a little, our application has two principal uses of regex
searching where performance is important. The first use is in searching the
plaintext of email messages for the presence of simple expression matches
from a large list. For example, in SEC regulatory compliance we need to
search for about 4,000 stock ticker symbols (these are just literal text,
case insensitive, with a word boundary on either end). The second use is in
SPAM processing using our implementation of SpamAssassin. In this case, we
run a variety of Perl 5.6 regular expressions against various headers,
decoded message plaintext, raw message body, and URIs extracted from the
message body. Many of these expressions are complex, verging on creepy, and
some of them include backreferences. Also note that since we synchronize
this list from an external source periodically, it is not practical to
hand-tune the expressions for performance.
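For concreteness, the first use case can be sketched roughly as follows. This is an illustration only, not our production code (which uses ORO); it is shown with java.util.regex so it stands alone, and the symbol list is made up:

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch: compile a list of literal ticker symbols into one
// case-insensitive alternation with a word boundary on either end, so
// that "T" matches "buy T now" but not "Tuesday".
public class TickerScan {
    public static Pattern compileTickers(List<String> symbols) {
        StringBuilder alternation = new StringBuilder();
        for (String symbol : symbols) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(symbol)); // symbols are literal text
        }
        return Pattern.compile("\\b(?:" + alternation + ")\\b",
                               Pattern.CASE_INSENSITIVE);
    }

    public static boolean containsTicker(Pattern tickers, String text) {
        return tickers.matcher(text).find();
    }
}
```

In our real use the list holds about 4,000 symbols, so the interesting question is how the engine scales with the size of that alternation.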
We recently attempted to get better performance using the Awk support
in ORO (I get excited about regex performance whenever I see "DFA"). I
wrote some "generic" compiler and pattern implementations (adapters) that
would use Awk if Awk supported the expression, otherwise fall back to using
Perl. This approach initially produced horrible results because of a handful
of degenerate-case DFAs (on the order of 2000x worse in the worst case).
The reason for this, as far as I can tell from looking at the code, is that
the DFA table is not built during compile, but rather constructed as needed
on each match (I could very well be wrong about this, but that's what it
looked like). I modified the generic adapter to only use Awk when there
were no wildcard constructs at all (no *, +, or open-ended quantified
atoms), and I got about a 2x improvement for the Awk expressions (which was
about a 25% improvement overall).
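The routing check I ended up with can be approximated by a predicate like the one below. This is a hypothetical, simplified sketch; the adapter classes themselves are ours and not shown, and the names here (EngineRouter, safeForAwk) are invented for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the routing heuristic: only send an expression
// to the Awk (DFA) engine when it contains no open-ended quantifiers
// (*, +, or {n,}); everything else falls back to the Perl engine.
public class EngineRouter {
    // An unescaped *, +, or open-ended {n,} quantifier. Simplified: a
    // doubled backslash before a quantifier would be misread as an
    // escape, which is good enough for a sketch.
    private static final Pattern OPEN_ENDED =
            Pattern.compile("(?<!\\\\)(?:\\*|\\+|\\{\\d+,\\})");

    public static boolean safeForAwk(String expression) {
        return !OPEN_ENDED.matcher(expression).find();
    }
}
```

Note that bounded quantifiers like {2,5} pass the check; only the open-ended forms are excluded.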
I was starting to investigate making some changes to the Awk support
(building the complete DFA state table during compile, adding support for
word boundaries, etc) when I started noticing other things in the code and
realized that its "8-bit ASCII"-ness was going to be more of a problem than
I had originally thought (I originally thought that only literal characters
in expressions were limited in this way, not the streams being searched). I
got to the point where modifying Awk to do what I wanted (particularly I18n)
was going to be an effort comparable to writing what I needed from scratch,
and so I was back at my original position. (If I'm wrong, please let me
know).
It seems that there are a couple of obvious ways to attack the general
performance of ORO's Perl regex support: straightforward performance tuning,
and
"re-architecture for performance". The problem for me is that these are the
two exercises that require the most intimate knowledge of the code. While
just about any competent programmer could add a feature or fix a specific
bug, doing the tasks we're talking about here, especially on a code base
that's this complicated, really takes someone who knows the code inside and
out. So I'm trying to find a way to incentivize the right people to do this
work, so as to avoid going in there with a chainsaw, mucking it up, and
crying myself to sleep every night. But it's not that I'm lazy or
unskilled. I'd be happy to do whatever support, analysis, testing,
profiling, etc. would be helpful. And if necessary, I can certainly fire up
the chainsaw with a little guidance.
I am assuming based on the messages I've seen on this list that this
code has been profiled using a profiling tool. If not, that would be an
excellent start. I'd be happy to provide the test data, or even a complete
test harness. We used the Borland OptimizeIt product on our project and had
very good results (we were up and running in literally 10 minutes without
looking at the docs). They have a 15-day free trial, so we downloaded it
and tuned 16 to 18 hours a day for 13 days straight, reaching the point of
diminished returns before the trial was up (we will presumably have to drop
the $1,500 the next time we do this, as we have at least one more round of
tuning before we ship). I did spend some time looking at ORO
with OptimizeIt, but not knowing the code, it wasn't really clear how I
should interpret the results or what I might do about them ("gee, it
certainly is spending all its time recursed 8 or 10 levels into match( )"
was the extent of my analysis).
Another approach would be for me to report specific expressions that I
think perform worse than they "should", based on my understanding of how an
NFA regex works. I'm not sure that addressing these will get to the
performance that we need to meet our benchmark criteria, but it would be
localized, small increments of work (at least to analyze them), and it might
get us there. For example, I have noticed that an expression in the form
"\b(foo|bar)\b" is significantly faster than one in the form
"(\bfoo\b|\bbar\b)", especially as the number or complexity of terms
increases. Clearly in a DFA these expressions are identical, but even with
an NFA I would expect only a small difference (I am seeing more like 2x
slower for the latter case).
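A minimal harness for reproducing this comparison might look like the following (java.util.regex shown only to keep it self-contained; the same two pattern forms apply under ORO). Timings vary too much by JVM and input to state a number here, so the harness just exposes both patterns and a crude repetition timer:

```java
import java.util.regex.Pattern;

// The two forms above are logically equivalent; only their cost differs.
public class BoundaryForms {
    public static final Pattern FACTORED =
            Pattern.compile("\\b(foo|bar)\\b");
    public static final Pattern DISTRIBUTED =
            Pattern.compile("(\\bfoo\\b|\\bbar\\b)");

    // Nanoseconds to run find() reps times. A real benchmark would warm
    // up the JIT first and use a larger, more realistic corpus.
    public static long timeFind(Pattern p, String text, int reps) {
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            p.matcher(text).find();
        }
        return System.nanoTime() - start;
    }
}
```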
Another example is that in searching for lists of expressions we
initially just or'ed them all together, as in: "(exp1|exp2|expn)". This
produced much better results than searching the entire target for each
expression individually, though not as much better as I expected. The real
problem is that when using complex expressions (like the SpamAssassin
expressions), this method actually performed more than twice as slowly as
searching for each term individually, which really surprised me. I suppose
that joining the complex expressions together must compound the cost somehow
(more backtracking across the alternatives?), but clearly I don't understand
and can't explain this behavior.
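The two strategies being compared can be sketched like this (again with java.util.regex purely for a self-contained illustration; the class and method names are mine). For a simple "does anything match" test the answers are identical, and which is faster depends heavily on the complexity of the individual expressions:

```java
import java.util.List;
import java.util.regex.Pattern;

// One combined alternation "(exp1|exp2|...)" versus one pass per
// expression over the same target text.
public class ListSearch {
    public static boolean anyMatchCombined(List<String> expressions,
                                           String text) {
        // Caveat: joining shifts capture-group numbering, so expressions
        // containing backreferences cannot simply be combined this way.
        String joined = "(?:" + String.join("|", expressions) + ")";
        return Pattern.compile(joined).matcher(text).find();
    }

    public static boolean anyMatchIndividually(List<String> expressions,
                                               String text) {
        for (String expression : expressions) {
            if (Pattern.compile(expression).matcher(text).find()) {
                return true;
            }
        }
        return false;
    }
}
```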
So I could start submitting examples like this to bugzilla, but I'm a
little reluctant to do so unless the developers really want them (in part
because a significant number of the responses could very well be in the
form: "hey bozo, the behavior that you observed is consistent with the
complexity of the problem, i.e. the nature of the beast"). Anyway, I guess
I'd appreciate some direction from the ORO brain trust here.
The second major approach would be to look at some significant changes
to the way ORO works in order to produce something that is significantly
faster. Others have probably thought a lot about this, and they are in a
much better position to implement it than I am. One idea that I had was
to investigate the feasibility of using subset construction to convert the
NFA into a DFA (either all the time, or perhaps only in those cases where
there are no backreferences, or only until the DFA blows out to some
specified maximum size). While this approach probably wouldn't yield
performance quite as good as a pure performance-oriented DFA, it might very
well be close and does have the big advantage of being incremental (of
course, you never know until you try).
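To make the idea concrete, here is a toy sketch of subset construction over a small hand-built NFA. This is textbook material, not ORO code; a real integration would need character classes, a cap on state blow-up, and a fallback path for backreferences:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy subset construction: convert an epsilon-NFA into a DFA table
// eagerly, at "compile" time, so matching is one table lookup per
// character instead of a per-match NFA simulation.
public class SubsetConstruction {
    // NFA: state -> (char -> next states), plus epsilon moves and accepts.
    final Map<Integer, Map<Character, Set<Integer>>> trans = new HashMap<>();
    final Map<Integer, Set<Integer>> eps = new HashMap<>();
    final Set<Integer> accepts = new HashSet<>();
    int start;

    // Epsilon-closure of a set of NFA states.
    Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> out = new HashSet<>(states);
        while (!work.isEmpty()) {
            for (int t : eps.getOrDefault(work.pop(), Set.of())) {
                if (out.add(t)) {
                    work.push(t);
                }
            }
        }
        return out;
    }

    // All NFA states reachable from 'states' on character c.
    Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> out = new HashSet<>();
        for (int s : states) {
            out.addAll(trans.getOrDefault(s, Map.of())
                            .getOrDefault(c, Set.of()));
        }
        return out;
    }

    // Build the complete DFA table up front; each DFA state is a set of
    // NFA states, and missing entries are the implicit dead state.
    Map<Set<Integer>, Map<Character, Set<Integer>>> buildDfa(
            Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(closure(Set.of(start)));
        while (!work.isEmpty()) {
            Set<Integer> state = work.pop();
            if (dfa.containsKey(state)) {
                continue;
            }
            Map<Character, Set<Integer>> row = new HashMap<>();
            for (char c : alphabet) {
                Set<Integer> next = closure(move(state, c));
                if (!next.isEmpty()) {
                    row.put(c, next);
                    work.push(next);
                }
            }
            dfa.put(state, row);
        }
        return dfa;
    }

    // Convenience: build the DFA and run it over the whole input.
    public boolean matchesEager(Set<Character> alphabet, String input) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa =
                buildDfa(alphabet);
        Set<Integer> cur = closure(Set.of(start));
        for (char c : input.toCharArray()) {
            cur = dfa.getOrDefault(cur, Map.of()).get(c);
            if (cur == null) {
                return false;
            }
        }
        for (int s : cur) {
            if (accepts.contains(s)) {
                return true;
            }
        }
        return false;
    }
}
```

The "specified maximum size" idea would fit naturally in buildDfa: stop and fall back to the NFA engine once the dfa map passes some state-count threshold.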
So that is everything I have to say on the matter (finally). I would
appreciate any feedback, specifically in the area of what I can do to help
(preferably without my chainsaw).
Bob Dickinson
Brute Squad Labs, Inc.
http://www.brutesquadlabs.com
--
To unsubscribe, e-mail: <mailto:oro-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:oro-dev-help@jakarta.apache.org>