On Friday 10 July 2009 06:01:27 Steve Tolkin wrote:
> Wow! Thanks Bernardo. The technique you proposed is the most eye-opening
> use of Perl I have seen in a long while. I woke in the middle of the night
> and searched through all my Perl books. None mention this: not Algorithms
> with Perl, or Advanced perl, or Object Oriented Perl. (I do not own the
> Mastering Regular Expressions book. Does that discuss this approach?)
"MRE" (Meal Ready to Eat?) discusses many techniques to build regular
expressions, and has a final 10-page section where Friedl builds a 4724-byte
regex to match email addresses (not at all as easy a task as it may seem at
first sight). That expression is smaller but more complex than a mere
alternation of a few thousand strings. I only have the first edition, so I
don't know if in later ones he may have discussed alternations more
specifically.
If you are into complex grammars (either for structured data or for natural
languages) a common thing is to build them up through intermediate variables,
each containing small regex pieces. That way you can end up with quite complex
regexes that are still manageable.
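For example (a made-up date grammar, purely to show the shape), qr// lets each
intermediate variable hold a compiled piece that interpolates cleanly into
larger patterns:

```perl
use strict;
use warnings;

# Small named pieces...
my $day   = qr/(?:0[1-9]|[12][0-9]|3[01])/;
my $month = qr/(?:0[1-9]|1[0-2])/;
my $year  = qr/(?:19|20)[0-9]{2}/;

# ...composed into a larger pattern that stays readable.
my $date = qr/$year-$month-$day/;

print "match\n" if '2009-07-10' =~ /\A$date\z/;
```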
>
> In my (literally dusty) Perl Cookbook section 6.17 "Expressing AND, OR, and
> NOT in a Single Pattern" shows the use of a simple pattern /ALPHA|BETA/ but
> then p. 198-9 says <quote>
> If you want to see if either of two patterns matches:
> if ($string =~ /pat1/ || $string2 =~ /pat2/ ) {something() }
> In short, use Perl's normal Boolean connectives to combine regular
> expressions, rather than doing it all within a single pattern.</quote>
>
> That advice is misleading, and ought to be revised. Pasting together a
> pattern with pipes is a good idea, and widely known and used. However,
> Bernardo Rechea showed an approach that was totally new to me: pasting
> together a very large set of alternative strings as the input to a pattern.
> This can be the most efficient way to search.
Not so fast. That advice is actually good for versions of Perl < 5.10. They
don't have the automatic trie optimization, so large alternations have
traditionally been very slow. I stumbled into that a few years ago when, as
part of a set of rewrite rules, I needed to match a ~10 kword dictionary
against many documents (~1.25 billion words). The whole thing was taking days
to run, so I had to code my own function to convert a list of literals into a
trie-like regex. That did the trick admirably and gave blazingly fast matches
(it ran in under one hour).
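My original function isn't shown here, but the idea can be sketched like this:
fold the word list into a trie, then walk it and emit a regex with common
prefixes factored out (names and details below are mine, not the original
code):

```perl
use strict;
use warnings;

# Build a hash-of-hashes trie from a list of literal words; the '' key
# marks "a word may end here".
sub trie_regex {
    my @words = @_;
    my %trie;
    for my $w (@words) {
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $w;
        $node->{''} = 1;
    }
    return _emit(\%trie);
}

# Recursively turn a trie node into a regex fragment.
sub _emit {
    my ($node) = @_;
    my ($end, @alts) = (0);
    for my $c (sort keys %$node) {
        if ($c eq '') { $end = 1; next }
        push @alts, quotemeta($c) . _emit($node->{$c});
    }
    return '' unless @alts;
    my $re = (@alts > 1 || $end) ? '(?:' . join('|', @alts) . ')' : $alts[0];
    $re .= '?' if $end;    # a word may stop at this node
    return $re;
}

my $re = trie_regex(qw(foo foot fool bar));
print "matched\n" if 'fool' =~ /\A(?:$re)\z/;
```

The shared prefix "foo" is emitted only once, so the engine never re-scans it
for each alternative — which is essentially what the 5.10 trie optimization
does for you automatically.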
Another thing to keep in mind is that the trie works only for alternations of
literal strings, not of generic patterns. I suppose that if you intermix
patterns with literals the engine will break the thing into several tries
interleaved with other patterns, which can still be much more efficient than a
pure alternation.
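One way to see what the engine actually did (on 5.10 and later) is to compile
the pattern under the core re pragma's debug mode; a pure literal alternation
shows up as a TRIE node in the compiled-program dump on STDERR:

```perl
use strict;
use warnings;
use re 'debug';   # dumps regex compilation (and matching) to STDERR

# On perl >= 5.10 the dump for this pattern contains a TRIE node; mix a
# non-literal branch into the alternation and watch the dump change.
my $matched = "foo" =~ /foo|bar|baz/;
```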
Also, last night I realized that for my "semi-brute force" approach I hadn't
used the /o modifier on the match operator. With it, that approach becomes
~10x faster, and ~20% faster than the pure regex approach... Interesting, eh?
Using the loop can be useful if your on-match logic is more complex than
merely returning the matched string, because it's nicer to deal with complex
stuff in the body of an if statement than in a (?{ code }) embedded-code
block.
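A sketch of that loop style (my own toy example; note that qr// precompiles
the big alternation once up front, which on modern perls removes the need for
/o):

```perl
use strict;
use warnings;

# Precompile the alternation once; the loop body is then free to do
# arbitrarily complex per-match work.
my @terms = qw(apple banana cherry);
my $re    = do { my $alt = join '|', map quotemeta, @terms; qr/\b($alt)\b/ };

my %count;
for my $line ("apple pie", "banana split", "apple cider") {
    while ($line =~ /$re/g) {
        $count{$1}++;    # as complex as you like here
    }
}
print "$_: $count{$_}\n" for sort keys %count;
```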
> If this technique does not
> already have a name I propose that it be called "Rechea's regex".
I'm not sure that a simple use of "join '|'", however gargantuan the resulting
regex, is worthy of a name... But I am sure glad you are so enthusiastic about
it :-D
Bernardo
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm