On Friday 10 July 2009 06:01:27 Steve Tolkin wrote:
> Wow! Thanks Bernardo.  The technique you proposed is the most eye-opening
> use of Perl I have seen in a long while.  I woke in the middle of the night
> and searched through all my Perl books.  None mention this: not Algorithms
> with Perl, or Advanced perl, or Object Oriented Perl.  (I do not own the
> Mastering Regular Expressions book. Does that discuss this approach?)

"MRE" (Meal Ready to Eat?) discusses many techniques to build regular 
expressions, and has a final 10-page section where Friedl builds a 4724-byte 
regex to match email addresses (not at all as easy a task as it may seem at 
first sight). That expression is smaller but more complex than a mere 
alternation of a few thousand strings. I only have the first edition, so I 
don't know whether later editions discuss alternations more specifically.

If you are into complex grammars (either for structured data or for natural 
languages) a common thing is to build them up through intermediate variables, 
each containing small regex pieces. That way you can end up with quite complex 
regexes that are still manageable.
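For example, a date matcher might be assembled from named pieces like this (the 
names and sub-patterns here are my own toy illustration, not from any of the 
books mentioned):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build small qr// pieces, then compose them into a larger regex.
# qr// wraps each piece in its own group, so alternations inside a
# piece stay self-contained when interpolated.
my $year  = qr/\d{4}/;
my $month = qr/0[1-9]|1[0-2]/;
my $day   = qr/0[1-9]|[12]\d|3[01]/;
my $date  = qr/$year-$month-$day/;

print "match\n" if '2009-07-10' =~ /^$date$/;
```

Each intermediate variable stays small enough to test on its own, which is what 
keeps the final regex manageable.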

>
> In my (literally dusty) Perl Cookbook section 6.17 "Expressing AND, OR, and
> NOT in a Single Pattern" shows the use of a simple pattern /ALPHA|BETA/ but
> then p. 198-9 says <quote>
> If you want to see if either of two patterns matches:
>     if ($string =~ /pat1/ || $string2 =~ /pat2/ ) {something() }
> In short, use Perl's normal Boolean connectives to combine regular
> expressions, rather than doing it all within a single pattern.</quote>
>
> That advice is misleading, and ought to be revised.  Pasting together a
> pattern with pipes is a good idea, and widely known and used.  However,
> Bernardo Rechea showed an approach that was totally new to me: pasting
> together a very large set of alternative strings as the input to a pattern.
> This can be the most efficient way to search.
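For the record, the technique boils down to something like the following sketch 
(word list and variable names invented for illustration). quotemeta() keeps any 
metacharacters literal, and sorting longest-first makes the engine try longer 
literals before their prefixes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join many literal strings into one big alternation.
my @words = qw(foo foobar baz qux);
my $pat = join '|',
          map quotemeta,
          sort { length $b <=> length $a } @words;
my $re = qr/\b(?:$pat)\b/;

# In list context, m//g with no capture groups returns every match.
my $text = 'a foobar and a qux';
my @hits = $text =~ /$re/g;
print "@hits\n";   # foobar qux
```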

Not so fast. That advice is actually good for versions of Perl before 5.10. 
They don't have the automatic trie optimization, so large alternations have 
traditionally been very slow. I stumbled into that a few years ago when, as 
part of a set of rewrite rules, I needed to match a ~10,000-word dictionary 
against many documents (~1.25 billion words). The whole thing was taking days 
to run, so I had to code my own function to convert a list of literals into a 
trie-like regex. That did the trick admirably and gave blazingly fast matches 
(it ran in under one hour).
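My actual function belonged to that old codebase, but a toy reconstruction of 
the idea (all names mine) could look like this: build a prefix tree from the 
word list, then emit a regex that shares the common prefixes so the engine 
never re-scans them:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a list of literal words into a prefix-sharing regex.
sub trie_regex {
    my @words = @_;
    my %trie;
    for my $w (@words) {
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $w;
        $node->{''} = undef;               # mark end-of-word
    }
    return _node_to_re(\%trie);
}

sub _node_to_re {
    my ($node) = @_;
    my (@alts, $end);
    for my $c (sort keys %$node) {
        if ($c eq '') { $end = 1; next }
        push @alts, quotemeta($c) . _node_to_re($node->{$c});
    }
    my $re = @alts > 1 ? '(?:' . join('|', @alts) . ')'
           : @alts     ? $alts[0]
           :             '';
    $re = "(?:$re)?" if $end && $re ne '';  # word may also end here
    return $re;
}

my $re = trie_regex(qw(cat car card));     # ca(?:r(?:d)?|t)
print 'card' =~ /^$re$/ ? "ok\n" : "no\n";
```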

Another thing to keep in mind is that the trie optimization works only for 
alternations of literal strings, not of generic patterns. I suppose that if you 
intermix patterns with literals, the engine will break the alternation into 
several tries interleaved with other patterns, which can still be much more 
efficient than a pure alternation.

Also, last night I realized that for my "semi-brute force" approach, I hadn't 
used the 'o' modifier on the match operator. With it, that approach becomes 
~10x faster, and ~20% faster than the pure-regex approach... Interesting, eh?
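The /o effect in miniature (toy data; on recent perls a precompiled qr// held 
in a variable achieves much the same thing): without /o, a match whose pattern 
contains an interpolated variable may be recompiled on every execution, while 
/o compiles it once and reuses it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pat = join '|', qw(alpha beta gamma);

my $n = 0;
for my $line ('x alpha y', 'beta here', 'nothing') {
    $n++ if $line =~ /\b(?:$pat)\b/o;   # compiled once, reused
}
print "$n\n";   # 2
```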

Using the loop can be useful if your on-match logic is more complex than merely 
returning the matched string, because it's nicer to deal with complex stuff in 
the body of an if statement than in a (?{ code }) block embedded in the regex.
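Something like this sketch (data and logic invented for illustration): each 
pattern is precompiled once with qr//, and the per-match bookkeeping lives in 
an ordinary if body:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One qr// per word; the capture group hands the matched string
# to the if body, where arbitrary logic is easy to write.
my @res = map qr/\b(\Q$_\E)\b/, qw(foo bar baz);

my $doc = 'one bar two baz';
my %seen;
for my $re (@res) {
    if ($doc =~ $re) {
        $seen{$1}++;   # stand-in for more complex on-match work
    }
}
print join(',', sort keys %seen), "\n";   # bar,baz
```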

> If this technique does not
> already have a name I propose that it be called "Rechea's regex".

I'm not sure that a simple use of "join '|'", however gargantuan the resulting 
regex, is worthy of a name... But I am sure glad you are so enthusiastic about 
it :-D


Bernardo

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
