Re: Interesting little regex

Uri Guttman Thu, 23 Feb 2006 15:59:23 -0800

>>>>> "AY" == Alan Young <[EMAIL PROTECTED]> writes:

  AY> Updated script at bottom.
  AY> On 2/23/06, Uri Guttman <[EMAIL PROTECTED]> wrote:
  AY> $text =~ s{(
  AY> (\b\w+(?:['-]+\w+)*\b)
  >> 
  >> why the multiple ['-] inside the words? could those chars ever begin or
  >> end words? so just [\w'-]+ should be fine there.

  AY> It's possible to have multi-hyphenated words.  I didn't think it was
  AY> worth the time to figure out how to handle that and single apostrophe
  AY> words at the same time.  Besides, I'm not verifying the accuracy of
  AY> the text.

  AY> In the spirit of testing though, I changed it to (\b[\w'-]*\b) and it
  AY> took 40 seconds and found 's and ' as words where the original did
  AY> not.

no wonder it took so long. you matched the null string between each pair
of word boundaries. you need a +, not * there.
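
a minimal sketch of the difference, assuming a made-up sample string (the variable names are illustrative and not from the original script). with * the class may match zero characters, so the pattern also succeeds with an empty match at a word boundary; with + every match has at least one character.

```perl
use strict;
use warnings;

# hypothetical sample text, not from the original script
my $text = "it's a test";

# '*' allows zero characters, so empty matches at word boundaries come back too
my @star = $text =~ /(\b[\w'-]*\b)/g;

# '+' requires at least one character, so only the words come back
my @plus = $text =~ /(\b[\w'-]+\b)/g;

my $empty = grep { $_ eq '' } @star;
printf "star: %d matches, %d of them empty\n", scalar @star, $empty;
printf "plus: %d matches (%s)\n", scalar @plus, join ', ', @plus;
```

on a large text those extra empty matches add up, which fits the slowdown reported above.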

  AY> This is the way I understand it:

  AY> (??{<code>}) replaces the regex at the current pos() with the result
  AY> of the <code> block.

  AY> If the match ($^N) was not in the hash, then it would auto-vivify
  AY> the key and increment it and return (?!) which is a negative lookahead
  AY> on nothing, which always fails so we force it to backtrack and try
  AY> again.

  AY> If the match ($^N) is in the hash, then it increments the value and
  AY> returns (?=) which is a positive lookahead on nothing, which always
  AY> succeeds so we continue on.

i understand the boolean thing as i said previously. i was asking why
you used it there. i see no reason if all you are doing is word
counting. 
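
for reference, a small sketch of the mechanism described in the quoted text, not the original script. the sample text, the %seen hash and the @repeat array are assumptions made for illustration. the code block returns (?!) for a word not yet in the hash, which fails and forces the engine to move on, and (?=) for a word already seen, which lets the match succeed.

```perl
use strict;
use warnings;

# package variable so the embedded code block can see it on any perl version
our %seen;

# hypothetical sample text
my $text = "the cat and the hat and the bat";
my @repeat;

# (??{ ... }) runs the code and uses its return value as a subpattern:
# '(?=)' always succeeds, '(?!)' always fails and forces backtracking.
# $^N holds the text of the most recently closed group, here the word.
while ($text =~ m{(\b\w+\b)(??{ $seen{$^N}++ ? '(?=)' : '(?!)' })}g) {
    push @repeat, $1;
}

print "matched (already seen): @repeat\n";
print "$_ => $seen{$_}\n" for sort keys %seen;
```

the hash ends up with a count for every word either way, so the forced failure only changes which words the match reports, not the totals.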

  AY> Changing the regex to

  AY>   1 while $text =~ m{(
  AY>             (\b\w+(?:['-]+\w+)*\b)
  AY>             (?{!$unique{$^N}++})
  AY>            )
  AY>           }xg;

  AY> dropped the time down to 3s.

  >> since you just replace the word by itself, why use s///? m// will get
  >> the same results and should be much faster.

  AY> There was no appreciable difference between the two types of regexes
  AY> (see my code below).

try this:

        $unique{$1}++ while $text =~ m/([\w'-]+)/g ;

use the Benchmark module to compare the speeds. make sure you don't do
destructive parsing which some of your examples seem to do.
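
a rough sketch of such a comparison, using cmpthese from the core Benchmark module. the sample text and the sub names count_simple and count_codeblock are made up for illustration; a real test would read the actual input. neither sub modifies $text, so every run parses the same data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# hypothetical sample text; a real test would use the actual input file
my $text = join ' ',
    ("the quick brown fox isn't a well-known jack-in-the-box") x 200;

our %unique;    # package variable so the embedded code block can see it

# the plain counting loop
sub count_simple {
    my %count;
    $count{$1}++ while $text =~ m/([\w'-]+)/g;
    return {%count};
}

# the code-block version quoted earlier in the thread
sub count_codeblock {
    local %unique;
    1 while $text =~ m{(
                (\b\w+(?:['-]+\w+)*\b)
                (?{ !$unique{$^N}++ })
               )
              }xg;
    return {%unique};
}

# run each sub for at least one CPU second and print a comparison table
cmpthese(-1, {
    simple    => \&count_simple,
    codeblock => \&count_codeblock,
});
```

the two patterns are not identical (the simple class also accepts a leading or trailing ' or -), so check that both give the same counts on your data before trusting the timing.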

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org
